<<

Contents

LIST OF PAPERS ...... 3 LISTS OF ACRONYMS AND ABBREVIATIONS ...... 4 INTRODUCTION ...... 7 BACKGROUND ...... 7 AIMS OF THE THESIS ...... 9 METHODOLOGY ...... 11 QUANTITATIVE STRUCTURE-PROPERTY RELATIONSHIPS (QSPR)...... 11 Linear Free Energy Relationships (LFER)...... 11 Linear Solvation Energy Relationship (LSER)...... 12 Reversed-Phase Liquid Chromatography (RPLC)...... 14 Quantitative Structure-Retention Relationships (QSRR) ...... 14 Global Linear Solvation Energy Relationship (LSER)...... 15 Molecular Descriptors ...... 16 The Definition of the Applicability Domain ...... 17 The Distance-Based Approach Using PLS...... 17 The Williams Plot ...... 18 Associative Neural Networks (ASNN)...... 19 DATA ANALYSIS ...... 20 Multiple Linear Regression (MLR)...... 20 Principal Component Analysis (PCA) ...... 20 Partial Least Squares Projection to Latent Structures (PLS) ...... 21 Neural Networks...... 24 PHYSICOCHEMICAL PROPERTIES...... 27 VAPOR PRESSURE (P) ...... 27

PARTITIONING COEFFICIENT N-OCTANOL/WATER (KOW) ...... 27

WATER SOLUBILITY (SW) ...... 29 MELTING POINT ...... 30 BOILING POINT ...... 30 PERFLUORINATED COMPOUNDS ...... 31 RESULTS AND DISCUSSION...... 33 PREDICTION OF VAPOR PRESSURE ...... 33 COMPARISON OF MODELING OF ENVIRONMENTAL PROPERTIES BETWEEN LSER AND PLS...... 34 Prediction of Partitioning Coefficient (Kow)...... 35 Prediction of Water Solubility (Sw)...... 37 Comparison of Models - LSER and PLS ...... 38

1 EXTENSION OF A PREDICTIVE MODEL FOR VAPOR PRESSURE...... 39 QSPR MODELS OF PERFLUORINATED CHEMICALS (PFCS)...... 40 QSPR Models with Melting Point ...... 41 QSPR Models with Boiling Point ...... 42 Applicability Domain...... 43 QSRR MODELS ...... 44 Global LSER Models ...... 45 QSRR Models Using Semi-Empirical Descriptors...... 46 QSRR Models Using Theoretical Molecular Descriptors ...... 46 CONCLUSIONS...... 49 ACKNOWLEDGEMENTS ...... 51 REFERENCES...... 52

2 List of Papers

This thesis is based on the following papers, referred to in the text by roman numerals.

I. Öberg, T., Liu, T. (2008): Global and local PLS model to predict vapor pressure. QSAR & Combinatorial Science, 27(3), 273-279.

II. Liu, T., Öberg, T. (2009): Modeling of partition constants: linear solvation energy relationships or PLS regression? Journal of , 23(5-6), 254-262.

III. Öberg, T., Liu, T. (2010): Extension of a prediction model to estimate vapor pressure of perfluorinated compounds (PFCs). Submitted to Chemometrics and Intelligent Laboratory Systems.

IV. Bhhatarai, B,. Teetz. W., Liu, T., Öberg, T., Jeliazkova. N., Kochev, N., Pukalov, O., Tetko, I.V., Kovarich, S., Gramatica, P. (2010): CADASTER QSPR models for predictions of melting and boiling points of perfluorinated chemicals. Manuscript.

V. Liu, T., Nicholls, I.A., Öberg, T. (2010): Comparison between theoretical and experimental models for characterizing solvent properties using reversed-phase liquid chromatography (RPLC). Submitted to Journal of Chromatography A.

Paper I and paper II are reprinted with kind permissions of Wiley-VCH Verlag GmbH & Co. KGaA. and John Wiley & Sons Ltd., respectively.

3 Lists of Acronyms and Abbreviations

AD Applicability domain ADME Absorption, distribution, metabolism and excretion AFC Atom/fragment contribution ASNN Associative neural network BP Boiling point CADASTER CAse studies on the Development and Application of in-Silico Techniques for Environmental hazard and Risk assessment CV Cross-validation ECHA European Chemicals Agency EINECS European Inventory of Existing Commercial Chemical Substances EPA Environmental Protection Agency EU European Union FTOH Fluorotelomer alcohols GA Genetic GC Gas chromatography GETAWAY Geometry, topology, and atom-weights assembly GLC Gas-liquid chromatography HBA Hydrogen bond acceptor HBD Hydrogen bond donor HPLC High-performance liquid chromatography HTS High-throughput screening Kow n-Octanol/water LFER Linear free energy relationship LSER Linear solvation energy relationship LSST Linear solvent strength theory MLR Multiple linear regression MP Melting point NIPALS Nonlinear iterative partial least squares OECD Organisation for Economic Co-operation and Development OLS Ordinary-least-squares Pa Vapor pressure PL Liquid vapor pressure PS Solid vapor pressure PAHs Polycyclic aromatic hydrocarbons PCA Principal component analysis PCBs Polychlorinated biphenyls PFAs Perfluorinated acids

4 PFAS Perfluoroalkyl sulfonates PFCA Perfluorinated carboxylic acids PFCs Perfluorinated compounds PFOA Perfluorooctanoate PLS Partial least squares projection to latent structures PRESS Predictive residual sum of square QSAR Quantitative structure-activity relationship QSPR Quantitative structure-property relationship QSRR Quantitative structure-retention relationship R Gas constant REACH Registration, evaluation, authorization and restriction of chemicals RMSEC Root mean square error of calibration RMSEE Root mean square error of estimation RMSEP Root mean square error of prediction RPLC Reversed-phase liquid chromatography RSD Relative standard deviation RSS Residual sum of square SD Standard deviation SMILES Simplified molecular input line entry system SNN Supervised neural networks SOM Self organizing maps SPI Snyder polarity index SS Sum of the square Sw Water solubility T Absolute temperature UNIFAC Universal functional activity coefficient UNN Unsupervised neural networks UV/VIS Ultraviolet/visible VIP Variable importance in the projection WHIM Weighted holistic invariant molecular

5

6 Introduction

Background is creative and productive, generating the substances of high value from inexpesive raw materials. Chemistry and chemical industries help provide comfortable and modern living conditions. At the same time, many chemicals partition to the air, soil and water environments during their manufacture, transport and application. Today, approximately 143,000 substances are pre- registered by the European Chemicals Agency (ECHA), and 100,204 substances are listed in the European Inventory of Existing Commercial Chemical Substances (EINECS), 80,000 of which are thought to be used in the European Union (EU) (Brown, 2003). Thousands of chemicals currently used in the EU have not undergone risk assessment for their effect on the environment or on human health. It is therefore necessary to develop rapid and reliable methods of predicting environmental characteristics; in this thesis, we attempt to predict physicochemical properties based on the principles of chemoinformatics. With an explosion in the amount of data generated from combinational chemistry and high-throughput screening (HTS) in , together with the rapid progression in computing technology, scientists have applied computing skills and knowledge of chemistry to the data analysis and accelerated the development of chemoinformatics (Schofield et al. 2001). The term was first presented at the University of Utrecht in the Netherlands in 1976 (Chen, 2006), and it rapidly became a popular term in the middle of the 1990s with a number of publications that provided excellent overviews of the fields. At the same time, a counterpart term, chemoinformatics, began to appear in chemistry. There is no clear delimitation between bioinformatics and chemoinformatics (Hann and Green, 1999); however, bioinformatics mainly deals with biological data such as DNA sequence, protein structure, metabolic pathway information, clinical trial data, and other larger molecules, while chemoinformatics focuses more on small molecules such as chemical structure, spectral, HTS, medicinal and environmental property data (Bunin et al., 2007). The terms, cheminformatics, chemical informatics, are also used in publications. However, chemoinformatics is the more frequent term in the academic literature (Willett, 2008). There is no true definition of chemoinformatics due to its highly interdisciplinary characteristics (Brown, 2009). Chemoinformatics is the application of informatics tools to chemical problems, together with particular emphasis on the manipulation of chemical structure information (Gasteiger and Engel, 2003; Gasteiger, 2006; Gasteiger, 2006; Leach and Gillet, 2003).

7 Chemoinformatics covers many areas, including molecular modeling, quantitative structure-property relationship (QSPR), data processing and data analysis, visualizations, and database management. Frank Brown gave the first formal definition of chemoinformatics in (Brown, 2005; Brown, 1998): Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge, for the intended purpose of making better decisions faster in the area of drug lead identification and optimization. Whereas Greg Paris put forward a more general definition of chemoinformatics at the meeting of the American Chemical Society in 1999: Chemoinformatics is a generic term that encompasses design, creation, organization, management, retrieval, analysis, dissemination, visualization, and the use of the chemical information. The methods of chemoinformatics have been applied in pharmaceutical research for many years, to search for new lead structures and to optimize them to drug candidates. However, quantitative structure-activity relationship (QSAR) models have found extensive application in environmental research to predict adverse effects of chemicals, and to reliably estimate their characteristics and the persistence of the compounds in the environment. A limitation is that many models lack a definition of the domain of applicability and have not been properly validated. As many chemicals in their manufacture and use show potentially negative impacts on the environment and human health, the basic requirement is that the risk should be eliminated or reduced to an acceptable level. The term green chemistry as an effective approach for the protection of the environment was coined at the US Environmental Protection Agency (EPA) in the 1990s (Horton, 1999), and was defined based on 12 principles by Paul Anastas and John Warner (Anastas and Warner, 1998). Green chemistry is the utilization of chemical principles to reduce or eliminate the use and generation of hazardous substances, solvents, and by-products, and to maximize the desired products. It seeks to reduce and prevent pollution at its source, and it is a highly effective methodology to reduce negative effects on human health and environmental damage. Sustainable development for improving efficiency, economic, and environmental performance encourages the development of green chemistry today (Clark, 1999), for this , green chemistry is also termed sustainable chemistry (Woodhouse and Breyman, 2005). The new European chemical legislation on the Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) improves protection of the environment and human health, while at the same time encouraging scientists to focus on the design and development of safer products using the principles of green chemistry (Clark, 2006).

8 There are various analytical methods for the measurement of pollutant chemicals; as an efficient analytical approach, chromatography is widely used for many polar or nonpolar chemicals. At the beginning of 20th century Russian botanist Mikhail Semyonovich Tswett discovered chromatography (Ettre and Sakodynskii, 1993; Ettre, 1996). The work of Archer John Martin and Richard Laurence Millington Synge represented a significant milestone, and they received the Nobel Prize in 1952. Since then, the technology has advanced rapidly. Today, chromatography has evolved into a large number of applied methods for available various chemicals, and high-performance liquid chromatography (HPLC) methods account for about two-thirds of all the reported separations due to its exceptional versatility and simplicity of operation (Pool, 2003). The complexity and diversity of chemical data require powerful and diversified data analysis methods. Herman Wold developed multivariate modeling with fixed point and nonlinear iterative partial least squares (NIPALS) around 1964 (Wold, 2001), publishing his first paper in 1975 (Hoskuldsson, 2001). In 1983, the partial least squares projection to latent structures (PLS) model with two blocks X and Y was slightly modified by Svante Wold and Harald Martens to better suit data from the natural sciences (Lindberg et al., 1983). The chemometrics approach extracts relevant chemical information from the data generated from chemical experiments or theoretical molecular descriptors. The main chemical information can be expressed as a mathematical relation (Wold, 1995), in which the multivariate methods principal components analysis (PCA) and PLS based on projection were effectively used for developing the model.

Aims of the Thesis The aims of this thesis are as follows:

I. Development and validation of a global QSPR model for the vapor pressure (Pa) of organic compounds at ambient temperatures, and using PLS regression to predict the distribution and transport of different compounds in the environment; development of the local model to improve the global model, as well as to overcome the problems in nonlinearities (paper I).

II. Development of QSPR models for key environmental properties, e.g., partition coefficients (Kow), water solubility (Sw). Investigation of the similarities and differences between linear solvation energy relationship (LSER) models using semi-empirical descriptors and PLS models using theoretical molecular descriptors. This work focused on comparing

9 model dimensions, structure, predictive power performance, and the questions of local and global applicability (paper II).

III. Updating an existing global QSPR model with a few new compounds representing a specific compound group, perfluorinated compounds (PFCs). The prediction performance was evaluated, and the updated model was subsequently applied to a larger set of environmental relevant PFCs (paper III).

IV. Development of QSPR models for physicochemical properties of perfluorinated compounds (PFCs), e.g., melting point, and boiling point. Several linear, nonlinear and fragment based approaches were used for QSPR modeling. The predictive performance was verified on a blind external validation set. Different modeling approaches were applied and compared (paper IV).

V. Development of experimental and theoretical quantitative structure retention relationship (QSRR) models for characterizing solvent properties in three different stationary phases in reversed-phase liquid chromatography (RPLC). The similarities and differences of the QSRR models based upon empirical, semi-empirical and theoretical molecular descriptors were compared and optimized. The retention factors of new solvents can be accurately predicted by the models (paper V).

10 Methodology

Quantitative Structure-Property Relationships (QSPR) The chemical molecular descriptors can be used to form a mathematical relationship with quantitatively biological activities as a certain response, and the mathematical expression can be used to predict the biological response from other chemical descriptors. This mathematical expression is referred to as a quantitative structure-activity relationship (QSAR). When a mathematical relationship is established between the chemical molecular descriptors and physicochemical properties, this is termed a quantitative structure-property relationship (QSPR). Once the model is established, the physicochemical properties for new chemicals can be predicted by the developed model. In the latter part of nineteenth century, scientists focused mainly on the relationship between chemical structure and bioactivity (Crum Brown and Frazer, 1868). In 1935, Hammett developed a relationship between the rate of a reaction and the equilibrium constant of a group of related reactions, and pointed out a certain limitation to its application (Hammett, 1935). Taft created an approach to separate polar, steric, and resonance effects and introduced the first steric parameter, Es (Taft, 1952). Building on these, the pioneering work of Hansch and coworkers established in 1962 a QSAR model for a series of plant growth regulators with hydrophobic and electronic properties (Hansch et al., 1962). Since then, the QSAR with the effects of substitutents on the biological response of the molecules described by electronic, steric, and hydrophobic properties has begun to be applied to agrochemistry, pharmaceutical chemistry, and toxicology (Hansch, 1969; Kurup et al., 2001). The application of QSAR to environmental chemistry and toxicology increased steadily in the 1980s as a result of the sudden awareness of the huge number of compounds present in the environment (Dunn, 1989). Linear Free Energy Relationships (LFER) Physicochemical properties have been used to predict environmental partition and transport of organic compounds. The principle behind the most common approaches for predicting partition constants is to express the free energy of transfer. The effects of enthalpy (ΔH) and entropy (ΔS) resulting from changes in molecular interactions contribute to the free energy of transfer:

12 i Δ=Δ 12 i − Δ12STHG i (1)

11 The intermolecular forces govern the partition of organic compounds between different phases, and Δ12Gi as free energy of transfer between different phases can predict partition constants or partition coefficients at equilibrium. The partition constant is defined as (Schwarzenbach et al., 2003):

Δ− 12 i / RTG K12 ⋅= econstant (2)

where R is the general gas constant, the unit is 8.314510 J·mol-1K-1, and T is absolute temperature. The logarithm of an equilibrium constant under constant temperature and pressure is proportional to a standard free energy of transfer, and the logarithm of a rate constant forms a linear function of the free energy of activation. LFER expresses a linear relationship between the logarithm of a rate constant or equilibrium constant for one series of reactions and the logarithm of the rate constant or equilibrium constant for a related series of reactions. The pioneering works on this relationship include the following: the Brønsted relation established protonation of bases through the de-protonation of acids, the Hammett equation solved the effect of a substituent in aromatic molecules (Hammett, 1937), and the Taft relation established polar and steric substituent for aliphatic molecules (Taft, 1952). The basic assumption of LFER in this thesis is that the van der Waals force and polar interactions mainly contribute to the free energy of transfer:

vdW polar 12 i 12 i Δ+Δ=Δ 12GGG i (3)

vdW Where Δ12Gi encompasses dispersive, dipole-dipole, and dipole-induced dipole contributions. Hydrogen bonding donor and hydrogen bonding polar acceptor descriptors contribute to quantification of Δ12Gi . Linear Solvation Energy Relationship (LSER) LSER is a specific subset of a thermodynamic LFER. The LSER equation expresses the application of solvent parameters in linear or multiple linear regression (MLR), and it shows the solvent effect on the reaction rate or equilibrium constant of a chemical reaction. Kamlet and Taft first devised an α scale of solvent hydrogen bond donor (HBD) acidity in 1976 (Taft and Kamlet, 1976) and a β scale of solvent hydrogen bond acceptor (HBA) basicity (Kamlet and Taft, 1976), then developed a π* scale of solvent dipolarity/polarizability, generating a linear relationship between an equilibrium constant or reaction rate and a parameter α, β, π*. These are well known as Kamlet-Taft solvatochromic parameters (Kamlet et al., 1977). Kamlet-Taft scales were generated from a solvotochromic comparison approach, and are not related to any thermodynamic properties. In order to construct a correlation equation that has a sound physical interpretation, it is crucial that the various descriptors should be related to Gibbs free energy. To

12 meet this requirement, Abraham devised the construction of scales of hydrogen bond acidity and hydrogen bond basicity with 1:1 hydrogen bond complexation constants in an inert solvent, tetrachloromethane, together with other three parameters, and developed the equation of the linear solvation energy relationship (LSER) (Abraham, 1993a). In the most recent work, the symbolic representation of the LSER model is commonly presented as follow (Abraham et al., 2004):

log += + + + + vVbBaAsSeEcSP (4)

where, SP is a physicochemical property of a series of solutes in a given system. If SP as a response represents a biological property for a series of solutes, this mathematical expression becomes a QSAR equation. E is an excess molar refraction of the compounds, and is derived from the refractive index function; E is the polarizability of a solute above that of a n- alkane at the same molar volume, and can be easily computed as (Abraham et al., 2004):

= ()MRE x − Vx + 52553.083195.2 (5)

where Vx is a molar volume, MRx is the molar refraction of a solute, and E expresses an indication of the interaction of a compound between solvent and solute via π and n electron pairs. The unit of E is cm3mol-1/10. The value of E can be negative for compounds which are less polarizable than n-alkanes, e.g., fluorocarbons and organosilicon compounds. S is molecular dipolarity/polarizability; the values of non-polar compounds can be generated from gas-liquid chromatography (GLC), and the values of polar compounds can be obtained from water-solvent partition coefficients. The value of S can be negative for the compounds that are less polarizable than n-alkanes, e.g., fluorocarbons and organosilicon compounds. A is hydrogen bond donor (HBD) or hydrogen bond acidity; HBD describes the donation of a proton to a neighboring molecule. Initially, HBD descriptors for mono-acids can be directly generated from hydrogen bond complexation constant (Abraham et al., 1989). B is hydrogen bond acceptor (HBA) or hydrogen bond basicity; HBA describes the acceptance of a proton from a neighboring molecule. HBA descriptors for mono-bases were originally obtained from hydrogen bond complexation constants (Abraham et al., 1990); further values for bases can be generated from partition coefficients. So far we have obtained several hundred HBA values (Abraham, 1993a). And the values of further compounds for S, A and B can be generated from chromatography measurement (Poole et al., 2009). V is a molar volume, termed the McGowan volume, which can be calculated from atomic fragments and the numbers of the bonds in a molecule,

13 that is, by summing up all the atom volumes and subtracting 6.56 cm3mol-1 for each bond, no matter whether single, double, or triple bond (Abraham and McGowan, 1987). [ ()AAC ( 156.6 +−− RN )] V = ∑ g (6) 100

where AAC is all the atom contribution, N is the total number of atoms, Rg is the total number of ring structures (Zhao et al., 2003). The unit of V is cm3mol-1/100. Reversed-Phase Liquid Chromatography (RPLC) Compared to gas chromatography (GC) liquid chromatography is used far less for physicochemical properties, insufficient knowledge of the true composition of the stationary phase and a lack of quantitative models for the accurate description of retention are the principal reason. However, as the most popular of the liquid chromatographic method, reversed-phase liquid chromatography (RPLC) is widely used to include all separation that employs a nonpolar stationary phase and a polar mobile phase. The stationary phase is a liquid film coated on a packing material consisting of 3-10 μm porous silica particles, and the stationary phase is covalently bound to the silica particle. The bonded stationary phase is attached by reacting the silica particles with organochlorosilane Si(CH3)RCl, in which R is a straight alkyl group, such as C18H37 or C8H17. The presence of water in the mobile phase plays an important role in establishing the hallmark feature of separation (Poole, 2003). A decrease in polarity of the mobile phase or an increase of the volume fraction of organic solvent in an aqueous mixture, leads to a decrease in retention. Retention prediction and selectivity are crucial for the development of separation. Selectivity can be varied in many ways, such as, the type of organic modifier, volume fraction, or the type of stationary phase. A free energy transfer of a solute between two phases can be described by LSER. Quantitative Structure-Retention Relationships (QSRR) Quantitative structure-retention relationship (QSRR) resulted from applying the methodology used for the studies of QSAR to chromatographic data, and the first publications were appeared in 1977 (Kaliszan, 1977; Kaliszan and Foks, 1977). Chromatographic retention factors are linearly related to the free energy transfer associated with the chromatographic distribution process (Kaliszan, 1992). Chromatography can readily yield a large number of accurate, precise and reproducible data for large sets of structurally diversified compounds. If the operational conditions are kept constant, solute structures as independent variables are linearly related with retention factors, this relationship is defined as QSRR.

14 When chemometrics approaches were introduced into QSRR analysis, much effort has been devoted to search for predictive solvent characteristics. Global Linear Solvation Energy Relationship (LSER) Retention prediction and selectivity optimization are crucial for rapid method development in RPLC (Snyder et al. 1997). However, retention is also a complex process and depends on many physical and chemical properties within chromatographic system such as, stationary phase characteristics and mobile phase compositions (Poole, 2003). A universal and robust retention model for RPLC has not been developed, but many retention models such as linear solvation energy relationships (LSER) and linear solvent strength theory (LSST) are still widely used. Retention related to the free energy transfer associates with the solute distribution between stationary phase and mobile phase, in which the interactions involve dipole-dipole, hydrogen bonding, and dispersive forces. The retention of structurally diverse solutes under an experimental condition can be modeled using LSER, where descriptors, E, S, A, B, V, represent the ability of the solute to participate in various solute-phase interactions, and their coefficient, e, s, a, b, v, expresses the system response to these interactions. However, LSER model retention as a function of various solvent properties for a single mobile composition. LSST can accurately predict retention for single solvent with multiple mobile compositions (Quarry et al. 1986):

= loglog w − Skk φ (7)

where kw is the theoretical value of k for only water as mobile phase, S is the LSST coefficient, ɸ is the volume fraction in mobile phase composition. The quadratic term ɸ2 is often neglected, since ɸ is smaller than 1. However, with decreasing polarity of the solvent, the relative importance of the quadratic term increases (Jandera et al. 1982). The limitation of LSST model is lack of prediction for new solvents. When LSST and local LSER models are combined into single model, a global LSER model can predict various solvents at multiple mobile compositions (Wang et al., 1999; Wang and Carr, 2002). By adding a quadratic term it is possible to compensate for non-linearities and further improve the predictive power of this combined model (Liu et al., 2010):

2 log 21 φφ +++++++= vVbBaAsSeESSck (8)

The global LSER model provides both of accuracy and precision in retention prediction and will be useful in practice for chromatographic development.

15 Molecular Descriptors The crucial point in the QSPR approach is an appropriate description of the molecular structures. The chemical descriptors take account of the different aspects of the chemical information. The molecular descriptor expresses chemical information transformed and encoded from a molecule and effectively solves chemical, pharmaceutical, and toxicological problems. The advantage of theoretical molecular descriptors is in developing compounds that have never been synthesized or explored experimentally. Molecular descriptors are well known for their ability to establish linear regression relationships with physicochemical and biological properties. Constitutional descriptors can simply obtain the information from the chemical composition of the compounds. Typical constitutional descriptors include molecular weight, total numbers of atoms, bonds, and rings in the molecule. These are one-dimensional (1D) descriptors that can easily be obtained with knowledge of the molecular formula. Unfortunately, the linear relationship between constitutional descriptors and physicochemical properties cannot sufficiently prove that a physical mechanism reflects molecular interactions. Topological descriptors are two-dimensional (2D) descriptors obtained from information on the molecular topology; they express the atomic connectivity in the molecule and can be calculated from 2D graph representation of molecules (Randić, 2001). Geometrical descriptors are derived from three-dimensional (3D) structures of molecules defined by the coordinates of atomic nuclei and the size of the molecules represented (Karelson, 2000). Weighted holistic invariant molecular (WHIM) descriptors contain another prospective class of molecular geometric parameters, WHIM descriptors are obtained from the molecular (x, y, z) coordinates of the molecules with 3D structure, in which the descriptors capture detailed information on molecular size, shape, symmetry and atomic distribution. The derivation of WHIM descriptors can be obtained from centered molecular coordinates using the PCA approach from the weighted covariance matrix of atomic coordinates (Todeschini et al., 1996; Todeschini and Gramatica, 1997). The molecular descriptors, geometry, topology and atom-weights assembly (GETAWAY), encode the geometrical information obtained from the molecular matrix together with the topological information obtained from the molecular graph, as well as information from atomic weights (Consonni et al., 2002). Electronic descriptors reflect the electronic structure of the molecule, based on 3D structure and the charge distribution in the molecule. The descriptors can be obtained from the calculation with ab initio or semi-empirical approaches.

16 The Definition of the Applicability Domain The lack of experimental data for the majority of the organic compounds in commercial use has increased the importance of QSPR to evaluate and predict the reactivity of compounds not yet tested. Before a QSPR model is put into use for screening compounds, its applicability domain (AD) must be defined and predictions for only those compounds that fall in this domain can be considered reliable (Tropsha et al., 2003). Today, the problem of the AD of chemical models has also received considerable . According to REACH, EU requires a clear estimation of the accuracy of developed QSPR models before they can be used within the REACH system. However, there is no universal technique for the definition of AD. AD of a given QSPR model includes the physicochemical domain, the descriptor domain, the chemical domain, which is termed or structural domain, and metabolic domain (Dimitrov et al., 2005). The AD defined based on the OECD guidelines is (Netzeva et al., 2005): The response and chemical structure space in which the model makes predictions with a given reliability The AD was defined using three different methods: PLS, the Williams plot, and associate neural networks (ASNN):

The Distance-Based Approach Using PLS The distance-based methods calculate a distance from the prediction set compounds to the calibration chemicals. The definition of AD using Euclidean and Mahalanobis distance measures was derived from the projections of the multivariate descriptor space (Öberg, 2005), and these two distance was used to determine whether the object was within the AD or not (Öberg, 2004). Euclidean and Mahalanobis distances: the Euclidean distance is the square root of the squared differences between corresponding elements of the rows or columns in the distance matrix.

n 2 2 2 2 (9) () qp 11 xxxxqpd qp 22 xx qnpn ∑ −=−+⋅⋅⋅+−+−= xx qnpn )()()()(, i=1

where d(p,q) is the distance between two points p and q; xpn is the value of point p along axis n; xqn is the value of point q along axis n. The Mahalanobis distance is a weighted Euclidean distance, in which the weighting is determined by the sample variance-covariance matrix. The block distance is the sum of the absolute differences between corresponding elements of the rows or columns in the distance matrix.

2 2 2 () qp 111 xxwxxwqpd qp 222 −+⋅⋅⋅+−+−= xxw qnpnn )()()(, (10)

17 where wn is the weight assigned based on the importance of the nth descriptor in the model. The total weights of the descriptors can be obtained by calculation of the model coefficients with centered and scared data. And therefore, the most important descriptor has a coefficient of 1.0. PLS model was assessed using the residual standard deviation (the Euclidean distance to the PLS model) and the leverage (the Mahalanobis distance within the PLS model space) to that of the calibration objects. From this distance the SIMCA software estimated the probability that the observations belong to the model (PModXPS) in calibration model. And PModXPS+ combined PModXPS and Hotelling’s T2 measured how far outside the acceptable model AD the projection of the observation falls in prediction model (Eriksson et al., 2006), in which weak outliers could be seen in PModXPS, and strong outliers were outside the Hotelling T2 ellipse. Here we use a probability of belongs to the model at the 99% significant level (PModXPS and PModXPS+ < 0.01) as the cut-off criteria to decide if a compound was outside of the model AD. Hotelling’s T2 was named for Harold Hotelling (Hotelling, 1931). The Hotelling T2 for observation based on A components is: A t 2 T 2 = ia (11) i ∑ 2 a=1 sta 2 where Ti is variance of ta according to the class model.

2 2 i ()−∗ (NAANNT −1) (12)

is F distribution with A and N-A degree of freedom, where N is the numbers of observations in the calibration model set. If

2 2 i ( 1) ( ) critical (pFANNNAT =∗−−> 01.0 ) (13)

Then observation i is outside the 99% confidence region of the model. The Hotelling T2 tolerance region for a two dimension score plot of components a and b is an ellipse with axes:

2 2 **1/2 (S ta or b* F2,N-2,a*2(N -1)/(N(N-2)) (14)

The Williams Plot The model apace can be represented using a two dimensional matrix X comprising observations N and variables K. The leverages or Hat values (h) from the matrix representing the distance of the molecule to the model structural space were calculated as:

T T -1 hi = xi (X X) xi (i = 1, 2…, n) (15)

18 where xi is the vector of descriptors of the considered compound and X is the model matrix derived from the training set descriptor values. In linear modeling, the leverage, h, ranges between 1/N and 1. And therefore, the averages (K+1)/N for the N compounds in the data set, the warning leverage h* is expressed as:

∗ += /)1(3 NKh (16)

Leverage values can be calculated for both calibration and prediction compounds. When h < h* for a compound, this compound in the prediction set may belong to AD of the calibration set; conversely, if h > h*, the compound in the prediction set is structurally distant from the calibration compounds, and therefore, it can be considered outside the AD of the model (Gramatica, 2007). The Williams graph, the plot of hat value (h) versus standardized residuals, verifies the presence of response outliers where the compounds with cross- validated standardized residuals larger than 2.5 standard deviation units, and compounds structurally very influential in determining model parameters.

Associative Neural Networks (ASNN) An associative neural network (ASNN) is a combination of an ensemble of the feedforward neural network and the K-nearest neighbors technique, and ASNN can calculate nonlinear models between indices and molecular properties (Tetko, 2002a; Tetko, 2002b). The ASSN model was based on an ensemble of 100 neural networks, and therefore, each model calculated one predicted value for any molecules. These 100 values were used to form a vector Ycalc, this vector corresponded to a new representation of a molecule in the model space for the molecule in both of the calibration and prediction set. For each analyzed molecule, z, in the calibration set with a maximum correlation coefficient between them (Tetko et al., 2006):

= 2 YYRziCorrel ji )(max),( (17)

R2 corresponds to 1-Euclidian distance, this similarity measure corresponded to the minimal Euclidian distance between the molecules in the prediction and calibration set in the apace of activities predicted from the models. As a result, a molecule was determined in the prediction with a maximum correlation with the analyzed molecule in the model space. A cut- off value of R2 = 0.7 was used to define the AD of the ASNN model. And therefore, if the analyzed molecule had R2 > 0.7 at least to one of calibration set molecules in the space of models, it was considered inside of the AD of the models, otherwise, it was considered outside of the AD of the model (Zhu et al., 2008).

19 Data Analysis Multiple Linear Regression (MLR) MLR is a classical approach to model one response variable, Y, as a linear combination with one or more explanatory variables, X. The residual ε represents the deviation between the dependent variable and the independent variables.

yi i i i i xxxxx 55443322110 ++++++= εββββββ ii (18)

The MLR approach is not suitable for analyzing collinear variables. If variables in independent data sets exhibit collinearities, the regression coefficients, β, become unstable and difficult to interpret. The number of observations must exceed the number of variables (Topliss and Edwards, 1979). It is assumed that X-variables are exactly and completely relevant to the response variable. Principal Component Analysis (PCA) PCA is a basic statistical method in multivariate data analysis (Wold et al., 1987). The multivariate projection technique of PCA extracts the systematic variance in the data matrix and effectively used in a covariance analysis between different factors. Karl Pearson first formulated PCA in 1901 (Pearson, 1901), the crucial application of PCA is that the transformation of a vector space reduces multidimensional matrix to lower the dimensionality of the data set. PCA decomposes matrix X so that a score matrix T multiplies a loading matrix P, then plus a residual matrix E:

X = TPT + E (19)

The mathematical principle of PCA is eigen analysis: PCA can find the eigenvectors and eigenvalues of a square symmetric matrix with sum of squares and cross products. An eigenvector of the covariance matrix corresponds to the largest eigenvalue in the same direction, and this eigenvector expresses the first principal component; another eigenvector with the second largest eigenvalue represents the second component that is orthogonal to the first component. The rest can be deduced by analogy: the third component is orthogonal to the first and second component and so on. The number of components is determined by cross-validation (CV) (Wold, 1978).

20 X3 PC1

PC2

X2

X1

Figure 1. The geometric principle of PCA, in which two principal components define a model plane.

Scaling and mean centering are used to standardize the data matrix in the data analysis. Scaling is used to equalize the length of a vector, the length of each coordinate axis in the variable space is set to be the same; one point (in figure 1) is situated in the middle of the point swarm, and this point moving to the origin of the coordinate expresses the mean centering. Two principal components form a plane that is a window of the multidimensional space, and the values of observations projected on the plane give the score. The score and loading can be plotted to overview the relationship between the observations and variables, the overview of PCA shows the grouping of the observations, trends, and outliers (Eriksson et al., 2006). Partial Least Squares Projection to Latent Structures (PLS) PLS is a robust, multivariate statistical extension of MLR, and is an approach for relating two data matrices, X and Y, to each other by a linear multivariate model. It expresses a projection to latent structures by means of partial least squares, and the precision improves with increasing numbers of X variables.

21

PT

T PLS U X Y

W*T CT

New X Predicts Y ???

Figure 2. The model of multivariate calibration and the prediction for new compounds.

PLS decomposes the block matrix X and Y:

X = TPT + E (20) Y = UCT + F (21)

Matrices X and Y are connected by an inner relation, H is the residual matrix:

U = T + H (22)

PLS modeling of the relationship between variable X and response Y can be described by equations (20) and (21); the information related to the observations from X and Y is stored in the score matrices T and U, and the information related to the variables is stored in the loading PT of matrix X and the weight CT of matrix Y; E and F are residuals of X and Y, respectively. PLS simultaneously projects the variables of X and Y onto the same subspace, T, and there is a relation between the position of one observation on the plane of X corresponding to the position on the plane of Y. The scores are calculated for the variation in matrix X and are well correlated with Y. A PLS model is calculated, in which chemical molecular descriptors can predict physicochemical properties as a response. The score T from untested compounds can be calculated from matrix X, and through the previously calculated weights, CT in matrix Y, the new physicochemical properties can be predicted.

22 The relationship between X- and Y-variables: the scores t and u consist of information from the projection of the observations X and Y, respectively; the weight w profile displays the covariation of the X-variables with u; the loading p profile displays the covariation of the X-variables with t; the weight c profile displays the covariation of the Y-variables with t. NIPALS based on can tolerate approximately 20% randomly distributed missing data for small data sets, and moderate amounts of random missing data for larger data sets (Kettaneh et al., 2005). Pre-processing data: if a variable contains one or more extreme values in the data set, these values will dominate over other values and unduly influence the model. After transformation, the distribution of the data becomes fairly symmetrical; for example, logarithm transformations may remedy skewness in the model. In addition, data are generally centered and scaled to unit variance before analysis. Scaling each variable to unit variance by dividing it by its standard deviation (SD) ensures that all variables have an equal opportunity to influence a regression model; centering each variable by subtracting the average value of all the variables facilitates interpretation and removes some numerically unstable factors (Eriksson et al., 2003). Scaling to unit variance and centering assume that each variable has the same weight and same prior importance. The numbers of PLS components: if too many components are used, noise is modeled, and the model becomes “overfitted”; if an insufficient number of components are chosen, part of the relevant variation is missing and the model shows “underfitting” (Martens and Næs, 1996). Cross-validation (CV) is a reliable approach for model option based on the predictability of the models (Brereton, 2007), and it is effectively used to determine the numbers of the dimensionality in the model. CV is performed by which observations are omitted of the model development, and then the response values (Y) for the omitted observations are predicted by the model and compared with the real values. This procedure is repeated several times until every observation has been omitted once and only once. After developing a reduced model, the difference between the real and predicted Y values is calculated for the test set. The predictive residual sum of squares (PRESS) represents the sum of the squares (SS) of these differences from all the parallel models:

) 2 (23) PRESS ∑ ( −= yy imi ) ) where yi denotes the observed Y value, yim represents the predicted Y value, i expresses different observations, and m stands for different dependent variables. PRESS is transformed to R2 to estimate the fit to the data, and R2 measures how much variation in Y is described by the model: RSS 2YR 0.1 −= (24) SSY .corrtot

23 Q2 estimates the predictive ability, and expresses how much variation is predicted by the model: PRESS 2YQ 0.1 −= (25) SSY .corrtot R2Y is usually higher than Q2Y.

SSY is equal to the initial sum of squares for the dependent variables. The error in the prediction of the compounds can be calculated by the root mean square error of prediction (RMSEP). RMSEP is the average prediction error in the same unit as Y, and is used to measure accuracy: PRESS RMSEP = (26) N where N is the number of compounds in the validation set. Neural Networks Artificial Neural Networks (ANN) are paradigms that are inspired by the way biological nervous system, just as the brain consists of neurons that are connected with one another. A neural network consists of interrelated artificial neurons, the neurons work together to solve a given problem. The classification of ANN: Two mainly different neural networks, supervised neural networks (SNN) and unsupervised neural networks (UNN), are widely employed for data analysis in chemoinformatics and bioinformatics (Schneider and Wrede, 1998). SNN are used in function approximation, classification, pattern recognization and prediction tasks, in which, one or more corresponding functional values for each data point must be already know. However, UNN can be employed to similar tasks even without prior knowledge of molecular activities, and UNN can automatically recognize the clusters of data in a given set based on a rational representation of the compounds and a sensitive similarity measure. Building blocks of neural network architecture: ANN consists of two elements: formal neurons, and connections between the neutrons. At least two layers are required for construction of a neural network, including an input layer and an output layer. Fully connected feed-forward networks are by far the most frequently used neural networks for chemoinformatics, in which every neuron of one layer is connected to each neuron of the following layer, and no intra- layer connections exist (Zupan and Gasteiger, 1993). Unsupervised training of neural networks: a special type of UNN, self organizing maps (SOM) or Kohonen networks, was used for splitting approach to obtain prediction set in paper IV. SOM comprises connected nodes that have a data vector associate with them. Descriptor vectors are calculated for test molecules and SOM nodes are assigned corresponding

24 vectors. As a result, each test molecule is mapped to the node having the smallest distance to its descriptor vector in chemical space (Kohonen, 1982). Compounds distribution can be projected from high dimensional descriptor spaces on two dimensional arrays of nodes, and therefore, SOM is a dimension reduction technique. Supervised training of neural networks: feed-forward networks in SNN were used in paper IV. All neutrons in a supervised net, a data vector is fed into the network by computing an output value each time. Testing the ability of neural network models: cross-validation was used to evaluate the prediction performance.

25

26 Physicochemical Properties

Vapor Pressure (P) Vapor pressure is one of the important physicochemical properties in predicting the distribution of chemicals. The vapor pressure is the pressure of the vapor of a compound at equilibrium with its pure condensed phase (liquid or solid) and depends on the ambient temperature. At a given temperature, the energies of the chemical molecules in the condensed phase have a distribution of energies based on the Maxwell-Boltzmann distribution, and some of molecules with sufficient energy overcome the forces of attraction from their neighboring molecules. If these energies on the molecular surface of the condensed phase are greater than the molecular interaction forces, the energies will escape from the condensed phase. When the rate of escape from the condensed phase equals the rate of condensation, the partial pressure of the gas is the vapor pressure at equilibrium. The vapor pressure is classified as solid vapor pressure (PS) and (subcooled) liquid pressure (PL). The energy required to transfer molecules from the solid to gas phase is higher than transferring them from the corresponding liquid to gas phase; PL is thus larger than the corresponding PS of the solid compounds. Knowledge of the properties of subcooled liquid compounds is useful to quantify the molecular situations in the environment in which the molecules are in liquid phase, e.g., dissolved in water, even though the compounds are solids in their pure state. We use the pure liquid compound as the reference state for describing partitioning processes. Vapor pressure can be directly measured using gas chromatography (GC) (Hinckley et al., 1990).

Partitioning Coefficient n-Octanol/Water (Kow) The organic solvent n-octanol is widely used for predicting partitioning between organic phase and water. The partitioning coefficient Kow is the ratio of the concentration of a dissolved compound in two phases of these immiscible solvents at equilibrium: C −octanoln (27) Kow = Cwater Kow is sometimes denoted as P (for partitioning). An n-octanol molecule has an apolar part and another bipolar functional group; this molecular property ensures favorable interactions with bipolar and monopolar solutes, and therefore n-octanol as a solvent can efficiently accommodate most kinds of solutes. Kow has been accepted as a descriptor of the hydrophobicity of a molecule. Hydrophobicity is a crucial physicochemical parameter determining

27 both the pharmacokinetic and pharmacodynamic behaviors in a molecule. The hydrophobic effect is the main driving force for drugs binding to their receptor targets (Eisenberg and McLachlan, 1986), and compounds with higher Kow values often bioaccumulate. The classical and reliable technique for determining Kow is the separation funnel method: the compounds dissolve in the two phases at equilibrium, and ultraviolet/visible (UV/VIS) spectroscopy is normally used to measure the distribution of solute. Another quick approach is high performance liquid chromatography (HPLC), in which the solute passes through the reversed- phase column (e.g., C18 alkyl chains covalently bound to silica beads), the hydrophilic chemicals elute first, and hydrophobic compounds elute last. The retention factor k is expressed as: − tt k = R 0 (28) t0 where tR is retention time, and t0 is hold-up time. Kow is deduced from the retention factor k (Valkó, 2004):

log ow = ∗ log + bkaK (29)

The regression coefficients a and b must be determined by appropriate reference compounds from each chromatographic system. Once a chromatographic system is established and calibrated, many compounds can be investigated immediately. The classical method for predicting Kow is the fragment or group contribution method (Hansch and Leo, 1979). The logKow of a compound is determined by the sum of its non-overlapping molecular fragments, and the fundamental fragments are derived from a limited number of simple molecules. This method cannot be used to predict Kow for molecules containing unusual functional groups in which the parameters have not been developed. The similar atom/fragment contribution (AFC) method can be used to estimate logKow (Meylan and Howard, 1995). With a large database of logKow values, fragment coefficients and correction factors can be derived by MLR of given fragment coefficients and correction factors. The equation of estimation is as follows:

(30) log ow ∑ kk ∑ cnfnK jj +⋅+⋅= constant k j

where fk is the fragment constant, cj is the correction factor, nk and nj are the frequency of each type of fragment and specific interaction, respectively. When the electronic or steric interactions of functional groups within a molecule influence the solvation of the compounds, correction factors are required. If the interaction reduces the overall hydrogen donor or hydrogen acceptor capability of the compound, the correction factor is positive; if the

28 opposite is the case, the correction factor shows negative values. These calibration models often become complex with hundreds of fragment coefficients and correction factors.

Water Solubility (Sw) Water is one of the major transport and reaction media for organic compounds in natural systems due to small size of the water molecule and the hydrogen bonding properties. Water solubility reflects the concentration of a compound for a saturated solution in equilibrium. For the assessment of the behavior and effect of organic compounds in the environment, accurate data on water solubility is of the utmost importance. The shake-flask method is a conventional determining approach, which works well for many soluble chemicals at equilibrium. However, the experiments become difficult for sparingly soluble compounds, such as PAHs, PCBs, and dioxins, due to the formation of emulsion or microcrystal suspensions. In these cases, we can use the filtration approach with the help of an adsorption mechanism. Prediction of aqueous solubility is important in the environmental sciences. The most useful approaches for estimation of water solubility in nonaqueous solutions are based on a functional group contribution. The universal functional activity coefficient (UNIFAC) is a reliable and faster approach for predicting the aqueous solubility/activity coefficients of non-electrolytes in liquid mixture (Fredenslund et al., 1975; Hansen et al., 1991). This model has been extended to water as a solvent. The fundamental principle is that a compound is composed of functional groups, and each group makes a unique contribution to aqueous solubility. A large number of compounds merely contain limited numbers of functional groups, and therefore it is possible to predict many compounds using a small amount of organic functional groups. However, the accuracy of the UNIFAC model remains controversial (Kan and Tomson, 1996). Another similar approach focusing on the prediction of aqueous solubility is well known as aqueous functional group activity coefficients (AQUAFAC) (Myrdal et al., 1993). Its fundamental idea is to express enthalpic and entropic contributions to the excess free energy by summing up interactive terms of parts of solute, solved organic molecules, and their functional groups. Some promising methods for predicting water solubility are based on topological, geometric, and electronic molecular descriptors (Mitchell and Jurs, 1998). The advantage of this QSPR approach is that it can be used for any compound for which the molecular structure is known; the disadvantage is that software is required, and the principles behind the method may not be transparent for the user.

29 Melting Point The melting point (MP) of a solid is the temperature at which the vapor pressure of the solid and the liquid are equal. Melting or fusion is the phase transition in which the crystalline phase changes to the liquid phase. At a fixed pressure, the melting point of a crystal of a pure substance is the temperature at which the crystalline and liquid phases are in thermodynamic equilibrium. The heat required to melt the crystal is absorbed at a constant temperature and is referred to the latent heat of melting, or latent heat of fusion. The latent heat of fusion is equal to the enthalpy change of fusion, since melting occurs at a fixed pressure. On the contrary, if the temperature of the reverse change from liquid to solid, it is referred to freezing point or crystallization point. However, the freezing point is not considered to be a characteristic property of a substance due to the ability of some substance to supercool. When a crystal melts, at the melting point the change in Gibbs free energy (ΔG) of a substance is zero, however, both of the enthalpy and entropy of a substance increase. The enthalpy change of fusion is a measure of the amount of energy required to convert the crystal to a liquid. The entropy change of fusion is a measure of the increase of randomness or disorder when the molecules are released from the constraints of the crystal into the relative freedom of the liquid (Brown and Brown, 2000). When melting occurs, the Gibbs free energy of the liquid becomes lower than the solid for this substance. A high melting point could be caused by a high enthalpy of fusion or low entropy of fusion or both: ΔH fus (31) Tfus = ΔS fus With the increasing in molar mass, molecular melting point usually increases. And high molecular symmetry is associated with high melting point, since symmetry in a molecule lead to the effect of reducing entropy gained on melting of that substance (Pinal, 2004).

Boiling Point The boiling point (BP) of a substance is the temperature at which the vapor pressure of the liquid equals to the environmental pressure surrounding the liquid. The boiling point represents the point at which the liquid molecules possess enough thermal energy to overcome the various intermolecular attractions binding the molecules as liquid, and therefore, the boiling point is an indicator of the strength of weak attractive forces between the molecules of a liquid. The increase in boiling point is correlated with the increase in molecular size and polarizability. The normal boiling point of a liquid is the specific case in which the vapor pressure of the liquid equals the defined atmospheric pressure at sea level under 1 atmosphere.

30 The boiling point increases with increasing pressure up to the critical point in which the gas and the liquid properties become identical, the boiling point cannot be increased beyond the critical point. Likewise, the boiling point decrease with decreasing pressure until the triple point is reached, and the boiling point cannot be reduced below the triple point. Boiling point was used in QSPR model to predict vapor pressure, and the vapor pressure of chemicals affects their resident time in soil, water and air. And therefore, this model is useful to predict the distribution of transport of chemicals into environment, and vapor pressure can be used in fugacity calculations (Müller and Klein, 1993).

Perfluorinated Compounds Perfluorocarbons are compounds containing fluorocarbon derived from hydrocarbons by replacement of hydrogen atoms by fluorine atoms. Perfluorocarbons are made up of only fluorine and carbon atoms. A perfluorocarbon can be a linear, cyclic or polycyclic compound. Perfluorinated compounds (PFCs) are perfluorocarbons or derivatives with some functional group attached. PFCs show chemical inertness and thermal stability due to the strength of the carbon-fluorine bond and the shielding effect of the fluorine atoms. The electronegativity of fluorine reduces the polarizability of the electron clouds, and further reduces van der Waals forces between fluoro compounds (Lemal, 2004). Several perfluorocarbons are gases at normal temperature; however, perfluorocarbon liquids are colorless and high density attributing to their high molecular weight. Very low intermolecular interactions lead to low viscosity in liquid, low surface tension, low heat of vaporization and particularly low refractive indices. As a landmark development in organofluorine chemistry, Roy J. Plunkett serendipitously discovered Teflon at Dupont in 1938. Since the 1950s, PFCs have been widely used in a variety of industrial and consumer products. The physicochemical properties of fluorinated surfactants are distinctly different from those of hydrocarbon surfactants (Moody and Field, 2000), it is well known that surfactant molecules of hydrocarbon consist of one polar head group and another hydrophobic tail; the molecules of fluorinated surfactants containing one hydrophilic head, however, the nonpolar tail possesses of hydrophobic and oleophbic (oil-repelling) properties. Fluorinated surfactants are widely used for aqueous film fire-fighting foams (AFFFS) in mining, oil well industries, insecticide, alkaline cleaner, floor polishes, shampoos, as well as protectors in carpets, leather, paper, fabric and upholstery (Giesy and Kannan, 2002). PFCs such as perfluorinated acids (PFAs), Fluorotelomere alcohols (FTOH), and perfluoroalkyl sulfonates (PFAS) are a cause for environmental

31 concerns, since the stability of the C-F bond in PFCs makes them long-range transport, persistent, and accumulation in environment (Armitage et al., 2009). In 1999, the United States Environmental Protection Agency (U.S. EPA) began an investigation of PFCs due to persistent and bioaccumulative properties in human population (Olsen, 2003) and in wildlife (Kannan et al., 2002) around the world. Recent studies on animals have revealed negative effects in expression of the greatest number of genes (Guruge et al., 2006). These scientific results provide that PFCs sustaining accumulation in the environment have threatened human health and ecosystem. Fluorotelomere alcohols (FTOH) are widely used as intermediates during the synthesis for coatings, polymers and paints in the surface-active modification of consumer products (Dinglasan-Panlilio and Mabury, 2006). Metabolism of the 8:2 FTOH in rat produced perfluorooctanoate (PFOA) (Hagen et al., 1981; Martin et al., 2005), and FTOH acts as precursors to the perfluorinated acids (PFAs) that have been detected in the environment. 6:2 FTOH could disturb fish reproduction through endocrine disrupted activity (Liu et al, 2009). In this thesis, we mainly develop PLS models with different kinds of theoretical molecular descriptors, including the group or fragment descriptors referred to above. We choose the most relevant variables based on variable importance in the projection (VIP) for predicting the physicochemical properties and retention factors in RPLC, as well as exclude the variables that are less related, as a result, more accurate and global models will be developed using PLS approach.

32 Results and Discussion

Quantitative molecular interactions govern a chemical’s physicochemical properties. The basic assumption is that van der Waals and polar interactions mainly contribute the free energy of transfer, and that these contributions are additive:

vdW polar 12 i 12 i Δ+Δ=Δ 12GGG i (32)

Prediction of Vapor Pressure We collected data for a diverse set of 920 organic compounds from the PhysProp Database (Syracuse Research Corporation, Syracuse, NY). Their melting points were below 298.15 K and molecular weights between 30 and 500 amu. In addition, we collected data for 420 organic compounds with liquid or subcooled liquid vapor pressure at 298.15 K from the literature (Mackay et al., 2006). In all, 1479 molecular descriptors were used for modeling (Öberg and Liu, 2008). The data were randomly split into a calibration set of 893 objects and a test set of 447 objects. After 23 outliers were excluded, and 378 molecular descriptors were selected, 99.0% of the variance was explained for 10 components in the calibration set, and the RMSEC was 0.294. In the test set, 27 outliers were excluded, 98.0% of the variance in the model was explained, and the RMSEP was 0.410. The bilinear PLS models usually accommodate nonlinear behavior by the addition of more latent variables, and 10 significant latent variables indeed indicated such nonlinear relationships. When the PLS model is global, it describes a large variety of compounds with vapor pressures varying over fourteen orders of magnitude. Local models that cover a more limited calibration domain with similar compounds could be expected to be both simple and more accurate. In the global PLS model, the first component explained 91% of the variance in vapor pressure and 33% of the variance in the descriptors. The descriptors of the first latent variable mainly contribute molecular size, van der Waals interactions, and therefore this latent variable is closely correlated to the Ghose-Crippen molar refractivity (Ghose and Crippen, 1987; Viswanadhan et al., 1989). The second component explained 4.6% of the variance of vapor pressure and 5.7% of the variance in the descriptors. The main contributors are the number of hydroxyl groups, the number of donor atoms for hydrogen bonding, and the topological polar surface area. However, the fluorine functional group reduces polarizability.

33 The results showed that the first two components capture the main information from the variance, and they correspond with LFER, in which van der Waals and polar interaction contribute to the free energy of transfer. We evaluated different sizes of subsets and number of latent variables to investigate the benefit of local modeling. There is a definite improvement in performance with the local model approach when only a few latent variables (one or two) are extracted. This difference diminishes and when six or more latent variables were extracted, a model using close to the full calibration data set shows the best performance, as measured by RMSEC. The relationship between RMSEC, the number of latent variables, and the number of neighboring compounds considered is show in figure 3.

0.9

0.8

0.7

0.6 RMSECV (logPa) 0.5

0.4 25 100 175 250 325

0.3 400 475 1 550 3 5

625 Neighbours 7 9 700 11 775

Number of latent 13 15 variables 850 Figure 3. RMSEP vs. the number of latent variables and neighboring compounds in the calibration model.

Local modeling did not improve the overall prediction performance; however, when we used only a few latent variables, a local model indeed performed better.

Comparison of Modeling of Environmental Properties between LSER and PLS We collected data from Physprop Database and Absolv Database ADME Boxes v 3.5 (Pharma Algorithms, Vilnius, Lithuania). In the LFER model, the factors are intended to describe different aspects of van der Waals and polar interactions. Abraham et al. developed LSER for the correlation with five solute descriptors that were determined from chromatography or

34 generated via water-solvent interactions using empirically derived equations (Abraham et al., 2004).

+= + + + + vVbBaAsSeEcSP (33)

Theoretical molecular descriptors were generated and reported as 1664 variables. However, the MLR approach is not applicable for a data matrix with a larger number of variables than the number of observations or for a matrix containing collinear variables. However, the PLS model works well in this case. The solvent properties are linearly related to the descriptors of organic molecular structures, and the mathematical relation can be expressed using score t and weight q in Y matrix in the PLS model (Wold and Sjöström, 1998):

44332211 +⋅⋅⋅++++= etqtqtqtqSPSP (34)

The residual, e, represents the deviation between the data and the model. We tested five dimensions in PLS corresponding to the e, s, a, b, and v coefficients in the LSER model. The principal components cannot represent exactly pure physicochemical properties, and the few data are orthogonal to each other, but the local directions in multidimensional space capture main variations, hence the variables are termed latent variables. The PLS model tolerates collinear variables and randomly missing data in the matrix, and the model with reduced dimensions is still stable. The data for developing the model were retrieved from the PhysProp database (Syracuse Research Corporation, Syracuse, NY) and the Absolv Database ADME Boxes v 3.5 (Pharma Algorithms, Vilnius, Lithuania). Experimentally determined values of Kow and Sw at 298.15 K were found for 1529 and 893 compounds, respectively. As some compounds lacked one or more of the solute descriptors and the data sets selected for modeling of Kow and Sw, 1505 and 884 compounds were used for modeling of Kow and Sw, respectively.

Prediction of Partitioning Coefficient (Kow) To develop LSER and PLS models, we chose 1505 compounds with widely different molecular structures (Liu and Öberg, 2008). The data set was randomly sorted, and the data were mean centered and scaled to unit variance. Scaling is to ensure that all the variables have equal opportunities to influence the model, and centering greatly facilitates interpretation of the model. The variable partitioning coefficient contained a few extreme values that exerted a large influence on the model and dominated over other measurements; logarithmical transformation resulted in a fairly symmetrical distribution of the data.

35 To test the predictive power of the model, the data were split into a calibration set with 2/3 of the compounds and a prediction set with 1/3 of the compounds. We developed LSER with 1003 compounds and 5 solute descriptors, after exclusion of 37 outliers. The LSER equation is:

Kow = + − − − + 698.3371.3042.0059.1658.0129.0log VBASE n= 966, R2 = 0.978, SD = 0.28, F = 8.7×103 (35)

where the unit of V is cm3mol-1/100, n is the number of the compounds, R2 is the overall correlation coefficient, SD is standard deviation, and F is the Fischer statistics. The term hydrogen bond acidity A is not significant; therefore, we recalculated the model after excluding the hydrogen bond donor (HBD) descriptor:

Kow = + − − + 707.3380.3066.16567.01235.0log VBSE n= 966, R2 = 0.978, SD = 0.28, F = 1.09×104 (36)

Abraham et al. reported an LSER equation with 5 parameters (Abraham, Chadha, Whiting and Mitchell, 1994):

Kow = + − + − + 814.3460.3034.0054.1562.0088.0log VBASE n = 613, R2 = 0.995, SD = 0.116, F = 2.3×104 (37)

The fit of the Abraham equation is better than the result we reported, probably because of our choice of different organic molecules, and because we collected compounds with widely different molecular structures. However, if we used our calibration data to fit their equation, the two equations showed similar results.

Table 1. Comparison of the model of logKow between LSER and PLS Na) Ab) LSER PLSR 2 2 2 2 R Q Ext RMSEC/ R Q Ext RMSEC/ RMSEPc) RMSEP Calibration 966 4 0.978 0.280 0.957 0.411 Prediction 502 4 0.955 0.410 0.933 0.498 a) Number of compounds, b) Components, c) Root mean square error of calibration and root mean square error of prediction.

To compare with the LSER model, we used 1664 theoretical molecular descriptors to calculate logKow. The PLS calibration model is mathematically and statistically equivalent to a LFER (Wold, 1974). PLS can tolerate collinear variables and missing data better than MLR in short and wide data

36 matrices. In a large number of molecular descriptors, not all were relevant to the model. Variable importance in the projection (VIP) showed an importance in the PLS model; the terms with the largest values were the most relevant for explaining Kow (Wold et al, 2001). VIP is the sum of all the model dimensions of the contributions of the variable influence. We selected the 92 largest descriptors in the model without any lowest values; the numbers of the components were set to four, the same as those in the LSER model. In the PLS approach, we chose four components that were same as in the LSER model. The first two components explained 93.4% of the variance in Kow; the third and fourth components only increased the explanation by 2% on the response variable. the first component is strongly correlated to the McGowan molar volume, which is equivalent to the van der Waals volume (Zhao et al., 2003). Hydrogen bond acceptor (HBA) and polarizability have been considered to be crucial factors (Dearden et al., 2000) and were described by the second component. The results of PLS coincide with those in LSER; molecular volume (V) and the molecular basicity descriptor (B) are significant in the LSER equation. The results showed that van der Waals and polar interaction mainly contributed to the free energy of transfer.

Prediction of Water Solubility (Sw) We collected 884 compounds and the data were randomly split into a calibration set with 589 compounds and a prediction set with 295 compounds (Liu and Öberg, 2008). The LSER with five parameters is:

Sw = − + + + − 952.3683.36697.05896.09462.0679.0log VBASE n= 552, R2 = 0.948, SD = 0.618, F = 1.98×103 (38)

Abraham et al. reported an LSER equation with 594 compounds (Abraham and Le, 1999):

Sw = − + + + − 050.4279.3646.0851.0061.1849.0log VBASE n= 594, R2 = 0.895, SD = 0.63, F = 1.04×103 (39)

Table 2. Comparison of the model of logSw between LSER and PLS N A LSER PLSR 2 2 2 2 R Q Ext RMSEC/ R Q Ext RMSEC/ RMSEP RMSEP Calibration 552 5 0.948 0.618 0.959 0.546 Prediction 295 5 0.879 0.937 0.922 0.740 Abraham et al. reported that SD cannot be lower than 0.5 log units when the molecules contain widely different structures (Abraham and Le, 1999). The parameters in these two equations are very similar; the coefficients of B and V showed significance, that is, hydrogen bond acceptors (HBA) show a

37 positive effect on aqueous solubility; in contrast, larger molecules show negative results on aqueous solubility. We used five components to develop a PLS calibration model to predict logSw of 295 other compounds, 92.2% variation was predicted, with the model showing a high power of prediction. The first two components explained 92.9% of the variation in the response variable; two more components only increased the variance by 3%. The results of PLS and LSER showed that hydrogen bond acceptor (HBA) and molecular size (V) mainly contributed to water solubility. Comparison of Models - LSER and PLS

We predicted logKow and logSw with five Abraham parameters in the LSER model. The E and V descriptors are calculated from molecular structure, and the S, A, B descriptors are generated from experiments or estimated from empirically derived equations. It is easy to interpret the physicochemical properties in the LSER model: obviously, the coefficients in the LSER equation directly give information of positive or negative effects on the physicochemical properties (Kamlet and Taft, 1985). The compounds with a series of similar structure can generate a better-fitted model. LSER uses MLR based on the assumption that each X is a fixed variable. It is limited in that several related data series were analyzed simultaneously, and another assumption of MLR is that the predict variables are exactly relevant for the response variable in the model - MLR cannot provide any diagnostics for testing the validity of this assumption. Lack of collinearity is difficult to meet chemical data, and collinearity leads to increased errors in the model (Quinn and Keough, 2002). PCA and PLS can overcome collinearity and missing data (Wold and Sjöström, 1986), and the score plot and loading plot as the diagnostic tools effectively search outliers in the model (Wold, 1993). In the PLS approach, the first principal component can capture main chemical information that represents main variance (Kamlet et al., 1987). Principal component analysis requires that the second principal component is orthogonal to the first, the third principal component is orthogonal to the first and second, and the fifth principal component is orthogonal to the first, second, third and fourth. Although the PLS factors are not always possible to interpret directly in chemical term (Kvalheim, 1988), rotations or projections enhance the interpretation. The similarity between the LSER and PLS models indicates that a direct chemical interpretation of the latent variables is possible (Öberg, 2007). The main factors in the PLS model are related to the molecular volume, polar property, and polarizability. The dimensionality of PLS is much lower than for the group contribution models, and the PLS approach avoids both “overfitting” and “underfitting” the model. The predictive performance seems

38 to be similar for both LSER and PLS approaches. The choice of model should therefore be driven by the availability of data and predictive power.

Extension of a Predictive Model for Vapor Pressure We updated an existing QSPR model of vapor pressure to cover the chemical domain of interest (Öberg and Liu, 2010). This approach can integrate general information from other parts of the chemical space. Different sample weighting schemes have been proposed as a possible solution to update an existing model with only a few new calibration samples (Stork and Kowalski, 1999; Capron et al., 2005). The model structure can be revised during this update or it can merely be on adjustment of the parameters. A few new samples represent the key PFCs compounds. The predictive performance is subsequently evaluated with experimental data not used for the model recalibration. The influence of different weighting on sample leverage, domain coverage, and predictive performance is described. The updated model is finally applied to a larger set of environmentally relevant PFCs. The empirical vapor pressure data for the selected PFCs were collected from the literatures (Steele et al., 2002a, 2002b; Kaiser et al., 2004; Kaiser et al., 2005; Krusic et al., 2005) and recalculated to a liquid or subcooled vapor pressure (PL) at 298.15 K. Additionally, 434 long-chained surface active PFCs molecules were selected for further study. A simple approach to give samples additional weight is by including several copies of each of them. Such a sample weighting scheme is of particular value when only a few new objects are available from a data cluster outside the original calibration domain. Two previously excluded outliers together with the three selected compounds; perfluorononanoic acid, perfluorododecanoic acid, and 8:2 fluorotelomer alcohol, within previous model (Öberg and Liu, 2008) were used for recalibration. The sample weight of three new PFCs compounds needs to be increased to 14 to fit all three calibration objects inside the previously determined leverage limit. Compared with previous model, the performance of the new model showed a very slight decrease in explained variance (R2) and similar slight decrease in RMSEC and RMSEP. Subsequently, remaining eight PFCs compounds were used for validation, as a result, 99.4% variance was explained, and RMSEP was 0.199 units of log Pa. The method was compared with that of SPARC v4.5 and the EPI Suite v3.20; SPARC is a computer program for automated reasoning in chemistry and used to predict physical properties, and the EPI Suite is a set of computer programs for physical property estimations developed by the US EPA’s Office of Pollution Prevention Toxics and Syracuse Research Corporation. The results showed that the QSPR model provided accurate predictions for the validation set while both reference methods have a substantial systematic bias and overprediction for vapor pressure.

39 434 PFCs compounds were used for development of QSPR model. Applicability domain was defined at 5% confidence interval, as a result, 430 compounds fell outside the applicability domain of the original model; after recalibration, 204 compounds remained outside the applicability domain, and therefore, it is possible to predict the liquid vapor pressure for more than 200 PFCs in which no reliable experimental data are available. It is a successful method to update a previous model to cover an extended applicability domain, and selective sample weight is useful in developing new model.

QSPR Models of Perfluorinated Chemicals (PFCs) Perfluorinated chemicals (PFCs) are C4-C16 carbon chain compounds predominantly substituted by fluorine. Their physicochemical properties, such as melting point (MP) and boiling point (BP) of PFCs are crucial to determine the solubility, transport, distribution and environmental fate (Bhhatarai et al., 2010). 55 MP data points of and 77 BP data points were collected from the SRC- Physprop database. Additionally, 19 data points of MP were collected from the literature of Trepka et al. (Trepka et al., 1974), 7 data points from Gajewski et al. (Gajewski et al., 1988) and 13 data points from Platonov et al. (Platonov et al., 2001), meanwhile, 16 data points of BP were collected from Hendricks (Hendricks, 1953). As a result, totally 94 and 93 compounds were available for QSPR modelling of MP and BP data respectively. The data were split by two different approaches: self organizing maps (SOM) and random approach. The calibration sets were used to select the modelling descriptors, and prediction sets were used to test the predictive performance of each model. In addition, a completely blind external validation set was collected from PERFORCE data set (EV-set) to strongly assess the performance of predictive models. The compounds were expressed by SMILES to calculate E-state (Hall and Kier, 1995) and fragment-based descriptors. Subsequently, the molecules were optimised using semiempirical AM1 approach of the HYPERCHEM software, and theoretical molecular descriptors were calculated using Dragon software. QSPR models were developed by four different groups: University of Insubria (UI), Ideaconsult Ltd. (IDEA), Linnaeus University (LNU), and the German Research Center for Environmental Health (HMGU). UI: Uncorrelated molecular descriptors were excluded from the results of calculation of Dragon, and the data sets were subjected to variable selection approach using Genetic Algorithm (GA) (Leardi et al., 1992). The development of models was carried out by MLR approach using the ordinary- least-squares (OLS) techniques. The models were initially developed by all the

40 subset procedure, then starting from the obtained combination, 100 different models were developed by exploring new combinations with 3 and 4 variables, GA selects new variables. IDEA: The models were developed using MLR based on OLS approach. Each coefficient in the models must be significant observing the formal criterion Relative Standard Deviation (RSD), RSD<25%; the choice of additional descriptors obeys a chemical logic in order to fix compounds with bad model value attempting to handle the missing structural fragments; the additional new descriptors must reduce RMSEE value and increase R2 and Q2 values. For UI and IDEA, several validation methods were employed such as 2 the bootstrap technique Q BOOT (Efron, 1979) (repeat 5000 times) and 2 randomly recorded Y-data termed Y-scrambling – R YS (repeated for 500 iterations). The predictive performance of the models was test in both the 2 2 2 splitting criteria internally by Q LOO and Q BOOT, at least two different Q Ext 2 2 were used: Q F1 (Shi et al., 2001) and Q F3 (Consonni et al., 2009). LNU: PLSR was used for developing models. The significant molecular descriptors were identified using VIP method, cross-validation (CV) was used 2 to calculate Q CV, the data analysis and multivariate calibration was carried out with the software SIMCA-P v.11.0. HMGU: Associative neural networks (ASNN) represent a combination of an ensemble of feed-forward neural networks and k-Nearest Neighbors (kNN). The neural networks ensemble of 100 networks with one hidden layer was used, the neural networks had 3 hidden neurons. 5-fold CV was used to estimate prediction performance. The descriptors and models were developed using OCHEM. The blind external validation set (EV-set) of compounds for MP and BP modelling were obtained from PERFORCE report, the EV-set contained long chain perfluoroalkylated chemicals, as well as 15 compounds for MP and 25 compounds for BP. QSPR Models with Melting Point MP models were developed using 94 compounds (93 for HMGU) containing mainly long chain alkylates, saturated cyclic and few aromatic PFCs, the data were splitting into calibration and prediction set, additionally 15 compounds from PERFORCE were used as EV-set.

Table 3. QSPR models for MP 2 2 2 2 2 2 Group R Q LOO Q BOOT RMSETR RMSEExt Q F1 Q F3 Q Ext UI 80.26 78.06 77.64 40.24 42.42 - - - LNU 82.57 76.3 - 38.64 25.96 - - 50.02 IDEA 79.80 77.70 - 40.67 38.82 65.10 81.60 HMGU 85.00 - - 37.00 34.00 - - -

41 UI: The 94 compounds were split into training and testing sets by SOM and random approaches, and the data were used to develop 100 models, each 2 2 containing 4 descriptors. The models were carried out with Q LOO and Q Ext, additionally, EV-set containing 15 PFCs from PERFORCE report was used to verify the model after the development of calibration model. The full model containing 94 compounds can be expressed as:

MP = 132.95(±22.63)AAC + 3.04(±0.97)F02[C-F] – 18.97(±7.98)C-013 + 209.04(±139.9) (40)

LNU: To compare with MLR models, PLSR models containing error data with 94 compounds and 1266 molecular descriptors were developed, the 37 largest values of molecular descriptors were selected using VIP approach. 2 components were generated using CV, however, we reported 3 components to be equal to that of calibration model. The first components explained 77.80% of the variance in MP and 55.30% of the variance in molecular, the second component explained 2.30% of the variance in MP and 17.80% variance in calculated descriptors. In PLSR model, the two components captured main information from the variance; the first component was dominated by molecular interaction, and the second component was mainly related to polar interaction. The same model with the correct data provided following as: 2 2 n=93, R =88.80, Q LOO=85.90, RMSE=31.61, RMSEExt=25.81, 2 Q Ext=52.87 IDEA: MLR model was developed using 3 variables, and the model was defined as:

2 F MP = 2.34(±0.28)ln W – 164.76(±16.51)W rel + 33.14(±8.90)NHBDON + 79.38 (41)

HMGU: 93 compounds were used to develop ASNN model calculated using 5-fold CV, the increase of the training set size improved performance of model and reduce error. QSPR Models with Boiling Point BP models were developed using 93 compounds, and the data were split into training set and test set by means of SOM and random by response activity, 25 compounds for PERFORCE were used for external validation to determine the prediction power of models.

42 Table 4. QSPR models for BP 2 2 2 2 2 2 Group R Q LOO Q BOOT RMSETR RMSEExt Q F1 Q F3 Q Ext UI 93.98 92.92 92.75 19.73 30.32 - - - LNU 97.09 94.60 - 14.11 21.92 - - 90.47 IDEA 95.40 94.10 - 17.25 22.56 96.20 92.10 - HMGU 85.00 - - 32.00 22.00 - - -

UI: The MLR models were developed with 4 variables using OLS approach, the full model with 93 compounds is presented as:

BP = 137.04(±8.51)ATS1m – 76.33(±8.75)Ms + 102.69(±12.06)nROH – 1.96(±0.89)AMW + 92.11(±52.5) (42)

LNU: In the model containing error data, the 149 largest descriptors were selected using VIP approach, 3 components were calculated using CV, however, we reported 4 components to equal to those generated from calibration model. In the PLSR model, the first component explained 83.50% of the variance in BP and 30.80% of the variance in the molecular descriptors, and these descriptors were related to molecular interactions. The second component explained additional 9.86% of the variance in BP and 10.60% variance in molecular descriptors, and these descriptors were mainly related to polar interactions. The remaining component only explained 3.16% of the variance in BP and 7.36% in descriptors. The same model with the correct data provided following as: 2 2 n=93, R =97.04, Q LOO=94.50, RMSE=14.13, RMSEExt=21.92, 2 Q Ext=90.47 IDEA: The MLR models were developed using 7 variables, the equation is expressed as:

1/4 F BP = 138.44(±4.24)MW – 220.98(±11.25)W rel + 42.83(±6.64)NHBDON – 24.52(±4.32)N[Cl,Br,I] – 32,67(±5.68)NCOC – 50.08(±9.85)NC(=O)([!O])[!O] – 18.58(±6.71)NN – 188.36 (43)

HMGU: 93 compounds were used to develop ASNN model calculated using 5-fold CV, the full model showed 32.00 of RMSE. Applicability Domain QSPR models should be verified for their applicability with regard to chemical applicability domain (AD) to produce reliable predicted data for chemicals that are not too structurally dissimilar. An additional 303 compounds for MP and 271 compounds for BP were selected for the structural study. UI and IDEA: The Williams plot (Papa et al., 2005) was used to verify the presence of response outliers. The leverage approach was employed for the

43 definition of the structural chemical domain of each model for new chemicals without experimental data (Gramatica et al., 2004). UI: For MP, 4 compounds in training set and 16 compounds in test set were out of the AD; for BP, 4 compounds in training set and together 60 compounds in prediction set are out of AD. IDEA: 5 compounds for MP and 115 compounds for BP were out of AD. LNU: The AD of the PLSR model was determined by the residual standard deviation and the leverage to that of the calibration objects. From these distances the SIMCA software estimate the probability that a new compound belongs to the model (PModXPS and PModXPS+). Here we used a probability of belonging to the model at the 99% significant level (PModXPS < 0.01 and PModXPS+ < 0.01) as the cut-off criteria to decide if a compound was outside of the model AD. For MP, 20 compounds in calibration set and 145 compounds in prediction set were out of AD; for BP, 20 compounds in calibration set and 171 compounds in prediction set were out of AD. HMGU: The distance to model based on standard deviation of ensemble prediction was used. The distance to model was calibrated using 5-fold cross- validation values, and then the new compounds can be accurately predicted using the calibrated distance to model. The AD of models was identified by using the standard deviation that cover 95% of compounds from the training set. 91 compounds for MP and 80 compounds for BP were out of AD. Different multivariate techniques were used to develop QSPR models of MP and BP, and each of the methods exhibited its own significance and was useful in different aspects. MLR is well known for its simplicity, less demanding statistical methods and facile interpretation. On the other hand, PLSR can overcome collinear variables and missing data at low dimensionality, and it is possible to interpret the physicochemical principal behind the models. NN can model the data precisely even for non-linearities.

QSRR Models Solvents play critical roles in determining the outcome of many molecular events, and the selection a suitable mobile phase constituents is crucial. Our selection of solvents was carried out with the intention of distributing them in the periphery of the PCA space to ensure a maximum spread in the properties of the selected solvents (Carlson et al., 1988). A global LSER was constructed from local LSER and LSST, further a quadratic term was added to compensate for non-linearities (Liu et al., 2010):

2 log 21 φφ +++++++= vVbBaAsSeESSck (44)

44 The theoretical molecular descriptors were correlated with the retention factors. To compare the similarity and difference between the PLSR and the global LSER models, five dimensions were selected in PLSR model equaling the numbers of parameters in the global LSER model. The mathematical relationship could be expressed using score t and loading q in a PLSR model:

log += + + + + 5544332211 + etqtqtqtqtqck (45)

where e is residual of the model. The components were determined using segmented cross-validation (CV), in which the numbers of CV groups were divided into nine segments corresponding to the nine solvents. Each solvent is present at four different concentrations ensuring dielectric constant values of 40, 50, 60 and 70. Accordingly, the data set for model development was based upon nine solvent systems, at four dielectric constant values, in conjunction with three stationary phases. Global LSER Models Global LSER model for the C8 column: the data of seven solvents were used for calibration and remain two solvent ethanol and acetone used for investigating the predictive power of the model. The parameter E, S and A were determined to be non-significant, and E and S were excluded, as a result, the model was improved and 95.46% of the variance was explained with an RMSEC of 0.1769. The calibration model can be expressed as:

Ck −= Conc + 665.1142.5461.3)8(log Conc2 −+− 055.0407.4067.2 VBA (46)

2 Subsequently, the explained prediction variance (Q Ext) was determined to be 97.47% with an RMSEP of 0.1271.

Global LSER model for the C12 column: again, seven solvents were used for calibration, exclusion of the parameters E and S, resulted in an almost the same proportion of explained variance, 97.56% and a corresponding RMSEC of 0.1306. The estimated global LSER for the C12 column was determined to be:

Ck −= Conc + 976.1611.5649.3)12(log Conc2 −+− 051.0953.3665.1 VBA (47)

2 The explained prediction variance (Q Ext) was determined to be 99.55% with an RMSEP of 0.0606.

45 Global LSER model for the C18 column: after excluding parameters E and S, 97.30% of the variance was explained with an RMSEP of 0.1322. The estimated global LSER for the C18 column was determined to be:

Ck −= Conc + 537.1028.5401.3)18(log Conc2 −+− 051.0071.4697.1 VBA (48)

2 The explained prediction variance (Q Ext) was determined to be 98.31% with an RMSEP of 0.1026. QSRR Models Using Semi-Empirical Descriptors A QSRR model was constructed using three dependent variables and eighteen independent variables including the Snyder polarity index (SPI), concentration, and sixteen solvent scales generated from spectroscopic measurements and computational calculations by Katritzky et al. (Katritzky et al., 2005). Five PLSR components were calculated using nine cross-validation (CV) segments, however, the first two components did not capture the molecular descriptors providing the greatest impact on variation, and therefore, we used Marten’s uncertainty test to select the significant variables. Marten’s uncertainty testing is based on cross-validation (CV), jack-knifing and stability plots to identify non-significant variables (Martens et al., 2001). As a result, three significant variables were identified by this uncertainty test. The calibration model accounted for 87.01% of variance, with an RMSEC of 0.3154. Subsequently, 95.75% variation in the model was predicted, and RMSEP was 0.1645. QSRR Models Using Theoretical Molecular Descriptors In PLSR model, the theoretical molecular descriptors were generated as 1664 independent variables and the three retention factors were used as dependent variables. The significant variables were identified using variable importance in the projection (VIP) approach, as a result, 36 descriptors were selected. Five components were used corresponding to five parameters in global LSER model. 96.77% variance in retention factors was described, and 81.40% of the variance was predicted in the calibration model, an RMSEC of 0.1644 was obtained. The predictive power of the calibration model was examined, and a 2 Q Ext was predicted to 96.45% with an RMSEP of 0.1508.

46 Table 5. Calibration and prediction models. Calibration Prediction 2 2 2 Methods Variables RMSEC R (%) Q (%) RMSEP R Ext(%) LSER(C8) MLR 5 0.1769 95.46 83.10 0.1271 97.47 LSER(C12) MLR 5 0.1306 97.56 95.30 0.0606 99.55 LSER(C18) MLR 5 0.1332 97.30 92.30 0.1026 98.31 SPI(C8) PLS 3 0.3341 86.82 70.40 0.2439 91.73 SPI(C12) PLS 3 0.3273 86.85 81.50 0.1022 98.51 SPI(C18) PLS 3 0.3155 87.01 82.50 0.1645 95.75 SPI PLS 3 0.3154 87.01 81.30 0.1645 95.75 TMDa)(C8) PLS 36/5 b) 0.2061 95.47 58.30 0.1736 96.10 TMD(C12) PLS 36/5 b) 0.1845 96.17 81.60 0.1005 98.48 TMD(C18) PLS 36/5 b) 0.1641 96.78 80.30 0.1525 96.42 TMD PLS 36/5 b) 0.1644 96.77 81.40 0.1508 96.45 a) Theoretical molecular descriptors, b) the model at 5 components.

A global LSER model combined with the LSST and local LSER model, not only precisely predicted retention, but also directly provided the information of positive or negative effects of the parameters on retention factors. The coefficients, conc2, b, s and e, showed positive values, implying the solvent system favors analyte interaction with the stationary phase thus contributing to increased retention. The parameters, conc, a and v showed negative values, which reflect a more favorable interaction between analyte and the mobile phase, leading to decreased retention. The coefficients, v and b, are significant in global LSER model. This result was in accordance with the free energy transfer that sums van der Waals and polar interactions.

Table 6. Full models Methods Variables RMSEE R2 (%) Q2 (%) LSER(C8) MLR 5 0.2042 95.91 82.70 LSER(C12) MLR 5 0.1461 97.94 96.60 LSER(C18) MLR 5 0.1598 97.50 93.80 SPI PLS 3 0.2843 88.78 83.30 TMD a) PLS 36/5 b) 0.1598 96.68 86.20 a) Theoretical molecular descriptors, b) the model at 5 components.

In global LSER model, the retention factors for single column were analyzed each, and therefore, the system deviations from the model cannot be investigated simultaneously. The parameters E and S depended on polarizability and caused to covariance, as a result, it is possible to lead to grossly inflated uncertainties in global LSER coefficients (Vitha and Carr, 2006). In semi-empirical and theoretical QSRR models using PLSR approach, the retention factors for three columns can be analyzed simultaneously, and the system deviation can be investigated by model error. The first two components can capture the information from the retention factors, the first

47 component was strongly related to the McGowan molecular volume, and the second component described polarizability. The result was in accordance with that generated from global LSER. The models aid in the development of simple, rapid, and sensitive approaches for water quality assessment, and also provide a basis for future in the development of chromatographic methods.

48 Conclusions

The physicochemical properties of chemical compounds between different phases can be successfully described by the modeling of the structure information. The global PLS model for vapor pressure can accurately describe the variation of organic compounds in liquid or subcooled liquid with the help of theoretical molecular descriptors. However, the local model performs better with only a few latent variables. Our model (Paper I) can in a similar fashion be combined with the model for environmental persistence to assess the potential for a long-range atmospheric transport (Öberg, 2005) PFCs appeared as outliers in the global PLS model. In the next step, we therefore updated the previous PLS model to cover these compounds, through a weighted recalibration. The updated model could be readily interpreted and the predictive error for liquid vapor pressureof PFCs was within 0.2 log units of Pa. the updated PLS model was subsequently applied to estimate the vapor pressures of more than 200 PFCs where no reliable experimental data are available (Paper III). LSER and PLS models were developed for n-octanol/water partition coefficients (logKow) and water solubility (logSw) using Abraham parameters and theoretical molecular descriptors, respectively. The LSER showed a slightly better performance in logKow, but the descriptor part was generated from estimation or calculation through the equations. On the other hand, the PLS model with theoretical molecular descriptors exhibited better performance in logSw, SD cannot be lower than ~0.50 log unit in LSER containing compounds with widely different structures, and our results correspond to those of Abraham and Le (Abraham and Le, 1999). Not all the variation generated from the data of descriptors is orthogonal to each other, but we could successfully interpret the PLS model through a rotate or re- project approach (Paper II). In a collaborative study by CADASTER project partners, several QSPR models were developed for the melting and boiling points of PFCs. Different multivariate techniques was applied such as linear, nonlinear and fragment based approaches. These models are robust, predictive individually, and useful from different aspects. MLR is a simple statistical approach, where physicochemical properties are readily interpreted. PLS, on the other hand, overcomes collinearities and reduces model dimensionality, and sensitively measure applicability domain. Neural networks can model nonlinearities in the data very precisely (Paper IV). Global LSER, semi-empirical, and theoretical molecular QSRR model using PLS approaches were shown to successfully predict retention for new solvents in a RPLC system. The global LSER model was based on the assumption that all the parameters are completely relevant, however, MLR

49 cannot provide a diagnostic tool for validating this assumption. Further, the parameters E and S depended on polarizability and lead to covariance. Nevertheless, three columns can be investigated simultaneously; as a result, the systematic deviation can be diagnostic by model error (paper V). LSER encodes van der Waals and polar interactions with a limited number of descriptors for dispersive forces, hydrogen bonding, dipolarity/polarizability, and cavity formation. The LSER model directly exhibits which parameter generates a significant effect. As a result, interpretation becomes easier. However, as there are only approximately 4000 compounds with descriptors (Abraham, 1993a; Abraham, 1993b), it is restricted in the development of a general model. The PLS model using theoretical molecular descriptors does not have this limitation. In PLS models, the first two components can successfully capture main information for the physicochemical properties logPa, logKow, logSw, MP, and MP, as well as retention factor logk in RPLC. Also, previous research indicates that a direct chemical interpretation of the PLS components is possible (Öberg, 2007). The first component is strongly correlated to McGowan molecular volume, which is equivalent to the van der Waals volume. Hydrogen bond basicity and polarizability have been considered to be crucial factors and were described by the second component. The results showed that the free energy of transfer is mainly dominated by van der Waals and polar interactions, and the results are in accordance with those generated from LSER. The PLS models with theoretical molecular descriptors not only predict the chemical properties (Öberg and Liu, 2008; Liu and Öberg, 2009; Öberg and Liu, 2010), but also the chromatographic retention behavior (Liu et al, 2010). The PLS models for physicochemical properties and retention factors can be used to validate the experimental results. Moreover, modeling is a rapid, low-cost, and less time consuming alternative. Importantly, the PLS approach can tolerate randomly missing or collinear data, and it possesses a higher power of prediction for organic compounds without any synthesis, measurement, or sampling. A potential use and benefit of the models presented here is to predict environmentally relevant properties and thereby facilitate the development of environmentally friendly products. In this thesis, chemoinformatics has been used to investigate the behavior of several series of chemical substance of environmental significance. The models derived describing these substances should aid in the prediction of chemical behavior of these, and other as yet unstudied substances. Such prognostic tools should aid in the development of greener chemistry.

50 Acknowledgements

First of all I would like to express the greatest thanks to my supervisor, Tomas Öberg, for your constant guidance, support, and motivation both in my research and life, and I would also like to thank Christina Öberg for your constant encourage. Further, I would like to thank my co-supervisor Ian A. Nicholls for your valuable discussions, guidance, as well as barbecue in your pleasant summer house, of course, I am very grateful to Karina Adbo and your family.

Many thanks to Jesper Karlsson, for your preparation of instruments and chemicals in the laboratory, and also to Håkan Andersson, Siamak Shoravi and Annika Rosengren for pleasant discussions in the chemistry laboratory.

To the colleagues in the project CADASER of FP7, Barun Bhhatarai, Igor V. Tetko, Wolfram Teetz, Paola Gramatica, Nina Jeliazkova, Nikolay Kochev, Ognyan Pukalov, Simona Kovarich, many thanks for your pleasant collaboration and valuable discussion, as well as permission of our manuscript collected in my thesis.

I am grateful to Monika Filipsson, Edna Granéli, Roland Engkvist, Stina Alriksson, Catherine Legrand, Anna Augustsson, Margaretha Bjerker, Lena Pettersson, Pasi Peltola, Claes-Göran Alriksson and Anders Månsson for always providing help during my living and working in Kalmar.

Many thanks to all the colleagues and PhD students in the School of Natural Sciences, in particularly, the PhD students in the chemistry laboratory, for your creation of friendly and academic atmosphere.

Finally, I wish to thank my mother, an analytical , for encouraging my interest in chemistry, business, society and giving constant love, and I wish to thank my father and my sister Bo Liu, for persistently taking care of our family, as well as encouraging and supporting me to focus on my research and pursue my future.

51 References

Abraham, M.H. and McGowan, J.C. (1987) The Use of Characteristic Volumes to Measure Cavity Terms in Reversed Phase Liquid-Chromatography. Chromatographia, 23(4), 243-246. Abraham, M.H., Grellier, P.L., Prior, D.V., Duce, P.P., Morris, J.J., and Taylor, P.J. (1989) Hydrogen-Bonding. 7. A Scale of Solute Hydrogen-Bond Acidity Based on LogK-Value for Complexation in Tetrachloromethane. J. Chem. Soc.-Perkin Trans. 2, (6), 699-711. Abraham, M.H., Grellier, P.L., Prior, D.V., Morris, J.J., and Taylor, P.J. (1990) Hydrogen-Bonding. 10. A Scale of Solute Hydrogen-Bond Basicity Using LogK Values for Complexation in Tetrachloromethane. J. Chem. Soc.-Perkin Trans. 2, (4), 521-529. Abraham, M.H. (1993a) Scales of Solute Hydrogen-Bonding: Their Construction and Application to Physicochemical and Biochemical Processes. Chem. Soc. Rev., 22(2), 73- 83. Abraham, M.H. (1993b) Hydrogen-Bonding. 31. Construction of a Scale of Solute Effective or Summation Hydrogen-Bond Basicity. J. Phys. Org. Chem., 6(12), 660-684. Abraham, M.H., Chadha, H.S., Whiting, G.S., and Mitchell, R.C. (1994) Hydrogen Bonding. 32. An Analysis of Water-Octanol and Water-Alkane Partitioning and the Δlog P Parameter of Seiler. J. Pharm. Sci., 83(8), 1085-1100. Abraham, M.H., and Le, J. (1999) The Correlation and Prediction of the Solubility of Compounds in Water Using an Amended Solvation Energy Relationship. J. Pharm. Sci., 88(9), 868-880. Abraham, M.H., Ibrahim, A., and Zissimos, A.M. (2004) Determination of Sets of Solute Descriptors from Chromatographic Measurements. J. Chromatogr. A, 1037(1-2), 29-47. Anastas, P.T. and Warner, J.C. (1998) Green Chemistry: Theory and Practice. Oxford University Press Inc., New York, USA. Armitage, J.M., Macleod, M. and Cousins, I.T. (2009) Comparative Assessment of the Global Fate and Transport Pathways of Long-Chain Perfluorocarboxylic Acids (PFCAs) and Perfluorocarboxylates (PFCs) Emitted from Direct Sourses. Environ. Sci. Technol., 43(15), 5830-5836. Bhhatarai,B., Teetz, W., Liu, T., Öberg, T., Jeliazkova, N., Kochev, N., Pukalov, O., Tetko, I.V., Kovarich, S. and Gramatica, P. (2010) CADASTER QSPR Models for Predictions of Melting and Boiling Points of Perfluorinated Chemicals. Manuscript. Bereton, R.G. (2007) Applied Chemometrics for Scientists. John Wiley & Sons Ltd. England. Brown, F. (2005) Chemoinformatics - A Ten Year Update - Editorial Opinion. Curr. Opin. Drug Discov. Dev., 8(3), 298-302. Brown, F.K. (1998) Chapter 35. Chemoinformatics: What Is It and How Does It Impact Drug Discovery. Annual Reports in Medicinal Chemistry, 33, 375-384. Brown, N. (2009) Chemoinformatics-An Introduction for Computer Scientists. ACM Comput. Surv., 41(2), 8:1-8:38. Brown, R.J.C. and Brown, R.F.C. (2000) Melting Point and Molecular Symmetry. J. Chem. Educ., 77(6), 724-731.

52 Brown, V.J. (2003) REACHing Chemical Safety. Environ. Health Perspect. 111(14), A766-A769. Bunin, B.A., Siesel, B., Morales, G.A. and Bajorath, J. (2007) Chemoinformatics: Theory, Practice, & Products. Springer, The Netherlands. Carlson, R., Prochazka, M.P. and Lundstedt. T. (1988) Principal Properties for Synthetic Screening: Ketones and Aldehydes. Acta Chem. Scand. B, 42(3), 145-156 Capron, X., Walczak, B., de Noord, O.E. and Massart, D.L. (2005) Selection and Weighting of Samples in Multivariate Regression Model Updating. Chemomterics Intell. Lab. Syst., 76(2), 205-214. Chen, W.L. (2006) Chemoinformatics: Past, Present, and Future. J. Chem. Inf. Model, 46(6), 2230-2255. Clark, J.H. (1999) Green Chemistry: Challenges and Opportunities. Green Chem., 1(1), 1-8. Clark, J.H. (2006) Green Chemistry: Today (and Tomorrow). Green Chem., 8(1), 17-21. Consonni, V., Ballabio, D. and Todeschini, R. (2009) Comments on the Definition of the Q(2) Parameter for QSAR Validation. J. Chem. Inf. Model, 49(7), 1669-1678. Consonni, V., Todeschini, R. and Pavan, M. (2002) Structure/Response Correlations and Similarity/Diversity Analysis by GETAWAY Descriptors. 1. Theory of the Novel 3D Molecular Descriptors. J. Chem. Inf. Comput. Sci., 42(3), 682-692. Crum, Brown A. and Frazer, T. (1868) On the Connnection between Chemical Constitution and Physiological Action. Part 1. On the Physiological Action of the Salts of the Ammonia Bases Devived from Strycnia, Brucia, Thebaia, Codeia, Morphia, and Nicotinia. T. Roy. Soc. Edin., 25(1-53), 151-203. Dearden, J.C., Cronin, M.T.D., Zhao Y.H. and Raevsky, O.A. (2000) QSAR Studies of Compounds Acting by Polar and Non-Polar Narcosis: An Examination of the Role of Polarisability and Hydrogen Bonding. Quant. Struct.-Act. Relat., 19(1), 3-9. Dimitrov, S., Dimitrova, G., Pavlov, T., Dimitrova, N., Patlewicz, G., Niemela, J. and Mekenyan, O. (2005) A Stepwise Approach for Defining the Applicability Domain of SAR AND QSAR Models. J. Chem. Inf. Model., 45(4), 839-849. Dinglasan-Panlilio, M.J.A. and Mabury, S.A. (2006) Significant Residual Fluorinated Alcohols Present in Various Fluorinated Materials. Environ. Sci. Technol. 40(5), 1447- 1453. Dunn, W.J. (1989) Quantitative Structure-Activity Relationships (QSAR). Chemometrics Intell. Lab. Syst., 6(3), 181-190. Efron, B. (1979) 1977 Rietz Lecture - Bootstrap Methods - Another Look at the Jackknife. Ann. Stat., 7(1), 1-26. Eisenberg, D. and McLachlan, A.D. (1986) Solvation Energy in Protein Folding and Binding. Nature, 319(6050), 199-203. Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M. and Gramatica, P. (2003) Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environ. Health Perspect., 111(10), 1361-1375. Eriksson, L., Johansson, E., Kettaneh-Wold, N., Trygg, J., Wikström, C. and Wold, S. (2006) Multi- and Megavariate Data Analysis. Part 1. Basic Principles and Applications. Umetrics AB, Umeå, Sweden. Ettre, L.S. (1996) M.S. Tswett and the 1918 Nobel Prize in Chemistry. Chromatographia, 42(5-6), 343-351.

53 Ettre, L.S. and Sakodynskii, K.I. (1993) Tswett, M.S. and the Discovery of Chromatography. 2. Completion of the Development of Chromatography (1903-1910). Chromatographia, 35(5-6), 329-338. Fredenslund, A., Jones, R.L. and Prausnitz, J.M. (1975) Group-Contribution Estimation of Activity Coefficients in Nonideal Liquid Mixtures. Aiche J., 21(6), 1086-1099. Gajewski, R.P.,Thompson, G.D. and Chio, E.H. (1988) Perfluorinated Alkyl Carboxanilides - A New Class of Soil Insecticide. J. Agr. Food Chem. 36(1), 174-177. Gasteiger, J. and Engel, T. (2003) Chemoinformatics: A Textbook. WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany. Gasteiger, J. (2006) Chemoinformatics: A New Field with a Long Tradition. Anal. Bioanal. Chem., 384(1), 57-64. Gasteiger, J. (2006) The Central Role of Chemoinformatics. Chemometrics Intell. Lab. Syst., 82(1-2), 200-209. Ghose, A.K. and Crippen, G.M. (1987) Atomic Physicochemical Parameters for Three Dimensional-Structure-Directed Quatitative Structure-Activity Relationships. 2. Modeling Dispersive and Hydrophobic Interactions. J. Chem. Inf. Comput. Sci., 27(1), 21-35. Giesy, J.P. and Kannan, K. (2002) Perfluorochemical Surfactants in the Environment. Environ. Sci. Technol., 36(7), 146A-152A. Gramatica, P. (2007) Principles of QSAR Models Validation: Internal and External. QSAR Comb. Sci., 26(5), 694-701. Gramatica, P., Pilutti, P. and Papa, E. (2004) Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into Training-Test Sets and Consensus Modeling. J. Chem. Inf. Comput. Sci., 44(5), 1794-1802. Guruge, K.S., Yeung, L.W.Y., Yamanaka, N., Miyazaki, S., Lam, P.K.S., Giesy, J.P., Jones, P.D. and Yamashita, N. (2006) Gene Expression Profiles in Rat Liver Treated with Perfluorooctanoic Acid (PFOA). Toxicol. Sci., 89(1), 93-107. Hagen, D.F., Belisle, J., Johnson, J.D. and Venkateswarlu, P. (1981) Characterization of Fluorinated Metabolites by a Gas Chromatographic-Helium Microwave Plasma Detector-the Biotransformation of 1H,1H,2H,2H-Perfluorodecanol to Perfluorooctanoate. Anal. Biochem., 118(2) 336-343. Hall, L.H. and Kier, L.B. (1995) Electrotopological State Indices for Atoms Types: A Novel Combination of Electronic, Topological, and Valence State Information. J. Chem. Inf. Comput. Sci., 35(6), 1039-1045. Hammett, L.P. (1935) Some Relations between Reaction Rates and Equilibrium Constants. Chem. Rev., 17(1), 125-136. Hammett, L.P. (1937) The Effect of Structure Upon the Reactions of Organic Compounds. Benzene Derivatives. J. Am. Chem. Soc., 59, 96-103. Hann, M. and Green, R. (1999) Chemoinformatics - a New Name for an Old Problem? Curr. Opin. Chem. Biol., 3(4), 379-383. Hansch, C., Maloney, P.P., Fujita, T. and Muir, R.M. (1962) Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients. Nature, 194(4824), 178-180. Hansch, C. (1969) A Quantitative Approach to Biochemical Structure-Activity Relationships. Accounts Chem. Res., 2(8), 232-239. Hansch, C. and Leo, A. (1979) Substituent Constants for Correlation Analysis in Chemistry and Biology. Wiley-Interscience, New York, USA.

54 Hansen, H.K., Rasmussen, P., Fredenslund, A., Schiller, M. and Gmehling, J. (1991) Vapor-Liquid Equilibria by UNIFAC Group Contribution. 5. Revision and Extension. Ind. Eng. Chem. Res., 30(10), 2352-2355. Hendricks, J.O. (1953) Industrial Fluorochemicals. Ind. Eng. Chem., 45(1), 99-105. Hinckley, D.A., Bidleman, T.F., Foreman, W.T. and Tuschall, J.R. (1990) Determination of Vapor Pressure for Nonpolar and Semipolar Organic Compounds from Gas Chromatographic Retention Data. J. Chem. Eng. Data, 35(3), 232-237. Horton, B. (1999) Green Chemistry Puts Down Roots. Nature, 400(6746): 797-799. Hoskuldsson, A. (2001) Causal and Path Modelling. Chemometrics Intell. Lab. Syst., 58(2), 287-311. Hotelling, H. (1931) The Generation of Student's Ratio. The Annals of Mathematical Statistics. 2(3), 360-378. Jandera, P., Colin, H. and Gulochon, G. (1982) Interaction Indexes for Prediction of Retention in Reversed-Phase Liquid Chromatography. Anal. Chem., 54(3), 435-441. Kaiser, M., Cobranchi, D., Kao, C., Krusic, P., Marchione, A. and Buck, R. (2004) Physicochemical Properties of 8-2 Fluorinated Telomer B Alcohol. J. Chem. Eng. Data, 49(4), 912-916. Kaiser, M., Larsen, B., Kao, C. and Buck, R. (2005) Vapor Pressures of Perfluorooctanoic, -Nonanoic, -Decanoic, -Undecanoic, and -Dodecanoic Acids. J. Chem. Eng. Data, 50(6), 1841-1843. Kaliszan, R. and Koks, H. (1977). Relationship between RM Values and Connectivity Indexes for Pyrazine Carbothioamide Derivatives. Chromatographia, 10(7), 346-349. Kaliszan, R. (1977) Correlation between Retention Indexes and Connectivity Indexes of Alcohols and Methyl-Esters with Complex Cyclic Structure. Chromatographia, 10(9), 529-531. Kaliszan, R. (1992) Quatitative Structure-Retention Relationships. Anal. Chem., 64(11), 619-631. Kamlet, M.J. and Taft, R.W. (1976) The Solvatochromic Comparison Method. 1. The β- Scale of Solvent Hydrogen-Bond Acceptor (HBA) Basicities. J. Am. Chem. Soc., 98(2), 377-383. Kamlet, M.J., Abboud, J.L., and Taft, R.W. (1977) The Solvatochromic Comparison Method. 6. The π* Scale of Solvent Polarities. J. Am. Chem. Soc., 99(18), 6027-6038. Kamlet, M.J. and Taft, R.W. (1985) Linear Solvation Energy Relationships. Local Empirical Rules-or Fundamental Laws of Chemistry? A Reply to the Chemometricians. Acta Chem. Scand., B, Org. Chem. Biochem., 39(8), 611-628. Kamlet, M.J., Doherty, R.M., Famini, G.R. and Taft, R.W. (1987) Linear Solvation Energy Relationships. Local Empirical Rules or Fundamental Laws of Chemistry? The Dialog Continues. A Challenge to the Chemometricians. Acta Chem. Scand., B, Org. Chem. Biochem., 41(8), 589-598. Kan, A.T. and Tomson, M.B. (1996) UNIFAC Prediction of Aqueous and Nonaqueous Solubilities of Chemicals with Environmental Interest. Environ. Sci. Technol., 30(4), 1369-1376. Kannan, K., Newsted, J., Halbrook, R.S. and Giesy, J.P. (2002) Perfluorooctanesulfonate and Related Fluorinated Hydrocarbons in Mink and River Otters from the United States. Environ. Sci. Technol., 36(12), 2566-2571. Karelson, M. (2000) Molecular Descriptors in QSAR/QSPR. John Wiley & Sons, Inc.

55 Katritzky, A.R., Fara, D.C., Kuanar, M., Hur, E. and Karelson, M. (2005) The Classification of Solvents by Combining Classical QSPR Methodology with Principal Component Analysis. J. Phys. Chem. A, 109(45), 10323-10341. Kettaneh, N., Berglund, A., and Wold, S. (2005) PCA and PLS with Very Large Data Sets. Computational statistics & Data Analysis, 48(1), 69-85. Kohonen, T. (1982) Self-Organized Formation of Topologically Correct Feature Maps. Biol. Cybern., 43(1), 59-69. Krusic, P., Marchione, A., Davidson, F., Kaiser, M., Kao, C., Richardson, R., Botelho, M, Waterland, R. and Buck, R. (2005) Vapor Pressure and Intramolecular Hydrogen Bonding in Fluorotelomer Alcohols. J. Phys. Chem. A, 109(28), 6232-6241. Kurup, A., Garg, R., Carini, D.J. and Hansch C. (2001) Comparative QSAR: Angiotensin II Antagonists. Chem. Rev., 101(9), 2727-2750. Kvalheim, O.M. (1988) A Partial-Least-Squares Approach to Interpretative Analysis of Multivariate Data. Chemom. Intell. Lab. Syst., 3(3), 189-197. Leardi, R., Boggia, R and Terrile, M. (1992) Genetic Algorithms As a Strategy for Feature-Selection. J. Chemometr. 6(5), 267-281. Lemal, D.M. (2004) Perspective on Fluorocarbon Chemistry. J. Org. Chem., 69(1), 1-11. Leach, A.R. and Gillet V.J. (2003) An Introduction to Chemoinformatics. Springer, The Netherlands. Lindberg, W., Persson J.A. and Wold, S. (1983) Partial Least-Squares Method for Spectrofluorimetric Analysis of Mixtures of Humic Acid and Ligninsulfonate. Anal. Chem., 55(4), 643-648. Liu, C.S., Yu, L., Deng, J., Lam. P.K.S., Wu., R.S.S. and Zhou, B.S. (2009) Waterborne Exposure to Fluorotelomer Alcohol 6:2 FTOH Alters Plasma Sex Hormone and Gene Transcription in the Hypothalamic-Pituitary-Gonadal (HPG) Axis of Zebrafish. Aquat. Toxicol., 93(2-3), 131-137. Liu, T. and Öberg, T. (2009) Modeling of Partition Constants: Linear Solvation Energy Relationships or PLS Regression? J. Chemometr., 23(5-6), 254-262. Liu, T., Nicholls, I.A. and Öberg, T. (2010) Comparison of Theoretical and Experimental Models for Charactrizing Solvent Properties Using Reversed Phase Liquid Chromatography (RPLC). Submitted. Mackay, D., Shiu, W.Y., Ma, K.C. and Lee, S.C. (2006) Physico-Chemical Properties and Environmental Fate for Organic Chemicals, 2nd. CRC Press, Boca Raton, FL, USA. Martens, H., Hoy, M., Westad, F., Folkenberg, D. and Martens. M. (2001) Analysis of Designed Experiments by Stabilised PLS Regression and Jack-knifing. Chemometr. Intell. Lab. Syst. 58(2), 151-170. Martens, H. and Næs, T. (1996) Multivariate Calibration. John Wiley & Sons Ltd. Martin, J.W., Mabury, S.A. and O’Brien, P.J. (2005) Metabolic Products and Pathways of Fluorotelomer Alcohols in Isolated Rat Hepatocytes. Chem. Biol. Interact. 155(3), 165- 180. Meylan, W.M. and Howard, P.H. (1995) Atom Fragment Contribution Method for Estimating Octanol-Water Partion-Coefficients. J. Pharm. Sci., 84(1), 83-92. Mitchell, B.E. and Jurs, P.C. (1998) Prediction of Infinite Dilution Activity Coefficients of Organic Compounds in Aqueous Solution from Molecular Structure. J. Chem. Inf. Comput. Sci., 38(2), 200-209.

56 Moody, C.A. and Field, J.A. (2000) Perfluorinated Surfactants and the Environmental Implications of Their Use in Fire-Fighting Foams. Environ. Sci. Technol. 34(18), 3864- 3870. Müller, M. and Klein, W. (1993) Fugacity Calculations Using Estimated Physicochemical Properties. SAR QSAR Environ. Res., 1(2-3), 245-262. Myrdal, P., Ward, G.H., Simamora, P. and Yalkowsky, S.H. (1993) AQUAFAC: Aqueous Functional Group Activity Coefficients. SAR and QSAR in Environmental Research, 1(1), 53-61. Netzeva, T.I., Worth, A., Aldenberg, T., Benigni, R., Cronin, M.T., Gramatica, P., Jaworska, J.S., Kahn, S., Klopman, G., Marchant, C.A., Myatt, G., Nikolova- Jeliazkova, N., Patlewicz, G.Y., Perkins, R., Roberts, D., Schultz, T., Stanton, D.W., van de Sandt, J.J., Tong, W., Veith, G., and Yang, C. (2005) Current Status of Methods for Defining The Applicability Domain of (Quantitative) Structure-Activity Relationships. The Report and Recommendations of ECVAM Workshop 52. Altern. Lab. Anim., 33(2), 155-173. Öberg, T. (2004) A QSAR for Baseline Toxicity: Validation, Domain of Application, and Prediction. Chem. Res. Toxicol., 17(12), 1630-1637. Öberg, T. (2005) A QSAR for the Hydroxyl Radical Reaction Rate Constant: Validation, Domain of Application, and Prediction. Atmos. Environ., 39(12), 2189-2200. Öberg, T. (2007) A General Structure-Property Relationship to Predict the Enthalpy of Vaporization at Ambient Temperatures. SAR QSAR Environ. Res., 18(1-2), 127-139. Öberg, T. and Liu, T. (2008) Global and Local PLS Regression Models to Predict Vapor Pressure. QSAR Comb. Sci., 27(3), 273-279. Öberg, T. and Liu, T. (2010) Extension of a Prediction Model to Estimate Vapor Pressure of Perfluorinated Compounds (PFCs). Submitted. Olsen, G.W., Church, T.R., Miller, J.P., Burris, J.M., Hansen, K.J., Lundberg, J.K., Armitage, J.B., Herron, R.M., Medhdizadehkashi, Z., Nobiletti, J.B., O'Neill, E.M., Mandel, J.H. and Zobel, L.R. (2003) Perfluorooctanesulfonate and Other Fluorochemicals in the Serum of American Red Cross Adult Blood Donors. Environ. Health Perspect., 111(16), 1892-1901. Papa, E., Villa, F. and Gramatica. P. (2005) Statistically Validated QSARs, Based on Theoretical Descriptors, for Modeling Aquatic Toxicity of Organic Chemicals in Pimephales Promelas (Fathead Minnow). J. Chem. Inf. Model, 45(5), 1256-1266. Pearson, K. (1901) On Lines and Planes of Closest Fit to System of Points in Space. Philos. Mag., 2(6), 559-572. Pinal, R. (2004) Effect of Molecular Symmetry on Melting Temperature and Solubility. Org. Biomol. Chem., 2(18), 2692-2699. Platonov, V.E., Haas, A., Schelvis, M., Lieb, M., Dvornikova, K,V., Osina. O.I. and Gatilov, Y.V. (2001) Polyfluorinated Arylnitramines. J. Fluorine Chem. 109(2), 131-139. Poole, C.F. (2003) The Essence of Chromatography. Elsevier Science B.V. The Netherlands. Poole, C.F., Atapattu, S.N., Poole, S.K. and Bell, A.K. (2009) Determination of Solute Descriptors by Chromatographic Methods. Anal. Chim. Acta., 652(1-2), 32-53. Quarry, M.A., Grob, R.L. and Snyder, L.R. (1986) Prediction of Precise Isocratic Retention Data from Two or More Gradient Elution Runs. Analysis of Some Associated Errors. Anal. Chem., 58(4), 907-917. Quinn, G.P. and Keough, M.J. (2002) Experimental Design and Data Analysis for Biologists. Cambridge University Press, New York, USA.

57 Randić, M. (2001) The Connectivity Index 25 Years After. J. Mol. Graph., 20(1): 19-35. Schofield, H., Wiggins, G. and Willett, P. (2001) Recent Developments in Chemoinformatics Education. Drug Discov. Today, 6(18), 931-934. Schneider, G., Wrede, P. (1998) Artificial Neural Networks for Computer-Based Molecular Design. Prog. Biophys. Mol. Bio., 70(3), 175-222. Schwarzenbach, R.P., Gschwend, P., W. and Imboden, D.M. (2003) Environmental Organic Chemistry. Wiley-Interscience, John Wiley & Sons, Inc., New Jersey, USA. Shi, L.M., Fang, H., Tong, W.D., Wu, J., Perkins, R., Blair, R.M., Branham, W.S., Dial, S.L., Moland, C.I. and Sheehan, D.M. (2001) QSAR Models Using A Large Diverse Set of Estrogens. J. Chem. Inf. Comput. Sci., 41(1), 186-195. Steele, W., Chirico, R., Knipmeyer, S. and Nguyen, A. (2002a) Measurements of Vapor Pressure, Heat Capacity, and Density along the Saturation Line for Cyclopropane Carboxylic Acid, N,N-Diethylethanolamine, 2,3-Dihydrofuran, 5-Hexen-2-One, Perfluorobutanoic Acid, and 2-Phenylpropionaldehyde. J. Chem. Eng. Data, 47(4), 715-724. Steele, W., Chirico, R., Knipmeyer, S. and Nguyen, A. (2002b) Vapor Pressure, Heat Capacity and Density along the Saturation Line: Measurements for Benzenamine, Butylbenzene, Sec-Butylbenzene, Tert-Butylbenzene, 2,2-Dimethylbutanoic Acid, Tridecafluoroheptanoic Acid, 2-Butyl-2-Ethyl-1,3-Propanediol, 2,2,4-Trimethyl-1,3- Pentanediol, and 1-Chloro-2-Propanol. J. Chem. Eng. Data, 47(4), 648-666. Stork, C.L. and Kowalski, B.R. (1999) Weighting Schemes for Updating Regression Models-a Theoretical Approach. Chemometrics Intell. Lab. Syst., 48(2), 151-166. Syder, L.R., Kirkland, J.J. and Glajch, J.l. (1997) Practice HPLC Method Development. Second Edition. John Wiley & Sons, INC. New York. Taft, R.W. (1952) Polar and Steric Substituent Constants for Aliphatic and o-Benzoate Groups from Rates of Esterification and Hydrolysis of Esters. J. Am. Chem. Soc., 74(12), 3120-3128. Tetko, I.V. (2002a) Associate Neural Network. Neural Process. Lett., 16(2), 187-199. Tetko, I.V. (2002b) Neural Network Studies. 4. Introduction to Associative Neural Networks. J. Chem. Inf. Sci., 42(3), 717-728. Tetko, I.V., Bruneau, P., Mewes, H.W., Rohrer, D.C., Poda, G.I. (2006) Can We Estimate the Accuracy of ADME-Tox Predictions? Drug Discov. Today, 11(15-16), 700-707. Tetko, I.V., Sushko, I., Pandey, A.K., Zhu, H., Tropsha, A., Papa, E., Öberg, T., Todeschini, R., Fourches, D. and Varnek, A. (2008) Critical Assessment of QSAR Models of Environmental Toxicity against Tetrahymena Pyriformis: Focusing on Applicability Domain and Overfitting by Variable Selection. J. Chem. Inf. Model, 48(9), 1733-1746. Taft, R.W. and Kamlet, M.J. (1976) The Solvatochromic Comparison Method. 2. The α- Scale of Solvent Hydrogen-Bond Donor (HBD) Acidities J. Am. Chem. Soc., 98(10), 2886-2894. Todeschini, R., Bettiol, C., Giurin, G., Gramatica, P., Miana, P. and Argese E. (1996) Modeling and Prediction by Using WHIM Descriptors in QSAR Studies: Submitochondrial Particles (SMP) as Toxicity Biosensors of Chlorophenols. Chemosphere, 33(1): 71-79. Todeschini, R. and Gramatica, P. (1997) 3D-modelling and Prediction by WHIM Descriptors. Part 5. Theory Development and Chemical Meaning of WHIM Descriptors. Quant. Struct.-Act Relat., 16(2), 113-119.

58 Topliss, J.G. and Edwards, R.P. (1979) Chance Factors in Studies of Quantitative Structure-Activity Relationships. J. Med. Chem., 22(10), 1238-1244. Trepka, R.D., Harrington, J.K., McConville, J.W., McGurran, K.T., Mendel, A., Pauly, D.R., Robertson, J.E. and Waddington, J.T. (1974) Synthesis and Herbicidal Activity of Fluorinated N-Phenylalkanesulfonamides. J. Agr. Food Chem., 22(6), 1111-1119. Tropsha, A., Gramatica, P., and Gombar, V.K. (2003) The Importance of Being Earnest: Validation Is the Absolute Essential for Successful Application and Interpretation of QSPR Models. QSAR Comb. Sci., 22(1), 69-77. Valkó, K. (2004) Application of High-Performance Liquid Chromatography Based Measurements of Lipophilicity to Model Biological Distribution. J. Chromatogr. A, 1037(1-2), 299-310. Wang, A.S. and Carr, P.W. (2002) Comparative Study of the Linear Solvation Energy Relationship, Linear Solvent Strength Theory, and Typical-Conditions Model for Retention Prediction in Reversed-Phase Liquid Chromatography. J. Chromatogr. A, 965(1-2), 3-23. Wang, A.S., Tan, L.C. and Carr. P.W. (1999) Global Linear Solvation Energy Relationships for Retention Prediction in Reversed-Phase Liquid Chromatigraphy. J. Chromatogr. A, 848(1-2), 21-37. Willett, P. (2008) A Bibliometric Analysis of the Literature of Chemoinformatics. Aslib Proceedings: New Information Perspectives, 60(1), 4-17. Viswanadhan, V.N., Ghose, A.K., Revankar, G.R. and Robins, R.K. (1989) Atomic Physicochemical Parameters for Three Dimensional Structure Directed Quatitative Structure-Activity Relationships. 4. Additional Parameters for Hydrophobic and Dispersive Interactions and Their Application for an Automated Superosition of Certain Naturally Occurring Nucleoside Antibiotics. J. Chem. Inf. Comput. Sci., 29(3), 163-172. Vitha, M. and Carr, P.W. (2006) The Chemical Interpretation and Practice of Linear Solvation Energy Relationships in Chromatography. J. Chromatogr. A, 1126(1-2), 143- 194. Wold, S. (1974) A Theoretical Foundation of Extrathermodynamic Relationships (Linear Free Energy Relationships). Chem. Scripta, 5(3), 97-106. Wold, S. (1978) Cross-Validatory Estimation of Number of Components in Factor and Principal Components Models. Technometrics, 20(4), 397-405. Wold, S. and Sjöström, M. (1986) Linear Free Energy Relationships. Local Empirical Rules - or Fundamental Laws of Chemistry. A Reply to Kamlet and Taft. Acta Chem. Scand., B, Org. Chem. Biochem., 40(4), 270-277. Wold, S., Geladi, P., Esbensen, K. and Öhman, J. (1987) Multi-Way Principal Components-and PLS-Analysis. J. Chemometrics, 1(1), 41-56. Wold, S. (1993) Discussion: PLS in Chemical Practice. Technometrics, 35(2), 136-139. Wold, S. (1995) Chemometrics; What Do We Mean with It, and What Do We Want from It? Chemometrics Intell. Lab. Syst., 30(1), 109-115. Wold, S. and Sjöström, M. (1998) Chemometrics and Its Roots in Physical Organic Chemistry. Acta Chem. Scand., 52(5), 517-523. Wold, S. (2001) Personal Memories of the Early PLS Development. Chemometrics Intell. Lab. Syst., 58(2), 83-84. Wold, S., Sjöström, M. and Eriksson, L. (2001) PLS-Regression: A Basic Tool of Chemometrics. Chemometrics Intell. Lab. Syst., 58(2), 109-130.

59 Woodhouse, E.J. and Breyman, S. (2005) Green Chemistry as Social Movement? Sci. Technol. Hum. Values, 30(2), 199-222. Zhao, Y.H., Abraham, M.H. and Zissimos, A.M. (2003) Determination of McGowan Volumes for Ions and Correlation with van der Waals Volumes. J. Chem. Inf. Comput. Sci., 43(6), 1848-1854. Zhu, H., Tropsha, A., Fourches, D., Varnek, A., Papa, E., Gramatica, P., Öberg, T., Dao, P., Cherkasov, A. and Tetko, I.V. (2008) Combinatorial QSAR Modeling of Chemical Toxicants Tested against Tetrahymena puriformis. J. Chem. Inf. Model, 48(4), 766-784. Zupan, J. and Gasteiger, J. (1993) Neural Networks for . VCH, Weinheim, Germany.

60