Chemoinformatics for Green Chemistry

Contents LIST OF PAPERS .................................................................................... 3 LISTS OF ACRONYMS AND ABBREVIATIONS ............................... 4 INTRODUCTION .................................................................................. 7 BACKGROUND .............................................................................................7 AIMS OF THE THESIS ..................................................................................9 METHODOLOGY ............................................................................... 11 QUANTITATIVE STRUCTURE-PROPERTY RELATIONSHIPS (QSPR)......11 Linear Free Energy Relationships (LFER)...............................................11 Linear Solvation Energy Relationship (LSER).........................................12 Reversed-Phase Liquid Chromatography (RPLC).....................................14 Quantitative Structure-Retention Relationships (QSRR) .........................14 Global Linear Solvation Energy Relationship (LSER)..............................15 Molecular Descriptors ...............................................................................16 The Definition of the Applicability Domain ..............................................17 The Distance-Based Approach Using PLS.......................................17 The Williams Plot .............................................................................18 Associative Neural Networks (ASNN)..............................................19 DATA ANALYSIS ........................................................................................20 Multiple Linear Regression (MLR)..........................................................20 Principal Component Analysis (PCA) .......................................................20 Partial Least Squares Projection to Latent Structures (PLS) ......................21 Neural Networks......................................................................................24 PHYSICOCHEMICAL PROPERTIES................................................ 27 VAPOR PRESSURE (P) ................................................................................27 PARTITIONING COEFFICIENT N-OCTANOL/WATER (KOW) ...................27 WATER SOLUBILITY (SW) ..........................................................................29 MELTING POINT .......................................................................................30 BOILING POINT .........................................................................................30 PERFLUORINATED COMPOUNDS .............................................................31 RESULTS AND DISCUSSION............................................................. 33 PREDICTION OF VAPOR PRESSURE ..........................................................33 COMPARISON OF MODELING OF ENVIRONMENTAL PROPERTIES BETWEEN LSER AND PLS........................................................................34 Prediction of Partitioning Coefficient (Kow)...............................................35 Prediction of Water Solubility (Sw)............................................................37 Comparison of Models - LSER and PLS ..................................................38 1 EXTENSION OF A PREDICTIVE MODEL FOR VAPOR PRESSURE.............39 QSPR MODELS OF PERFLUORINATED CHEMICALS (PFCS).................40 QSPR Models with Melting Point ...........................................................41 QSPR Models with Boiling Point ............................................................42 Applicability Domain...............................................................................43 QSRR MODELS .........................................................................................44 Global LSER Models ...............................................................................45 QSRR Models Using Semi-Empirical Descriptors.....................................46 QSRR Models Using Theoretical Molecular Descriptors ............................46 CONCLUSIONS.................................................................................... 49 ACKNOWLEDGEMENTS .................................................................. 51 REFERENCES....................................................................................... 52 2 List of Papers This thesis is based on the following papers, referred to in the text by roman numerals. I. Öberg, T., Liu, T. (2008): Global and local PLS model to predict vapor pressure. QSAR & Combinatorial Science, 27(3), 273-279. II. Liu, T., Öberg, T. (2009): Modeling of partition constants: linear solvation energy relationships or PLS regression? Journal of Chemometrics, 23(5-6), 254-262. III. Öberg, T., Liu, T. (2010): Extension of a prediction model to estimate vapor pressure of perfluorinated compounds (PFCs). Submitted to Chemometrics and Intelligent Laboratory Systems. IV. Bhhatarai, B,. Teetz. W., Liu, T., Öberg, T., Jeliazkova. N., Kochev, N., Pukalov, O., Tetko, I.V., Kovarich, S., Gramatica, P. (2010): CADASTER QSPR models for predictions of melting and boiling points of perfluorinated chemicals. Manuscript. V. Liu, T., Nicholls, I.A., Öberg, T. (2010): Comparison between theoretical and experimental models for characterizing solvent properties using reversed-phase liquid chromatography (RPLC). Submitted to Journal of Chromatography A. Paper I and paper II are reprinted with kind permissions of Wiley-VCH Verlag GmbH & Co. KGaA. and John Wiley & Sons Ltd., respectively. 3 Lists of Acronyms and Abbreviations AD Applicability domain ADME Absorption, distribution, metabolism and excretion AFC Atom/fragment contribution ASNN Associative neural network BP Boiling point CADASTER CAse studies on the Development and Application of in-Silico Techniques for Environmental hazard and Risk assessment CV Cross-validation ECHA European Chemicals Agency EINECS European Inventory of Existing Commercial Chemical Substances EPA Environmental Protection Agency EU European Union FTOH Fluorotelomer alcohols GA Genetic algorithm GC Gas chromatography GETAWAY Geometry, topology, and atom-weights assembly GLC Gas-liquid chromatography HBA Hydrogen bond acceptor HBD Hydrogen bond donor HPLC High-performance liquid chromatography HTS High-throughput screening Kow n-Octanol/water partition coefficient LFER Linear free energy relationship LSER Linear solvation energy relationship LSST Linear solvent strength theory MLR Multiple linear regression MP Melting point NIPALS Nonlinear iterative partial least squares OECD Organisation for Economic Co-operation and Development OLS Ordinary-least-squares Pa Vapor pressure PL Liquid vapor pressure PS Solid vapor pressure PAHs Polycyclic aromatic hydrocarbons PCA Principal component analysis PCBs Polychlorinated biphenyls PFAs Perfluorinated acids 4 PFAS Perfluoroalkyl sulfonates PFCA Perfluorinated carboxylic acids PFCs Perfluorinated compounds PFOA Perfluorooctanoate PLS Partial least squares projection to latent structures PRESS Predictive residual sum of square QSAR Quantitative structure-activity relationship QSPR Quantitative structure-property relationship QSRR Quantitative structure-retention relationship R Gas constant REACH Registration, evaluation, authorization and restriction of chemicals RMSEC Root mean square error of calibration RMSEE Root mean square error of estimation RMSEP Root mean square error of prediction RPLC Reversed-phase liquid chromatography RSD Relative standard deviation RSS Residual sum of square SD Standard deviation SMILES Simplified molecular input line entry system SNN Supervised neural networks SOM Self organizing maps SPI Snyder polarity index SS Sum of the square Sw Water solubility T Absolute temperature UNIFAC Universal functional activity coefficient UNN Unsupervised neural networks UV/VIS Ultraviolet/visible VIP Variable importance in the projection WHIM Weighted holistic invariant molecular 5 6 Introduction Background Chemistry is creative and productive, generating the substances of high value from inexpesive raw materials. Chemistry and chemical industries help provide comfortable and modern living conditions. At the same time, many chemicals partition to the air, soil and water environments during their manufacture, transport and application. Today, approximately 143,000 substances are pre- registered by the European Chemicals Agency (ECHA), and 100,204 substances are listed in the European Inventory of Existing Commercial Chemical Substances (EINECS), 80,000 of which are thought to be used in the European Union (EU) (Brown, 2003). Thousands of chemicals currently used in the EU have not undergone risk assessment for their effect on the environment or on human health. It is therefore necessary to develop rapid and reliable methods of predicting environmental characteristics; in this thesis, we attempt to predict physicochemical properties based on the principles of chemoinformatics. With an explosion in the amount of data generated from combinational chemistry and high-throughput screening (HTS) in drug design, together with the rapid progression in computing technology, scientists have applied computing skills and knowledge of chemistry to the data analysis and accelerated the development of chemoinformatics (Schofield et al. 2001). The term bioinformatics was first presented at the University of Utrecht in the Netherlands in 1976 (Chen, 2006), and it rapidly became a popular term in the middle of the 1990s with a number of publications that provided excellent overviews of the fields. At the

Chemoinformatics for Green Chemistry

Semantic Web Integration of Cheminformatics Resources with the SADI Framework Leonid L Chepelev1* and Michel Dumontier1,2,3

Joelib Tutorial

Cheminformatics and Chemical Information

Committee on Publications and Cheminformatics Data Standards (CPCDS)

Scalable Analysis of Large Datasets in Life Sciences

Cheminformatics for Genome-Scale Metabolic Reconstructions

Print Special Issue Flyer

Trust, but Verify: on the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research

The Literature of Chemoinformatics: 1978–2018

Cheminformatics-Aided Pharmacovigilance: Application To

The Tonnabytes Big Data Challenge

Virtual Reality in Drug Discovery