QSAR Study of Aquatic Toxicity by Chemometrics Methods in the Framework of REACH Regulation

UNIVERSITY OF MILANO – BICOCCA
DEPARTMENT OF EARTH AND ENVIRONMENTAL SCIENCES
Doctoral Degree Course in Chemical Sciences
Cycle XXVII

Ph.D. Thesis

QSAR study of aquatic toxicity by chemometrics methods in the framework of REACH regulation

Matteo Cassotti

Tutor: Prof. Roberto Todeschini
Co-Tutor: Dr. Viviana Consonni
          Dr. Davide Ballabio

Academic year: 2014-2015

Cover illustration: ‘water-art-wallpaper-5’ from: https://newevolutiondesigns.com

Acknowledgements

Above all, I would like to thank my supervisor, Prof. Roberto Todeschini, for giving me the opportunity to undertake this Ph.D. project, for always being willing to teach me chemometrics, and for involving me in other interesting activities.

Special thanks go to Dr. Davide Ballabio, who supported and revised my work, taught me how to use MATLAB and asked me to help with teaching.

I am grateful to Dr. Viviana Consonni and Dr. Andrea Mauri, especially for their guidance and help with molecular descriptors and chemoinformatics.

I acknowledge Dr. Igor Tetko, who allowed me to begin this work within the ECO project, which was a positive international experience.

I wish to thank Dr. Eva Bay Wedebye and Dr. Nikolai Georgiev Nikolov for the good discussions and for their help in preparing the data.

I am thankful to Prof. Rasmus Bro for suggesting new chemometrics methods and ways (free of charge) to learn Danish.

I would like to thank all the other people whom I met in the lab during these years and who contributed both professionally and socially: Kamel, Faizan, Kai, Pantelis, Valentina, Ioana, Stefan, Alberto, Francesca, Eva, Svava, Rikke, Sine, Marianne, Monika.

Finally, I am thankful to Tine for the scientific discussions, emotional support, networking and for believing in me.

To those who believe in me more than I do

Contents

Abbreviations
Preface
Introduction
  1.1 Aquatic toxicity
  1.2 REACH regulation
  1.3 QSAR: background and role in the regulatory context
  1.4 State of the art of QSAR in aquatic toxicity
    1.4.1 QSAR models for Daphnia magna
    1.4.2 QSAR models for Pimephales promelas
  1.5 Is there need for new models?
Data
  2.1 Acute lethal toxicity tests
  2.2 Data quality
  2.3 Daphnia magna dataset
    2.3.1 Data for model development
    2.3.2 Additional data for model validation and extension
  2.4 Pimephales promelas dataset
Methods
  3.1 Description of molecular structure
    3.1.1 Molecular format: SMILES notation
    3.1.2 Molecular descriptors
    3.1.3 Binary fingerprints
  3.2 Selection of molecular descriptors
    3.2.1 Unsupervised variable reduction
    3.2.2 Supervised variable selection
      3.2.2.1 Genetic algorithms
      3.2.2.2 Reshaped sequential replacement
  3.3 Regression methods
    3.3.1 Multiple linear regression
      3.3.1.1 Ordinary least squares regression
      3.3.1.2 Partial least squares regression
    3.3.2 K-nearest neighbours
      3.3.2.1 Distance measures
    3.3.3 Support vector regression
    3.3.4 Gaussian process regression
  3.4 Consensus modelling
  3.5 Applicability domain
  3.6 Model validation
  3.7 Statistical parameters for regression diagnostic
  3.8 Analysis of data structure: principal component analysis
  3.9 Software
    3.9.1 Leadscope Enterprise™
Results on Daphnia magna
  4.1 Explorative analysis
  4.2 Calculation of molecular descriptors and data setup
  4.3 Descriptor selection and model calibration
  4.4 Summary of results
  4.5 Discussion of the kNN model
    4.5.1 Analysis of the residuals and neighbourhood behaviour
    4.5.2 Correlation between model descriptors and toxicity
  4.6 Additional external validation and extension of the kNN model
    4.6.1 Analysis of neighbourhood behaviour
  4.7 Comparison with literature models
  4.8 Compliance with the OECD principles
Results on Pimephales promelas
  5.1 Explorative analysis
  5.2 Modelling with Leadscope Enterprise™
  5.3 Modelling with DRAGON descriptors
  5.4 Modelling with binary fingerprints
  5.5 Summary of results
  5.6 Discussion of the kNN model
    5.6.1 Analysis of the neighbourhood behaviour
    5.6.2 Correlation between model descriptors and toxicity
  5.7 Investigating the effect of heterogeneity and experimental variability
  5.8 Comparison with literature models
  5.9 Compliance with the OECD principles
Conclusions and perspectives
Bibliography
List of Tables
List of Figures
List of publications
Deliverables
Appendix I: QSAR model for Daphnia
