PDF hosted at the Radboud Repository of the Radboud University Nijmegen. The following full text is an author's version which may differ from the publisher's version. For additional information about this publication, see http://hdl.handle.net/2066/72267. Please be advised that this information was generated on 2021-10-05 and may be subject to change.

Representation of Molecules and Molecular Systems in Data Analysis and Modeling

A scientific essay in the field of natural sciences, mathematics and computer science

Doctoral thesis to obtain the degree of doctor from the Radboud Universiteit Nijmegen, on the authority of the Rector Magnificus Prof. mr. S.C.J.J. Kortmann, according to the decision of the Council of Deans, to be defended in public on Wednesday 2 April 2008 at 13:30 precisely by Egon Lennert Willighagen, born on 27 October 1974 in Arnhem.

Supervisors (promotores):
  Prof. dr. L.M.C. Buydens
  Prof. dr. P. Murray-Rust (University of Cambridge, United Kingdom)

Co-supervisor (copromotor):
  Dr. R. Wehrens

Manuscript committee:
  Prof. E. Vlieg
  Prof. dr. J. Gasteiger (Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany)
  Dr. C. Steinbeck (European Bioinformatics Institute, United Kingdom)

The work presented in this thesis was supported financially by the Netherlands Organization for Scientific Research (NWO).

Copyright © 2008 by E.L. Willighagen
All rights reserved
ISBN 978-90-9022806-8

Contents

1 Introduction
  1.1 Molecular Representations
  1.2 Chemical Graphs
  1.3 Quantum Chemistry
  1.4 Numerical Representations
  1.5 Chemometrics
    1.5.1 Example: CoMFA
    1.5.2 Example: classification of enzyme reactions
  1.6 Challenges
    1.6.1 Representation of Molecular Systems
    1.6.2 Data Storage and Communication
  1.7 Selected Problems
  Bibliography
2 Molecular Chemometrics
  2.1 Introduction
  2.2 Molecular Representation
    2.2.1 Molecular Descriptions
    2.2.2 Beyond the molecule
  2.3 Chemical Space, similarity and diversity
  2.4 Activity and Property Modeling
    2.4.1 Dimension Reduction
    2.4.2 Model Validation
  2.5 Library Searching
  2.6 Conclusion
  Bibliography
3 1D NMR in QSPR
  3.1 Introduction
  3.2 Experimental
    3.2.1 Methods
    3.2.2 Data
  3.3 Results
    3.3.1 Data rank
    3.3.2 Predictivity
    3.3.3 Model interpretation
  3.4 Discussion
  3.5 Conclusions
  Bibliography
4 Comparing Crystals
  4.1 The Descriptor
  4.2 Data
    4.2.1 Cephalosporin data set
    4.2.2 Estrone data set
  4.3 Experimental
  4.4 Results
    4.4.1 Dissimilarity Classes
    4.4.2 Dendrograms and Partitionings
    4.4.3 Matching ESTRON10
  4.5 Conclusions
  Bibliography
5 Supervised SOMs
  5.1 Introduction
  5.2 Supervised self-organizing maps
  5.3 Experimental
    5.3.1 Data
    5.3.2 Representation in X and Y space
    5.3.3 Similarity calculations
    5.3.4 SOM training
    5.3.5 Software
  5.4 Applications
    5.4.1 Unit cell volume in Y space
    5.4.2 Adding space group information in Y space
    5.4.3 Analyzing simulated polymorphs
  5.5 Conclusions
  Bibliography
6 Chemical Metadata in RSS
  6.1 Introduction
  6.2 Implementations of RSS for chemical data sources
    6.2.1 Namespaces and RSS 1.0
    6.2.2 Example 1. The ChemStock System
    6.2.3 Example 2. The Dutch Dictionary on Organic Chemistry
    6.2.4 Example 3. The World Wide Molecular Matrix
  6.3 Chemical postprocessing and aggregation of RSS metadata
  6.4 Discussion and Conclusions
  Bibliography
7 Interoperability
  7.1 Introduction
  7.2 The Importance of Open Specifications for Algorithms and Data
  7.3 The Blue Obelisk Dictionary
    7.3.1 The Dictionary
    7.3.2 Finding Implementations
  7.4 The Blue Obelisk Repository
  7.5 Web Services
  7.6 Social Aspects
  7.7 Conclusion
  Bibliography
8 Discussion and Outlook
  8.1 Information Content
  8.2 Representation Characteristics
  8.3 Validation
  8.4 Reproducibility
  8.5 Data Storage and Communication
  8.6 Outlook
    8.6.1 Crystal Engineering
    8.6.2 Data Fusion
  8.7 Conclusion
List of Abbreviations
Summary
Samenvatting (summary in Dutch)
Curriculum Vitae
Publication List
Dankwoord (acknowledgements)

Chapter 1 Introduction

The topic of this thesis is the representation of molecules and molecular systems. Such a representation is needed to allow analysis and manipulation of chemical structures in the computer. This is of paramount importance in areas like drug design, synthesis planning, property prediction, crystal structure engineering, structure elucidation, searching in chemical literature, and exchange of chemical knowledge. Many different representations have been developed, each capturing different bits of information about the molecular system under study.

Unfortunately, in many cases it is unclear which part of the information is essential for a certain application. For example, although the boiling point correlates well with the number of carbon atoms in a series of alkane homologues [1], the carbon count descriptor is not generally useful for predicting other properties, or even the same property for a more diverse set of molecules. From simple physico-chemical principles, it is clear why this is the case. However, for more complex problems there is very little a priori knowledge to guide us in choosing appropriate descriptors. Nevertheless, in certain areas specific habits have evolved; for example, a large part of the quantitative structure-activity and structure-property relationship (QSAR and QSPR) community routinely calculates hundreds or thousands of simple molecular descriptors, and uses various variable-selection techniques to extract the most useful ones.
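The alkane example above can be made concrete with a small calculation. The following sketch is an illustration only and is not taken from the thesis: it assumes approximate, rounded literature boiling points for the n-alkanes methane through octane, and the names `ALKANE_BP` and `fit_line` are hypothetical. It fits a straight line of boiling point against carbon count and then applies that line to ethanol, a molecule outside the homologous series.

```python
from statistics import mean

# Assumed, approximate literature boiling points (deg C) of the
# n-alkanes methane (1 carbon) through octane (8 carbons).
ALKANE_BP = {1: -161.5, 2: -88.6, 3: -42.1, 4: -0.5,
             5: 36.1, 6: 68.7, 7: 98.4, 8: 125.7}


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b, pearson_r)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    r = sxy / (sxx * syy) ** 0.5
    return a, b, r


carbons = list(ALKANE_BP)
boiling_points = [ALKANE_BP[n] for n in carbons]
slope, intercept, r = fit_line(carbons, boiling_points)

# Ethanol also has two carbon atoms but boils near 78 deg C (assumed value),
# so a model that only sees the carbon count cannot predict it.
ethanol_error = abs((slope * 2 + intercept) - 78.4)

print(f"r = {r:.3f}, slope = {slope:.1f} deg C per carbon")
print(f"absolute error for ethanol = {ethanol_error:.0f} deg C")
```

Under these assumed values the correlation coefficient within the alkane series is above 0.98, while the prediction for ethanol is off by more than 150 degrees. This is the failure on a more diverse set of molecules that the text describes: the carbon count carries no information about the hydroxyl group.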
Unfortunately, validation of this descriptor-selection process is almost impossible due to the small size of data sets. It would be a giant leap forward if we could say beforehand, based on the characteristics of the molecular system and our aim, which descriptors would be most informative. This is currently, however, still too far-fetched. Therefore, we are forced to judge the quality of the representation on the basis of the quality of the prediction: if we are able to correctly predict properties of new compounds, then we conclude that the representation contains relevant information.

This thesis studies the role of representation in modeling properties of molecular systems of organic molecules and in the exchange of molecular information. The following paragraphs give an overview of useful representations.

1.1 Molecular Representations

The two most common methods to represent organic molecules are the (systematic) name and the 2D drawing of the molecule. They identify the molecule of interest, but cannot be used for machine processing. To prevent ambiguities, conventions describing how molecules should be named and drawn are needed. IUPAC name recommendations, and line notations such.