Substructural Analysis Techniques Fc, R Structure-Property Correlation Within Computerised Chemical Information Systems
Total Page:16
File Type:pdf, Size:1020Kb
Substructural Analysis Techniques fc, r Structure-Property Correlation within Computerised Chemical Information Systems David Bawden Ph. D. Thesis, Sheffield University Postgraduate School of Librarianship and Information Science February, 1978 BEST COPY AVAILABLE Variable print quality r "Acknowl ed clemen ts I have to thank particularly George Adamson, for constant help and encouragement, and the staffs of PSGLIS and (in the latter phases) Pfizer Central Research for support and forbearance. Also, for varied reasons, my thanks go to Val Addis, Barbara Bateson, Pam Bratherton, Phil Briggs, Judy Bush, the Department of Education and Science, Dick Jones, Mike Lynch, Alice-McLure, Ian McLuure, Brian Pedley, David Saggers, Sheffield University Computing Services, Fiona Strong, Vera Swallow, Iwona Szymanska, Judith Ufton, George Vleduts, Linda Wickham, and Peter Willett. CONTENTS Chapter Page Summary 1 1 Introduction 3 2 Background 5 2.2 Chemical Structure Representation 8 2.3 Structure-Property Correlation 55 3 Substructural Analysis Techniques 110 4 Analysis 114 4.1 Multiple Regression Analyses 145 Serum binding of penicillins 146 Partition coefficients of diverse structuresl54 Rats of bromination of benzene derivatives 160 pK values of benzene carboxylic acids 168 Heats of formation of unsaturated aliphatics173 Heats of vaporisation of diverse structures 177 Physicochemical and biological properties of benzimidazoles 193 pKa values of diverse heterocycles 203 Boiling points of Alicycles 216 4.2 Cluster Analyses 226 5 Conclusions 242 Appendix Computer Programs 256 References 261 Figures 319 Figures (Cluster Analyses) 351 Tables 4z 4 ý() .I- Summa ry David Bawden Ph. D. Thesis, 1978 Substructural Analysis Techniques for Structure-Property Correlation within Computerised Chemical Information Systems Summary The work described in this thesis involves a novel method of substructural analysis, with potential application for structure- property correlation and information retrieval within computerised chemical information systems. A review is given of the development of the concept of chemical structure and its representation, its application in computerised chemical information systems, and methods for correlating structure with molecular properties. A method is presented for derivation of structural features, (WLN) representing the whole structure, from Wiswesser Line Notation by computer program. These features are then used as variables in statistical analysis procedures: in this work multiple regression analysis and cluster analysis are used. This procedure allows for a rapid, convenient and thorough analysis of large data-sets. The type of structural features used may be easily varied, allowing for investi- gation of factors such as ring substitution patterns, group interactions, and three-dimensional structure. The method is applicable to sets of diverse or structurally related compounds. Statistical tests of the results enable quantitative testing of hypotheses. Multiple regression analysis allows a direct, quantitative correlation between structure and molecular property, and subsequent property prediction. It is applied to sets of aliphatic, alicyclic, () (i 012.1 aromatic, and heterocyclic compounds, including sets of highly diverse structures. Properties examined include biological effects, toxicty, pK, thermochemical properties, boiling point, solubility, and partition coefficient. Some of these properties are highly dependent upon electronic and steric effects, and hence upon relative position of substituents, and on three-dimensional structure. Highly significant correlations are obtained in all cases, and the potential for property prediction is demonstrated. Cluster analysis is applied to several sets of structures. Intuitively sensible classifications are obtained, and the potential discussed. for both property prediction and information retrieval Since these techniques involve the widely used WLN, relatively simple COBOL programs, and standard statistical packages, they should be applicable within operational environments. Chapter One Introduction 'Book One, Part One, Chapter One, Page One. What ä great start' (Charles Schulz) / I, 0 ()4 This thesis is divided into four main sections. Chapter 2 "sets the scene" by reviewing the background to this work. Firstly, the development of the concept of chemical structure, and its representation is discussed. Secondly, the relatively recent incorporation of these ideas within computerised chemical information systems is outlined. Thirdly, the various methods of quantitative structure-property correlation are described. In Chapter 3, aspects of substructural Analysis techniques are outlined. Fragmentation techniques are discussed, and the statistical techniques to be used are described. Chapter 4 contains an account of the analyses based on WLN structure representation which comprise the practical work of this study. This chapter is divided into two parts, the first dealing with multiple regression analysis, and the second with cluster analysis. Chapter 5 concludes and summarises the work. Brief details of the computer programs are given in an Appendix. Chapter Two Background 'It is hard enough to keep up with life, let. alone technical information' (Anonymous survey respondent) / OU The purpo_e of this chapter is to give an outline of the historical development, present state of knowledge, and current practical applications of areas of study relevant to the subject matter of this thesis, in order to put the work described below into context. This involves concentration on two major themes: firstly the development and applications of chemical structure representations, and secondly the development and application of for the methods correlation of chemical structure with molecular properties. The unifying factor here is the concept of chemical structure. This provides an unambiguous description of a compound, and a basis for rationalising its observed properties, from which may be deduced general principles regarding the relationships between structuro and property. It is significant that the first concept of a straightforward quantitative relationship between chemical structure end biological activity was put forward partly by Alexander Crum Brown (CRUD{ 331 VN et nl, 1868) who made notable contributions to the development and use of structural representations. This structure concept is fundamental to the greater part of modern chemically related Sciences. The ability to adequately end conveniently represent chemical structure is of obvious importance. It will be seen that, ailthough sophisticated nathematical nedels of stoleculr: r structure have been devised, the chemical structure diagrran remains the rroat - widely used and valuable representation, and that computer- readable structure representations are, for the most part, partial or total representations of 'a structure diagram. These structural formulae are not only the most convenient means for communication 00°I of chemical ideas, but are, as Hammond has pointed out, ftthe basic vehicle in the search for patterns (in chemical data)" (HAM MOND, 1974). Much of the discussion below will centre on their adequacy in this search. / 0118 2.2. Chemical Structure Representation In the aection below, the development of the concept of chemical at rscture and its representation will be considered first. The handling of structural representations in chemical information systems will be briefly discussed, and some relevant aspects of the operation of such aystema will be considered. The discussion will centre on chemical structure information systems, i. e. those having files of structural representations with which other forms of data are associated. It is with such systems, especially if computer-based, that the techniques of structure-property correlation to be described below are likely to prove of most value. / 010 2.2.1. Development of the Concept of Chemical Structure and its Repreaentati onto Oil 2.2.1. ßevelopmPnt of the Concept of Che-mic*1 Str-aatcro enG its 13epre een t Rt i. on The advances made within the past two centuries in those aspects of the natural sciences which depend on an understanding of the properties of material substances, which encompass a large part of modern science and technology, have depended heavily upon the development of two chemical concepts. The first of these is the concept of chemical composition, i. e. the make-up of a substance in terms of numbers of constituent atoms. The second is the concept of chemical structure, i. e. the arrangement of the constituent atoms. The recognition of the significance of these factors and their elucidation for a great variety of compounds were in themselves important for the development-of all branches of chemistry. Their full value could only be realised, however, with the devising of methods for representing composition and structure, either as a nomenclature, a notation, or a diagram. Only when this was achieved could the unifying concepts of composition and structure be used to rationalise, systematise and record the mass of experimental observations produced in the upsurge of chemical research which commenced in the early nineteenth century (PARTINGTON, 1964a). 012 It has been pointed out that "to write a full description of the origin, growth and misadventures of the language of cher is3try is to write a history of the sciencelt (HUM, 1907). Similarly, in order to describe the development of chemical structural represen- tation it is necessary