Comprehensive Medicinal Chemistry III
30010. Fingerprints and other molecular descriptions for database analysis and searching

Dávid Bajusz1, Anita Rácz2,3, and Károly Héberger2

1 Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Institute of Organic Chemistry, Hungarian Academy of Sciences, Magyar tudósok krt. 2, H-1117 Budapest, Hungary
E-mail: [email protected]
Phone: + 36 1 382 69 74

2 Plasma Chemistry Research Group, Institute of Materials and Environmental Chemistry, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar tudósok krt. 2, H-1117 Budapest, Hungary
E-mail: [email protected] and [email protected]
Phone: + 36 1 382 65 09

3 Department of Applied Chemistry, Szent István University, Villányi út 29-43, H-1118 Budapest, Hungary

Cite as follows: Dávid Bajusz, Anita Rácz, and Károly Héberger*, Chapter 3.14 – Chemical Data Formats, Fingerprints, and Other Molecular Descriptions for Database Analysis and Searching. In: Reference Module in Chemistry, Molecular Sciences and Chemical Engineering Comprehensive Medicinal Chemistry III - Volume 3 Editor-in-Chiefs Samuel Chackalamannil, David P. Rotella and Simon E. Ward; In silico methods Eds: A. Davies, C. Edge, Available online 13 June 2017, Pages 329–378 Oxford: Elsevier. Elsevier (2017) http://dx.doi.org/10.1016/B978-0-12-409547-2.12345-5 ISBN: 9780128032008

Abstract

In this chapter we strive to provide a comprehensive but reasonably compact overview of the various possibilities for the computational representation of molecules. This includes a detailed introduction to the most commonly used chemical file formats (complemented with a few novel or more specific representations), a thorough overview of the theoretical backgrounds of various molecular fingerprints and descriptors, and a complete subchapter devoted to similarity measures and data fusion approaches.
Finally, we provide a list of the most important online chemical databases and conclude the chapter with a short outlook on present trends and future expectations.

Keywords

molecular fingerprint, similarity, distance metric, molecular descriptor, database analysis, data fusion, cheminformatics, drug design, chemical file format, virtual screening

1. Introduction

Molecules possess an abundance of properties. 3D structure, atom connectivity, shape and physicochemical descriptors such as molecular weight or logP (logarithm of the n-octanol/water partition coefficient) are just a few of the properties that are usually of interest in the overlapping domains of cheminformatics, molecular modeling and drug design. Many of these properties are quite intricate: for example, 3D structure can rarely be characterized with a single conformation (but rather with an ensemble of conformations), and even a seemingly simple property such as molecular weight depends on the isotope distribution of the constituent atoms. The affinity of the molecule towards various types of environment (such as a physiological solution, a mixture of solvents or a particular protein binding pocket) – which is of particular interest in the mentioned fields – expands the property space of even a single compound beyond comprehension. As such, it is currently impossible to give a perfectly accurate and complete computational representation of a molecule that accounts for its every possible aspect; however, in most cases this is not necessary. Nonetheless, one should always be aware of the implied simplifications in the computational representation of molecules, especially when making predictions based on them. In other words, the main responsibility of the computational chemist (or drug designer or cheminformatician) is to know which aspects of a molecule must be considered for the given application.
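
As a small illustration of the point that even molecular weight is not a single, unambiguous number, the following sketch compares the average mass of L-phenylalanine with its monoisotopic mass (the mass of the species built only from the most abundant isotope of each element). The molecular formula C9H11NO2, the rounded atomic masses, and all names in the code are assumptions introduced for this example; they are not taken from the chapter.

```python
# Minimal sketch: two different "molecular weights" for the same molecule.
# Atomic masses are rounded reference values typed in by hand (assumption).

# Abundance-weighted average atomic masses (g/mol).
AVERAGE_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

# Masses of the most abundant isotopes: 12C, 1H, 14N, 16O (Da).
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

# Element counts for L-phenylalanine, C9H11NO2 (assumed formula for the example).
PHENYLALANINE = {"C": 9, "H": 11, "N": 1, "O": 2}


def molecular_mass(formula, atomic_masses):
    """Sum the atomic masses over all atoms in an element-count dictionary."""
    return sum(count * atomic_masses[element] for element, count in formula.items())


average = molecular_mass(PHENYLALANINE, AVERAGE_MASS)
monoisotopic = molecular_mass(PHENYLALANINE, MONOISOTOPIC_MASS)

print(f"average mass:      {average:.3f}")
print(f"monoisotopic mass: {monoisotopic:.3f}")
```

With these input values the two results differ by roughly 0.1 mass units (about 165.19 versus about 165.08). Which of the two is the appropriate descriptor depends on the application, for example bulk stoichiometry versus mass spectrometry.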
Once those aspects are identified, one can utilize the predictive power of computational chemistry techniques to provide invaluable guidance for medicinal chemistry endeavors.

In recent decades, medicinal chemistry has been armed with a growing number of large, highly featured and annotated databases. These databases contain the ever-growing public (and proprietary) knowledge that is being aggregated by medicinal chemistry research, and they provide an invaluable foundation for retrospective analyses and prospective/predictive work. Databases containing chemical information can be queried in a number of ways, each of which requires specific types of molecular representations. Substructure searches require an efficient chemical query language (e.g. SMARTS), while similarity searches need a means to quantify the similarity of two molecules, together with appropriate structural representations that allow for such operations (e.g. molecular fingerprints). It is also common to query databases for specific values or ranges of quantifiable molecular properties such as molecular weight or logP.

In this chapter, we provide a comprehensive overview of molecular representations, including file formats, fingerprints and molecular descriptors – with an emphasis on those that are widely applied in works involving large databases. We also dedicate a subchapter to molecular similarity (and the diverse methodologies that have been developed to elaborate this sub-field), as a ubiquitously applied concept in cheminformatics and drug design. Last but not least, we provide a short overview of the most prominent online resources of molecular information that ultimately fuel most of the computational chemical research that involves large compound sets.

2. Chemical file formats and data structures

The means to store chemical information started to develop in parallel with the first computers (in fact, before computers became standard work tools).
Since that time, generations of file formats have been developed (and have occasionally become obsolete). Because each file format was developed with a specific purpose, each encodes different aspects of the structure and properties of a molecule. Some of them are optimized for compactness, while others allow the storage of various properties or other information besides the 3D structure of the molecule. A complete overview of the available chemical file formats would be a futile effort – and probably pointless, as there is little relevance (from a cheminformatics point of view) in covering file formats that were developed for, e.g., specific molecular dynamics or quantum chemistry programs. Instead, we dedicate this subchapter to introducing the reader to today’s most widely used, cross-platform, human-readable chemical file formats, with an emphasis on their use in database searching and other subfields of cheminformatics. Throughout these subsections, we will occasionally refer to the structural representation of L-phenylalanine (see Figure 1) in the various file formats as a common example.

Figure 1. Neutral (1) and zwitterionic (2) forms of L-phenylalanine. The α-carbon is annotated with the absolute configuration of the stereogenic center.

Before proceeding with the description of specific file formats, we make note of a convenient means for the interconversion of various chemical file formats. While many modeling software suites are capable of handling multiple file formats (and, consequently, of converting between them), computer programs dedicated specifically to this task also exist. Probably the best-known example is Open Babel, an open-source program for the interconversion of chemical structures between a large number of file formats (over 110 for the current version).1

2.1 Line notation formats

The principle of line notation formats is to store chemical structures unambiguously, but as compactly as possible (usually on a single line).
In fact, the IUPAC Recommendations on Organic and Biochemical Nomenclature can be thought of as a line notation system as well.2 (Ironically, though, it is not necessarily more human-readable than, e.g., the SMILES format, especially for large molecules.) In that “format”, the representation of L-phenylalanine would be (2S)-2-amino-3-phenylpropanoic acid (or somewhat less intuitively, (2S)-2-azaniumyl-3-phenylpropanoate for the zwitterionic form). While IUPAC names can unambiguously describe arbitrarily complicated molecules, the names themselves are often very complicated as well. Also, a substructure can usually be specified in multiple ways (depending on other groups, the order of priority, etc.), which essentially removes text-based IUPAC name queries from the toolbox of substructure searching.

Among the available tools for chemical identification and documentation, today’s standard is the CAS (Chemical Abstracts Service) Registry number.3,4 CAS maintains a constantly updated database (the CAS Registry) to store every reported chemical structure (ca. 111 million at the time these lines are written) with a uniquely assigned identifier, the CAS Registry number (e.g. for L-phenylalanine: 63-91-2). In addition to the database itself, the CAS Registry powers the two major chemical information services of CAS: SciFinder (for querying chemical structures and reactions, primarily in the literature)5 and STN (a search engine that provides access to patent content).6 While CAS Registry numbers are quite compact linear notations, they do not provide direct, human-readable information on molecular structure, nor on the relation or proximity to another substance (e.g. the CAS number for racemic phenylalanine is 150-30-1).

Efforts to provide compact means of chemical structure representation using line notations date back to the late 1940s when the