InChI, the IUPAC International Chemical Identifier
Stephen R Heller1*, Alan McNaught2, Igor Pletnev3, Stephen Stein1 and Dmitrii Tchekhovskoi1
Heller et al. Journal of Cheminformatics (2015) 7:23
DOI 10.1186/s13321-015-0068-4

RESEARCH ARTICLE Open Access

Abstract
This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.

Keywords: InChI, InChIKey, Chemical structure linear notation, Chemical identifier, IUPAC standard

* Correspondence: [email protected]
1Biomolecular Measurement Division, National Institute of Standards and Technology, Gaithersburg, MD 20899-8362, USA
Full list of author information is available at the end of the article

© 2015 Heller et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Introduction
InChI is the International Chemical Identifier developed under the auspices of IUPAC, the International Union of Pure and Applied Chemistry [1], with principal contributions from NIST (the U.S. National Institute of Standards and Technology [2]) and the InChI Trust [3].

This paper documents the design, layout and algorithms of InChI. It is intended to provide a reasonably detailed description without being overlong for a journal article. For a briefer introduction, which also provides more detail on historical and organizational matters, the reader is referred to a recent paper by the same authors [4]. For a more technical description, one may consult the InChI Technical Manual [5] and the free source codes of the InChI software, which are available from the InChI Trust [6].

The paper is organized as follows. First, we discuss the general concepts associated with chemical identifiers. Then we outline the design goals of InChI and our general approach, focussing on the InChI model of chemical structure and the hierarchical layered structure of the Identifier; the concept of Standard InChI is introduced. This is followed by a detailed description of each of the possible major InChI layers, accounting for molecular connectivity, charge, stereochemistry, isotopic enrichment, position of hydrogen atoms and bonding in metal compounds, and the sublayers associated with these layers. We then describe the workflow of InChI generation (normalization, canonicalization, and serialization stages), as well as generation of the compact hashed code derived from InChI (InChIKey); the related algorithms and implementation details are briefly discussed. Finally, we provide information about InChI Software, licensing, known problems/limitations, and future prospects for InChI.

Background
A chemical identifier is a text label that denotes a chemical substance^a. These labels are of the utmost importance as they provide a convenient means of comparing and distinguishing chemicals in a variety of applications, from the design of new materials to legal and regulatory issues.

The main requirement for an identifier is that the label must be unambiguous: the same label must always refer to the same substance, and no other substance may have this label. Two different substances must have different labels.

Note that an identifier may not be strictly unique in the sense that the same substance may be, on a case-by-case basis, denoted by several distinct synonymical labels (provided that the lists of synonyms for different substances do not overlap). An obvious example is given by IUPAC chemical nomenclature that allows one to produce and use different names for a single compound; nevertheless, all these names unambiguously identify the compound. Though not necessary, strict uniqueness, that is, always assigning a single label to a particular substance, is highly convenient and very desirable.

The concept of “chemical identifier” heavily relies on the concepts of chemical substance and chemical identity. The IUPAC Compendium of Chemical Terminology, the “Gold Book”, defines “chemical substance” as “Matter of constant composition best characterized by the entities (molecules, formula units, atoms) it is composed of. Physical properties such as density, refractive index, electric conductivity, melting point etc. characterize the chemical substance” [7]. As this definition implies, identity of a chemical substance is determined by its constituent units and properties. It is noteworthy that even at this highly general level, this consideration is somewhat restrictive. For example, the concept of “chemical substance” is not applicable to material that is not of constant composition (e.g., oil). This consideration is also somewhat counter-intuitive, for example, as concerns aggregate states, polymorphs, etc. Thus, most chemists would agree that “water” is a chemical substance that may appear as steam, ice or liquid water, and that all three should have the same chemical identifier – despite the fact that each may be isolated as a different state of matter, placed in a test tube or stored in a bottle. In other words, the “identifying power” of a chemical identifier is inherently limited.

The oldest known chemical identifiers are words of natural languages describing common chemicals with terms like “water”, “iron” or “table salt”; they are trivial names, in modern nomenclature. Notably, trivial names exemplify the principle that a chemical identifier is not necessarily related to molecular structure. The identity of chemical substances denoted by trivial names was historically determined by a set of characteristic physical and chemical properties, long before exact structures were resolved, or even before the very concept of molecular structure evolved in the second half of the 19th century. Of course, today’s trivial names are associated with chemical structures (yet the structures may not be fully defined, as is common for natural products).

A trivial name is an example of a registry-lookup chemical identifier: it provides a unique label for the named substance but the label itself says nothing (or little) about the characteristic properties and structure. Such data are stored in electronic or printed registries (handbooks) that uniquely associate the label with the properties/structure. Retrieving reference data requires a registry lookup.

More recent examples of registry-lookup identifiers are those associated with large printed or electronic collections of chemical structures and properties – Beilstein and Gmelin Registry numbers [8], Chemical Abstracts Service (CAS) Registry numbers by the American Chemical Society [9], EC numbers from the European Community Inventory [10], CID and SID numbers assigned by PubChem [11], and identifiers assigned by ChEMBL [12], ChemSpider [13], etc.

Note that all the above registry-lookup identifiers [...] their associated registry is a decentralized compendium of handbooks and nomenclature rules. In these cases, there is no algorithm for direct conversion of molecular structures to these identifying labels).

Despite the widespread use of registry-lookup authority-assigned chemical identifiers, these types of identifier have a number of substantial drawbacks. For example, even the largest registry cannot include all known chemical substances. Furthermore, no registry can include a substance that has not previously existed and for which a hypothesized structure is drawn or computed. In addition, some authorities may impose restricted access and/or require payment for assigning labels and even for lookup in their registries.

The alternative to authority-assigned identifiers is structure-based chemical identifiers. These are derived from molecular structural formulae, either drawn in print form or presented in digital form. When the algorithm for an identifier’s derivation and/or a related utility tool becomes publicly available, anyone has the ability to produce the identifier for a given structure. (Note that structure-based identifiers still may require registry lookup to recover the structure from the label.)

The earliest examples of structure-based identifiers are the systematic names of classical chemical nomenclatures established either by IUPAC or by CAS. However, the nomenclature rules developed by these authorities are not easy to learn and practice, even for professional chemists. Misinterpretation may result in ambiguous naming. Finally, and most importantly, systematic chemical names are not well suited for digital representations and the internet: they tend to be too long and contain non-alphanumeric characters (i.e., other than Latin letters and digits). As an example, Figure 1 gives the IUPAC systematic chemical name for the marine toxin palytoxin [14]; this is compared with the much shorter InChIKey (discussed later).

Since the second half of the 20th century, chemical structure linear notations have become well established as an alternative to classical nomenclature. These textual labels are derived using specific algorithms from molecular structural formulae. They serve as handy textual substitutes for structural formulae, being much more convenient in database and internet applications. Examples are the pioneering Wiswesser line-formula notation,