
Canonical Line Notations: InChI vs SMILES
Krisztina Boda
5th Joint Sheffield Conference on Chemoinformatics, July 2010

Overview
• Compound naming
• InChI
• SMILES
• Molecular equivalency
  – Isomorphism
  – Kekule
  – Tautomers
• Finding duplicates

What’s Your Name?
1. Unique numbers (one-to-one)
• CAS registry number (phenol: CAS 108-95-2)
• BRN (Beilstein Registry Number)
• EC (European Community) number
• CID (PubChem compound ID)
2. Chemical nomenclatures (one-to-many)
• IUPAC name(s): IUPAC [ENG] phenol, IUPAC [HUN] fenol
• Traditional name(s): [carbolic acid, benzenol, phenylic acid, std.]

What’s Your Name? (continued)
3. Hashed identifiers (many-to-one)
• NCI/CADD structure identifiers
• InChIKey: InChIKey=ISWSIDIOOBJBQZ-UHFFFAOYSA-N
4. Chemical line notations (one-to-one)
• WLN (Wiswesser Line Notation)
• SMILES: c1ccccc1O
• InChI: InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H

InChI
InChI Identifier (International Chemical Identifier)
• Developed by IUPAC and NIST [first release 2005]
• One reference implementation
• Multiple hierarchical layers
  – No bond order
• Not a file format!
Example: InChI=1S/C2H4ClNO2/c3-1(4)2(5)6/h1H,4H2,(H,5,6)/p+1/t1-/m1/s1/i3+0
• version: 1S
• main layer: formula (C2H4ClNO2), connection (/c), static and mobile H (/h)
• charge: /p
• stereo: /t, /m, /s
• isotope: /i

SMILES
SMILES (Simplified Molecular Input Line Entry System) [1]
Valid SMILES for phenol: c1ccccc1O, Oc1ccccc1, C1=CC=CC=C1O, C1=CC=C(C=C1)O, [CH]1=[CH][CH]=[CH][CH]=C1[OH]
• Canonical SMILES [OEChem: c1ccc(cc1)O]
  – Unique name for each molecule in one system
  – Not a global identifier
• Canonical Isomeric SMILES
  – Encode isotope, double bond and chiral configuration
  – Example: C(C(=O)O)([NH3+])Cl becomes [C@H](C(=O)O)([NH3+])[35Cl]
[1] David Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci., 1988 (28)

Molecule Registration
Is this already in my database?
[Figure: a query structure with numbered atoms next to candidate database structures]

Molecule Equivalency - Isomorphism
• NP problem (graph isomorphism)
• Complexity = HvyAtm! (brute force over all orderings of the heavy atoms)
• Impractical for molecule registration!

Canonicalization
• Transforms any connection table into a unique canonical form (independent of atom indices)
• Comparing molecules reduces to comparing unique, unambiguous string representations

SMILES/InChI Generation
Input structure, then one of two pipelines:
• Canonical SMILES (OpenEye): pre-processing, canonicalization (partitioning algorithm), SMILES generation
• InChI: pre-processing [normalization], canonicalization (Morgan algorithm), InChI generation
Pre-processing steps listed on the slide:
• bond normalization
• aromaticity perception
• disconnect salts/metal atoms
• eliminating radicals
• normalize mobile hydrogens, variable protonation and charge
• tautomer detection

Canonicalization
Finding unique atom ordering
1. Initial partitioning of atoms by distinguishing them by their local properties.
2. Recursively refining atom classes by distinguishing atoms by their neighbors.
3. When each atom belongs to a unique partition, a unique canonical atom order exists.

SMILES - Generation
c1cc2cc([nH]c2cc1C[C@@H](CC(=O)O)N)/C=C/C3CCC(=O)C3
Based on the canonical ordering of atoms, a unique SMILES is generated by depth-first search (DFS).
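The three canonicalization steps above can be illustrated with a small sketch. This is not the OpenEye or InChI implementation: the function names (`refine_ranks`, `renumber`, `canonical_table`) are hypothetical, the initial atom invariant is assumed to be just (element, degree), and no tie-breaking is done for symmetry-equivalent atoms, which a real canonicalizer needs. The molecule is the heavy-atom graph of 2-hydroxypyridine (c1ccnc(c1)O).

```python
def refine_ranks(elements, bonds):
    """Return one rank per atom using iterative partition refinement."""
    n = len(elements)
    neighbors = [[] for _ in range(n)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def dense_rank(keys):
        # Equal keys share a rank; ranks depend only on the key values.
        order = {key: i for i, key in enumerate(sorted(set(keys)))}
        return [order[key] for key in keys]

    # Step 1: initial partition from local properties (assumed: element, degree).
    ranks = dense_rank([(elements[i], len(neighbors[i])) for i in range(n)])
    while True:
        # Step 2: split classes by the sorted ranks of each atom's neighbors.
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
                for i in range(n)]
        new_ranks = dense_rank(keys)
        if len(set(new_ranks)) == len(set(ranks)):
            return ranks  # Step 3: no class was split, refinement is done.
        ranks = new_ranks


def renumber(elements, bonds, perm):
    """Apply an atom renumbering, where perm[old_index] = new_index."""
    new_elements = [None] * len(elements)
    for old, new in enumerate(perm):
        new_elements[new] = elements[old]
    return new_elements, [(perm[a], perm[b]) for a, b in bonds]


def canonical_table(elements, bonds):
    """Connection table rewritten in terms of ranks instead of input indices."""
    ranks = refine_ranks(elements, bonds)
    atoms = sorted(zip(ranks, elements))
    edges = sorted(tuple(sorted((ranks[a], ranks[b]))) for a, b in bonds)
    return atoms, edges


# Heavy atoms of 2-hydroxypyridine: O0-C1, ring C1-C2-C3-C4-C5-N6-C1.
elements = ["O", "C", "C", "C", "C", "C", "N"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

reference = canonical_table(elements, bonds)
shuffled = renumber(elements, bonds, [3, 0, 6, 1, 5, 2, 4])
print(canonical_table(*shuffled) == reference)
```

Because every rank is computed from atom invariants rather than input indices, the rank-based table does not change when the atoms are renumbered. For a symmetric molecule such as benzene the refinement leaves all atoms in one class, so a full canonicalizer would still have to break ties before generating a string.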
SMILES Validation
Name has to be independent of atom numbering!
• Test set: 10M compounds (ACD, Aldrich, Maybridge, MDDR, NCI etc.)
• Repeat 100 ×:
  – Randomly renumber atoms (CT)
  – Generate canonical IsoSMILES (e.g. c1ccnc(c1)O)
  – Identical string? yes / no

SMILES Validation (results)
Failure categories (counts shown on the slide: 7, 63, 8):
• SMILES bond representation problem
• adamantane-like stereo ring systems
• highly-complex ring systems
Number of total failures: 78 (0.0008%)

InChI Validation
Name has to be independent of atom numbering!
• Test set: 10M compounds (ACD, Aldrich, Maybridge, MDDR, NCI etc.)
• Repeat 100 ×:
  – Randomly renumber atoms (SDF)
  – Generate InChI (e.g. InChI=1S/C5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7))
  – Identical string? yes / no

InChI Validation (results)
Failure categories (counts shown on the slide: 28, 27):
• charge layer
• stereo layer
Number of total failures: 55 (0.0006%)

Molecule Equivalency (Kekule)
Canonical SMILES
• aromatic form (no bond between aromatic atoms)
InChI
• no bond information
Kekule inputs: C1(=CC=CC=N1)O and C=1(C=CC=CN1)O
c1cnc(O)ccc1
InChI=1S/C5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7)

Molecule Equivalency - Tautomer
• J. Comput.-Aided Mol. Des. (June 2010)
For registration, the canonical tautomer does not have to be
• the chemically most reasonable
• the lowest in energy
The canonical tautomer has to be identical for every input tautomer.

InChI - Tautomer Detection
M, Z = trivalent N, bivalent O, S, Se, Te
Q = C, N, S, P, Sb, As, Se, Te, Br, Cl, I
H = H, D, T
• M=Q-ZH <-> MH-Q=Z
• M=Q-Z− <-> M−-Q=Z
[Figure: further tautomeric patterns drawn with substituents R1 to R5]

InChI - Handling Tautomers
Same InChI for all 15 guanine tautomers!
InChI=1S/C5H5N5O/c6-5-9-3-2(4(11)10-5)7-1-8-3/h1H,(H4,6,7,8,9,10,11)
The mobile hydrogens are recorded in the hydrogen layer (/h).

SMILES – Handling tautomers
• Identifying and handling tautomers is not part of the language
• Tautomeric structures are explicitly specified
  – No tautomeric bond
  – No mobile hydrogens
The 15 guanine tautomers as SMILES:
• c1[nH]c2c(n1)[nH]c(nc2=O)N
• c1[nH]c2c(=O)[nH]c(nc2n1)N
• c1[nH]c2c(n1)c(=O)nc([nH]2)N
• c1[nH]c2c(n1)c(=O)[nH]c(n2)N
• c1[nH]c2c(n1)[nH]c(=N)[nH]c2=O
• c1[nH]c2c(n1)c(=O)[nH]c(=N)[nH]2
• c1[nH]c2c(n1)nc(nc2O)N
• c1nc-2c(nc([nH]c2n1)N)O
• c1[nH]c2c(n1)c(nc(n2)N)O
• c1nc-2c([nH]c(nc2n1)N)O
• c1[nH]c2c(n1)[nH]c(=N)nc2O
• c1[nH]c2c([nH]c(=N)nc2n1)O
• c1[nH]c2c(n1)c(nc(=N)[nH]2)O
• c1nc-2c([nH]c(=N)[nH]c2n1)O
• c1[nH]c2c(n1)c([nH]c(=N)n2)O

SMILES/InChI Generation
Input structure, then one of two pipelines:
• Canonical SMILES (OpenEye): pre-processing (QuacPac), canonicalization (partitioning algorithm), SMILES generation
• InChI: pre-processing [normalization], canonicalization (Morgan algorithm), InChI generation
Pre-processing steps listed on the slide:
• bond normalization
• aromaticity perception
• disconnect salts/metal atoms
• eliminating radicals
• normalize mobile hydrogens, variable protonation and charge
• tautomer detection

QuacPac – Canonical Tautomer
1. atom typing (donor / acceptor)
2. removing (+counting) hydrogens
3. unspecifying bond types
4. enumerating tautomers (DFS)
• protonating/deprotonating atom
5. Empirical, rule-based scoring to find the canonical tautomer
Global algorithm that systematically generates all tautomers!

Validation – Canonical Tautomer
• Test set: 40K structures
• For each input (e.g. guanine): generate all tautomers (the tautomer set of the input)
• Generate the canonical tautomer of each member of the set
• Identical canonical tautomer? yes / no

Tautomer Generation for SMILES
• Input: 40K
• Generated tautomers: 230K
• Failures: 272 (0.007%)
• Failures due to changing E/Z geometry: 269
Example: C/N=C(/CCC)N vs CN=C(CCC)N

Identifying Duplicate Compounds
~13K duplicates identified by both methods (TauSMILES and InChI)

Number of duplicates        CMC   MDDR     NCI    ACD
Both InChI & TauSMILES       14    704   12784   8940
Only InChI                    1     10      24   1532
Only TauSMILES                0     21      60    334

Duplicates
[Figure: example structures from the NCI set with their NSC numbers]
• NSC numbers: 49178, 51705, 83559, 83560, 83561, 83562, 97202, 112278, 187662
• NSC numbers: 74869
• NSC numbers: 660048
• NSC numbers: 60188
• NSC numbers: 408479
• NSC numbers: 98689
• NSC numbers: 98690

Differences – Tautomer recognition
c1ccn2c(c1)nc(=O)cc2O
C8H6N2O2/c11-7-5-8(12)10-4-2-1-3-6(10)9-7/h1-5,12H
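Differences in tautomer recognition like the one above are presumably what separate the "Only InChI" and "Only TauSMILES" counts in the duplicate comparison. A minimal sketch of that bookkeeping follows, assuming each record already has an identifier string from each method. The function names, the record IDs and the placeholder identifiers ("A", "x", ...) are all made up for illustration, and counting records rather than groups is an assumption, since the slides do not say which was counted.

```python
from collections import defaultdict


def duplicate_ids(records):
    """Return the record IDs that share an identifier with another record.

    records: iterable of (record_id, identifier_string) pairs.
    """
    groups = defaultdict(list)
    for record_id, identifier in records:
        groups[identifier].append(record_id)
    return {rid for ids in groups.values() if len(ids) > 1 for rid in ids}


def compare_methods(inchi_records, tausmiles_records):
    """Split duplicate records into both / only-InChI / only-TauSMILES sets."""
    by_inchi = duplicate_ids(inchi_records)
    by_tausmiles = duplicate_ids(tausmiles_records)
    return {
        "both": by_inchi & by_tausmiles,
        "only_inchi": by_inchi - by_tausmiles,
        "only_tausmiles": by_tausmiles - by_inchi,
    }


# Placeholder data: six hypothetical records, identifiers are NOT real strings.
inchi_records = [("r1", "A"), ("r2", "A"), ("r3", "B"),
                 ("r4", "B"), ("r5", "C"), ("r6", "D")]
tausmiles_records = [("r1", "x"), ("r2", "x"), ("r3", "y"),
                     ("r4", "z"), ("r5", "w"), ("r6", "w")]

result = compare_methods(inchi_records, tausmiles_records)
for name, ids in result.items():
    print(name, sorted(ids))
```

Grouping by exact string equality only works because both notations are canonical: two records collide exactly when the method considers them the same compound.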