Canonical Line Notations
InChI vs SMILES
Krisztina Boda
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Overview
• Compound naming • InChI • SMILES • Molecular equivalency – Isomorphism – Kekule – Tautomers • Finding duplicates
5th Joint Sheffield Conference on Chemoinformatics July, 2010 OH H N O
What’s Your Name?
1. Unique numbers one-to-one • CAS registry number OH • BRN Beilstein Registry Number • EC (European commission) number CAS: 108-95-2 • CID (PubChem compound id)
one-to-many 2. Chemical Nomenclatures • IUPAC name(s) IUPAC [ENG] phenol • Traditional name(s) IUPAC [HUN] fenol [carbolic acid, benzenol, phenylic acid, std.]
5th Joint Sheffield Conference on Chemoinformatics July, 2010 What’s Your Name?
3. Hashed identifiers many-to-one • NCI/CADD structure identifiers • InChIKey OH
InChIKey= ISWSIDIOOBJBQZ-UHFFFAOYSA-N
4. Chemical line notations one-to-one • WLN (Wiswesser Line Notation) • SMILES • InChI InChI=1S/C6H6O/c7-6-4-1-3-5-6/h1-5,7H
c1ccccc1O
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI
InChI Identifier (International Chemical Identifier) • Developed by IUPAC and NIST [first release 2005] • One reference implementation • Multiple hierarchical layers – No bond order OH • Not a file format! main layer H O 35 Cl + formula connection static and mobile H NH3
InChI=1S/C2H4ClNO2/c3-1(4)2(5)6/h1H,4H2,(H,5,6)/p+1/t1-/m1/s1/i3+0
version charge stereo isotope
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES
SMILES (Simplified Molecular Input Line Entry Specification) OH c1ccccc1O Oc1ccccc1 C1=CC=CC=C1O C1=CC=C(C=C1)O [CH]1=[CH][CH]=[CH][CH]=C1[OH] • Canonical SMILES [OEChem: c1ccc(cc1)O] – Unique name for each molecule in one system – Not a global identifier • Canonical Isomeric SMILES OH – Encode isotope, double bond and chiral configuration H C(C(=O)O)([NH3+])Cl O 35 [C@H](C(=O)O)([NH3+])[35Cl] Cl + NH3 [1] SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules David Weininger, J. Chem. Inf. Comput. Sci., 1988 (28)
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Molecule Registration
4 Is this already 5 in my 3 2 1 7 database? 6
OH 1 OH 2 N 7 N 3
6 4 O 5
HN
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Molecule Equivalency - Isomorphism
• NP problem
1 4 OH • Complexity = HvyAtm! 2 5 • Impractical for molecule 7 3 3N 2 registration! 6 4 1 7 5 6
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Canonicalization
1 OH 3 • Transforms any 2 4 connection table into a 7 N 3 6 5 unique canonical form (independent of atom 6 4 1 2 5 7 indices) • Comparing molecules comparing unique, unambiguous string representations
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES/InChI Generation
Canonical SMILES Input structure InChI (OpenEye)
Pre-processing Pre-processing [Normalization] • bond normalization • aromaticity perception \ • disconnect salts/metal atoms • eliminating radicals • normalize mobile hydrogens, variable protonation and charge • tautomer detection
Canonicalization Canonicalization (partitioning algorithm) (Morgan algorithm)
SMILES generation InChI generation
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Canonicalization
Finding unique atom ordering 1. Initial partitioning of atoms by distinguishing them by their local properties.
O N O O N
2. Recursively refining atom classes by distinguishing atoms by their neighbors. 3. When each atom belong to a unique partition unique canonical atom order exists.
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES - Generation
O N 1 2 3 3 O O N 1 2
c1cc2cc([nH]c2cc1C[C@@H](CC(=O)O)N)/C=C/C3CCC(=O)C3
Based on the canonical ordering of atoms, a unique SMILES is generated by depth-first search (DFS).
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES Validation
Test Set Name has to be 10M compounds independent of atom (ACD, Aldrich, Maybridge, MDDR, NCI etc.) numbering! 1 OH 5 repeat 100 × 2 7 7 N 3 6 1 … Randomly Renumbering Atoms 6 4 4 3 5 2 Canonical IsoSMILES Generation c1ccnc(c1)O
Identical no String yes CT SMILES CT SMILES
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES Validation
7 63 8 B B H H + B N B N N B B + H B N – O B B B O O N H Co O N H O H O – + B N H
H H SMILES bond adamantane-like stereo ring systems highly-complex ring representation systems problem
Number of total failures: 78 (0.0008%)
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI Validation
Test Set Name has to be 10M compounds independent of atom (ACD, Aldrich, Maybridge, MDDR, NCI etc.) numbering! 1 OH 5 repeat 100 × 2 7 7 N 3 6 1 … Randomly Renumbering Atoms 6 4 4 3 5 2
InChI Generation InChI=1S/C5H5NO/c7-5-3-1-2-4-5/h1-4H,(H,6,7)
Identical no String yes SDF InChI SDF InChI
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI Validation
28 27 H N O N N + N N NH H O HN – H O OH OH H O H + N N S N N + O S N H H H N HO charge layer stereo layer
Number of total failures: 55 (0.0006%)
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Molecule Equivalency (Kekule)
Canonical SMILES • aromatic form (no OH OH bond between aromatic atoms) N N InChI C1(=CC=CC=N1)O C=1(C=CC=CN1)O • no bond information
c1cnc(O)ccc1 InChI=1S/c5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7)
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Molecule Equivalency - Tautomer
• J. Comput.-Aided Mol. Des. (June 2010) OH O For registration: N HN Canonical tautomer does not have to be • the chemically most reasonable • the lowest in energy Canonical tautomer has to be identical for every tautomer
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI- Tautomer Detection
M,Z = trivalent N, bivalent O, S, Se, Te M Q ZH MH Q Z Q = C, N, S, P, Sb, As, Se, Te, Br, Cl, I M Z− M− Q Z H = H, D, T
R2 R2 R3 R3 R2 R4 R2 R4 R1 R3 R1 R3 R1 R5 R1 R5 N N N N H H Z MH ZH M
R2 R2 Z ZH R1 R3 R1 R3 R2 R3 R2 R3
R1 N R4 R1 N R4 H Z MH ZH M
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI - Handling Tautomers
O H O H O H H N N N N N N H H H N N N N H N N N N N H H
Same InChI 11 4 7 for all 15 10 2 guanine 1 5 3 8 tautomers! 6 9
InChI = 1S/C5H5N5O/c6-5-93-2(4(11)10-5)7-1-8-3/h1H,(H4,6,7,8,9,10,11)
hydrogen layer
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES – Handling tautomers
• Identifying and handling tautomers is not part of the language • Tautomeric structures are explicitly specified – No tautomeric bond – No mobile hydrogens
c1[nH]c2c(n1)[nH]c(nc2=O)N c1[nH]c2c(=O)[nH]c(nc2n1)N O O O O O H H H N N N N N c1[nH]c2c(n1)c(=O)nc([nH]2)N N N HN HN HN c1[nH]c2c(n1)c(=O)[nH]c(n2)N N N N H N N H N N H N N N H N N N HN N c1[nH]c2c(n1)[nH]c(=N)[nH]c2=O 2 H 2 H H 2 2 H H c1[nH]c2c(n1)c(=O)[nH]c(=N)[nH]2 c1[nH]c2c(n1)nc(nc2O)N O OH OH OH OH H N N N N N c1nc-2c(nc([nH]c2n1)N)O HN N N HN HN c1[nH]c2c(n1)c(nc(n2)N)O N N N N HN N HN N HN N HN N HN N N c1nc-2c([nH]c(nc2n1)N)O H H H H H H H c1[nH]c2c(n1)[nH]c(=N)nc2O c1[nH]c2c([nH]c(=N)nc2n1)O OH OH OH OH OH H H c1[nH]c2c(n1)c(nc(=N)[nH]2)O N N N N N N N N HN HN c1nc-2c([nH]c(=N)[nH]c2n1)O N N N N H N N H2N N N N N c1[nH]c2c(n1)c([nH]c(=N)n2)O H2N 2 H H H2N HN N
5th Joint Sheffield Conference on Chemoinformatics July, 2010 SMILES/InChI Generation
Canonical SMILES Input structure InChI (OpenEye)
Pre-processing Pre-processing [Normalization] • bond normalization • aromaticity perception \ • disconnect salts/metal atoms • eliminating radicals • normalize mobile hydrogens, variable protonation and charge QuacPac • tautomer detection
Canonicalization Canonicalization (partitioning algorithm) (Morgan algorithm)
SMILES generation InChI generation
5th Joint Sheffield Conference on Chemoinformatics July, 2010 QuacPac – Canonical Tautomer
O O O H N N N N N N H H N N N N N N N N N H H donor acceptor 1. atom typing 2. removing (+counting) hydrogens 3. unspecifying bond types Global algorithm that 4. enumerating tautomers (DFS) systematically generates • protonating/deprotonating atom all tautomers! 5. Empirical, rule-based scoring to find the canonical tautomer
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Validation – Canonical Tautomer
Test Set 40K structures tautomer set of an input guanine
Generate all tautomers
Generate canonical tautomer
O same H no N Identical canonical N Canonical tautomer H N N N Tautomer 2 H yes
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Tautomer Generation for SMILES
NH2
• Input: 40K N • Generated tautomers: 230K • Failures: 272 (0.007%) NH NH • Failures due to changing 2 N E/Z geometry 269 N H
NH 2 NH2
N N C/N=C(/CCC)N CN=C(CCC)N
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Identifying Duplicate Compounds ~ 13K duplicates identified by both TauSMILES InChI methods
14000 24 14000 60 12000 1532 12000 334 10000 10000
8000 8000
6000 6000
4000 4000 10 21
2000 2000 Number of duplicates of Number 1 duplicates of Number 0 0 0 CMC MDDR NCI ACD CMC MDDR NCI ACD
CMC MDDR NCI ACD Both InChi & TauSMILES 14 704 12784 8940 Only InChI 1 10 24 1532 Only TauSMILES 0 21 60 334
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Duplicates
O O O
OH O O NSC numbers: 49178,51705,83559 ,83560,83561, NSC numbers: NSC numbers: 83562,97202,112278,187662 74869 660048
O OH
OH O
H H H H N N N N + – + – N O N O
HO NH O NH2 O O
NSC numbers: NSC numbers: NSC numbers: NSC numbers: 60188 408479 98689 98690
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Differences – Tautomer recognition
N O N OH
N N
OH O c1ccn2c(c1)nc(=O)cc2O
C8H6N2O2/c11-7-5-8(12)10-4-2-1-3-6(10)9-7/h1-5,12H
C8H6N2O2/c11-7-5-8(12)10-4-2-1-3-6(10)9-7/h1-5,11H
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI Removes Metal Bonds
N N N N N N Na N N
S S Na
C7H6N4S.Na/c12-7-8-9-10-11(7)6-4-2-1-3-5-6;/h1- 5H,(H,8,10,12);/q;+1/p-1
c1ccc(cc1)[n@]2c(nnn2)S[Na]
c1ccc(cc1)[n@]2c(=S)[n@](nn2)[Na]
5th Joint Sheffield Conference on Chemoinformatics July, 2010 InChI Automatically Cleans Stereo
N N O S O SH
N H N
C5H8N2OS/c1-2-7-4(8)3-6-5(7)9/h2-3H2,1H3,(H,6,9)
CC[N@]1C(=O)CNC1=S
CC[N@@]1C(=O)CNC1=S
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Conclusion
• Consistent unique • Algorithm dependent identifier (because unique identifier reference implementation) • Tautomer handling can be • Normalization/tautomer performed by separate identification part of the application language
INCHI SMILES Human readable/writable NO YES Compact NO YES Expresses similarity (LINGOs) NO YES
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Acknowledgments
Roger Sayle
Brian Cole Ben Ellingson Paul Hawkins
5th Joint Sheffield Conference on Chemoinformatics July, 2010 Thank you for your attention!
For more information, please contact us. [email protected] [email protected] www.eyesopen.com 505-473-7385
5th Joint Sheffield Conference on Chemoinformatics July, 2010