Computer-Aided Information Retrieval and Management System from Scientific Documents

Computergestützte Informationsbeschaffung und -verwaltung aus wissenschaftlichen Dokumenten
Computer-aided information retrieval and management system from scientific documents

DISSERTATION approved by the KIT Department of Chemistry and Biosciences of the Karlsruhe Institute of Technology (KIT) for the academic degree of DOCTOR OF NATURAL SCIENCES (Dr. rer. nat.)

by M.Sc. Thanh Cam An Nguyen from Hue, Vietnam

Dean: Prof. Dr. Manfred Wilhelm
Referee: Prof. Dr. Stefan Bräse
Co-referee: Prof. Dr. Ralf H. Reussner
Date of oral examination: 12.12.2019

For my family, my friends, and myself.

Problems can never be solved with the same way of thinking that created them. - Albert Einstein

This work was carried out from 15.06.2016 to 06.11.2019 at the Institute of Organic Chemistry of the Department of Chemistry and Biosciences at the Karlsruhe Institute of Technology (KIT), Campus North, under the supervision of Prof. Dr. Stefan Bräse.

I hereby declare that I have written this dissertation independently and without unauthorized aids. No sources or aids other than those stated were used. Passages taken from sources, whether verbatim or in substance, have been marked as such. This work has not previously been submitted to another examination authority or published, in the same or a similar form. I have observed the rules for safeguarding good scientific practice at the Karlsruhe Institute of Technology (KIT).

Table of Contents

Table of Contents
Table of Figures
Summary
Introduction
    1.1 Motivation
    1.2 Problem Statement
    1.3 Related Studies
    1.4 Overview of the research project
Background
    2.1 Molecular representing
        2.1.1. SMILES
        2.1.2. Molfile
        2.1.3. Molecular fingerprints
    2.2 ChemDraw related formats
        2.2.1. ChemDraw Binary Format (CDX)
        2.2.2. Embedded ChemDraw in Microsoft Word documents
        2.2.3. ChemDraw XML (CDXML)
    2.3 Literatures Review
        2.3.1. Chemical Entity Recognition
        2.3.2. Optical Chemical Structure Recognition
        2.3.3. ChemDraw Mining
ChemScanner
    3.1 File Reader
        3.1.1. ChemDraw Parser
    3.2 Pre-processing
        3.2.1. Extract nested objects
        3.2.2. Graphic refinement
        3.2.3. Arrow refinement
        3.2.4. Molecule compilation
    3.3 Scheme Analysis
        3.3.1. ChemScanner Arrow
        3.3.2. Analysis Procedure
        3.3.3. Arrow area builder
        3.3.4. Role detection
    3.4 Special case processing
        3.4.1. Molecule group
        3.4.2. Multiple line chain reactions
    3.5 Text processing
        3.5.1. Abbreviations and superatoms
        3.5.2. Text assignment
        3.5.3. Molecular text processing
        3.5.4. Reaction text processing
    3.6 Post-processing
        3.6.1. Text as molecules
        3.6.2. Multi-step reaction
        3.6.3. Generation of new elements
ChemScanner User Interface
    4.1 Overview
    4.2 Abbreviation and Superatom management
    4.3 Scanned Output
        4.3.1. Results display
        4.3.2. Results edit
    4.4 File Storage
        4.4.1. Backend classes
        4.4.2. File Storage interface
Chemotion ELN integration
    5.1 Chemical searching
        5.1.1. Exact structure search
        5.1.2. Substructure search
        5.1.3. Similarity search
    5.2 Computational molecule properties interface
    5.3 Green Chemistry
        5.3.1. Atom Economy
        5.3.2. Environmental (E) factor
        5.3.3. Green Chemistry Interface
Summary and conclusions
    Further Impact
    Areas for future work
Abbreviations
Literatures
Appendix
    9.1 ChemScanner Usage and APIs
        9.1.1. Usage
        9.1.2. API
    9.2 Acknowledgments

Quantitative Structure–Activity Relationships (QSAR) Study on Novel 4-Amidinoquinoline and 10-Amidinobenzonaphthyridine Derivatives as Potent Antimalaria Agent

Reversal of Multidrug Resistance in Mouse Lymphoma Cells by Extracts and Flavonoids from Pistacia Integerrima

Molecular Structure Input on the Web, Peter Ertl

A Web-Based 3D Molecular Structure Editor and Visualizer Platform

Spoken Tutorial Project, IIT Bombay Brochure for Chemistry Department

Designing Universal Chemical Markup (UCM) Through the Reusable Methodology Based on Analyzing Existing Related Formats

Notes on OLEX2

Visualizing 3D Molecular Structures Using an Augmented Reality App

Organic & Biomolecular Chemistry

Molecular Dynamics Simulations of Tight Junction Proteins

ChemDoodle Web Components: HTML5 Toolkit for Chemical Graphics, Interfaces, and Informatics, Melanie C Burger

“Interpretation of Molecule Conformations from Drawn Diagrams” by Dana Tenneson, Ph.D., Brown University, May 2008