Extraction of Chemical Structures and Reactions from the Literature
DEPARTMENT OF CHEMISTRY

Extraction of chemical structures and reactions from the literature

Daniel Mark Lowe
Pembroke College

This dissertation is submitted for the degree of Doctor of Philosophy
June 2012

Disclaimer

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the word limit (60000) set by the Chemistry Degree Committee.

Abstract

The ever-increasing quantity of chemical literature necessitates the creation of automated techniques for extracting relevant information. This work focuses on two aspects: the conversion of chemical names to computer readable structure representations and the extraction of chemical reactions from text.

Chemical names are a common way of communicating chemical structure information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an open source, freely available algorithm for converting chemical names to structures, was developed. OPSIN employs a regular grammar to direct tokenisation and parsing, leading to the generation of an XML parse tree. Nomenclature operations are applied successively to the tree, with many requiring the manipulation of an in-memory connection table representation of the structure under construction. Areas of nomenclature supported are described, with attention being drawn to difficulties that may be encountered in name-to-structure conversion. Results on sets of generated names and names extracted from patents are presented. On generated names, recall of between 96.2% and 99.0% was achieved with a lower bound of 97.9% on precision, with all results either being comparable or superior to the tested commercial solutions. On the patent names OPSIN’s recall was 2-10% higher than the tested solutions when the patent names were processed as found in the patents.
The uses of OPSIN as a web service and as a tool for identifying chemical names in text are shown to demonstrate the direct utility of this algorithm.

A software system for extracting chemical reactions from the text of chemical patents was developed. The system relies on the output of ChemicalTagger, a tool for tagging words and identifying phrases of importance in experimental chemistry text. Improvements to this tool required to facilitate this task are documented. The structures of chemical entities are, where possible, determined using OPSIN in conjunction with a dictionary of name to structure relationships. Extracted reactions are atom mapped to confirm that they are chemically consistent. 424,621 atom mapped reactions were extracted from 65,034 organic chemistry USPTO patents. On a sample of 100 of these extracted reactions, chemical entities were identified with 96.4% recall and 88.9% precision. Quantities could be associated with reagents in 98.8% of cases and with products in 64.9% of cases, whilst the correct role was assigned to chemical entities in 91.8% of cases. Qualitatively, the system captured the essence of the reaction in 95% of cases. This system is expected to be useful in the creation of searchable databases of reactions from chemical patents and in facilitating analysis of the properties of large populations of reactions.

Acknowledgements

I would like to thank my supervisors, Professor Robert Glen and Professor Peter Murray-Rust, for their guidance and advice. I would also like to thank Dr Peter Corbett for his initial work on the OPSIN codebase, which was the precursor to the system that I developed, Dr Lezan Hawizy for her work on ChemicalTagger and many useful discussions on extending it, Dr David Jessop for his paragraph classifier, Albina Asadulina for her contribution to fused ring nomenclature support and Dr Sam Adams for many fruitful discussions on cheminformatics algorithms.
I would also like to thank my colleagues at the Unilever Centre for providing such an enjoyable working environment. I am very grateful to Boehringer Ingelheim for funding my research.

Table of Contents

Disclaimer
Abstract
Table of Contents
Glossary
Chapter 1 Introduction
  1.1 Where can text mining be performed?
  1.2 What can be text mined?
  1.3 Overview of research project
Chapter 2 Tools and Methods
  2.1 XML
  2.2 Chemical Markup Language
  2.3 SMILES
  2.4 InChI
  2.5 Formal grammars
  2.6 Automata
  2.7 Regular expressions
  2.8 OSCAR4
  2.9 ChemicalTagger
  2.10 Apache Maven
  2.11 Distributed version control
  2.12 Continuous integration testing
Chapter 3 Conversion of Chemical Names to Structures
  3.1 Introduction
    3.1.1 History of systematic nomenclature
    3.1.1 Classes of chemical name
    3.1.2 General construction of systematic names
    3.1.3 History of programmatic name to structure conversion
    3.1.4 Current solutions
  3.2 Development and implementation of OPSIN
    3.2.1 Strategy for development of OPSIN
    3.2.2 Architecture
    3.2.3 Pre-processing
    3.2.4 Tokenisation and parsing
      3.2.4.1 Introduction
      3.2.4.2 Tokenisation algorithm
      3.2.4.3 Looking up tokens in the lexicon
      3.2.4.4 Generation of parses
      3.2.4.5 Drawbacks of a regular grammar
      3.2.4.6 Right to left parsing
      3.2.4.7 XML generation
    3.2.5 CAS index name uninversion
    3.2.6 Chemical word rule assignment
    3.2.7 Component generation