Information Extraction from Chemical Patents
David Matthew Jessop
Fitzwilliam College

This dissertation is submitted for the degree of Doctor of Philosophy

Preface

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation does not exceed the word limit (60000) set by the Degree Committee.

Abstract

The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.

Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.

PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye: 4444 reactions are extracted with a precision of 78% and recall of 64% with regard to determining the identity and amount of reactants employed, and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall.
The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.

Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.

Acknowledgments

I would like to thank Prof. Robert Glen and Prof. Peter Murray-Rust for supervision. I would also like to thank all those who have contributed to the creation of the software that has made this project possible – most notably Dr Peter Corbett for his work on OSCAR3, Dr Lezan Hawizy for her work on ChemicalTagger and Daniel Lowe for his work on OPSIN. Further thanks go to all those too numerous to name at the Unilever Centre, past and present, who have contributed to discussions and supported me in my work. Special thanks go to Dr Sam Adams, who volunteered to proofread this thesis, and to Jo for her love and support. I am grateful to Unilever for funding.

Contents

Preface
Abstract
Acknowledgments
Contents
List of Figures
Glossary
1. Introduction
  1.1 Open and Closed Data
  1.2 The Semantic Web
  1.3 Semanticizing Chemistry
  1.4 Information Extraction from Chemical Documents
  1.5 Development Environment
2. Sources of Chemical Documents & Technologies for their Semantic Enrichment
  2.1 Availability of Documents
    2.1.1 Journal Articles
    2.1.2 Theses
    2.1.3 Patents
  2.2 Key Technologies
    2.2.1 XML & XPath
    2.2.2 Regular Expressions
    2.2.3 Machine-Understandable Chemical Formats
    2.2.4 Chemical Markup Language
    2.2.5 CMLXOM & JUMBO
    2.2.6 OSCAR3
    2.2.7 ChemicalTagger
    2.2.8 OSRA
  2.3 Conclusions
3. Representation and Manipulation of Markush Structures
  3.1 Markush Structures
  3.2 Polymer Markup Language
    3.2.1 Representation of Polyethylene Oxide
    3.2.2 Representation of Polystyrene
    3.2.3 Representing Variability in PML
    3.2.4 The Cambridge Polymer Builder
  3.3 Extension of PML for Markush Structures
    3.3.1 Frequency Variation
    3.3.2 Homology Variation
    3.3.3 Position Variation
    3.3.4 Position and Count Variation
    3.3.5 Inline Connection Tables
  3.4 Building Representative Examples of a Markush Structure
  3.5 Substructure Searching of Markush Structures
    3.5.1 Implementing Extended Connection Tables
    3.5.2 Building Extended Connection Tables
    3.5.3 The Relaxation Algorithm
    3.5.4 Examples
  3.6 Conclusions
4. Automatic Acquisition of Hyponymic Relations from the Chemical Literature
  4.1 Hyponymic Relations