HOLMES: a Hybrid Ontology-Learning Materials Engineering System
HOLMES: A Hybrid Ontology-Learning Materials Engineering System

Miguel Francisco Miravite Remolona

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2018

© 2018 Miguel Francisco Miravite Remolona
All rights reserved

ABSTRACT

Designing and discovering novel materials is a challenging problem in many domains, such as fuel additives, composites, and pharmaceuticals. At the core of this problem are models that capture how the different domain-specific data, information, and knowledge regarding the structures and properties of the materials are related to one another. This dissertation explores the difficult task of developing an artificial intelligence-based knowledge modeling environment, called the Hybrid Ontology-Learning Materials Engineering System (HOLMES), that can assist humans in populating a materials science and engineering ontology through automatic information extraction from journal article abstracts. While what we propose may be adapted for a generic materials engineering application, our focus in this thesis is on the needs of the pharmaceutical industry. We develop the Columbia Ontology for Pharmaceutical Engineering (COPE), which is a modification of the Purdue Ontology for Pharmaceutical Engineering. COPE serves as the basis for HOLMES. The HOLMES framework starts with journal articles that are in the Portable Document Format (PDF) and ends with the assignment of the entries in the journal articles to ontologies. While this might seem to be a simple information extraction task, extracting the information so that a fully developed ontology is populated as completely and correctly as possible is not easy.
In developing the information extraction tasks, we note two new problems that have not arisen in previous information extraction work in the literature. The first is the need to extract auxiliary information in the form of concepts such as actions, ideas, problem specifications, and properties. The second is that a single token can carry multiple labels because of these concepts. These two problems are the focus of this dissertation. In this work, the HOLMES framework is first presented as a whole, describing our successful progress as well as the unsolved problems, which might help future research on this topic. The ontology is then presented to help identify the relevant information that needs to be retrieved. Next, the annotations are developed to create the data sets needed by the machine learning algorithms. Then, the current level of information extraction for these concepts is explored and expanded. This is done by introducing entity feature sets that are based on entities previously extracted in the entity recognition task. Finally, the new task of handling multiple labels when tagging a single entity is explored using multiple-label algorithms that are used primarily in image processing.

Table of Contents

List of Figures
List of Tables
Acronyms
Acknowledgements

Chapter 1. Introduction
  1.1. Overview of HOLMES
  1.2. Some Challenges with this Approach
  1.3. Objectives of Thesis
  1.4. Outline of Thesis

Chapter 2. Hybrid Ontology-Learning Material Engineering System (HOLMES) Framework
  2.1. Image Processing
    2.1.1. Document Image Analysis
    2.1.2. Formula Processing
    2.1.3. Other Image Processing Tasks
    2.1.4. Unknown Figures
  2.2. Natural Language Processing
    2.2.1. Entity Recognition and Concept Detection
    2.2.2. Relation Extraction and Clustering
  2.3. Other Extracted Information
    2.3.1. Metadata and References
    2.3.2. Tables
    2.3.3. Headers and Footers
    2.3.4. Captions
    2.3.5. Coreference Resolution and Entity Linking
  2.4. Integrated Tasks
    2.4.1. Document Sectioning and Formula Processing
    2.4.2. Entity Recognition and Concept Detection
    2.4.3. Entity Recognition and Concept Detection, with Relation Extraction and Clustering
    2.4.4. Metadata and Natural Language Processing
  2.5. Programming Structure
  2.6. Overview of General Machine Learning Algorithms
    2.6.1. Support Vector Machines
    2.6.2. Hidden Markov Models
    2.6.3. Conditional Random Fields
    2.6.4. Feature Spaces
    2.6.5. Active Learning
  2.7. Holmes Summary

Chapter 3. Ontologies
  3.1. Ontologies Overview
    3.1.1. Value and Dimensions Ontology
    3.1.2. Scientific Concept Ontology
    3.1.3. Mathematical Model Ontology
    3.1.4. Physical Object Ontology
    3.1.5. General Properties Ontology
    3.1.6. Physical and Chemical Properties Ontology
    3.1.7. Materials Ontology
    3.1.8. Pure Chemical Substance Ontology
    3.1.9. Process and Unit Operations Ontology
    3.1.10. Physical, Chemical, and Biological Reaction Ontology
  3.2. Data in Ontologies