Adaptive Semantic Annotation of Entity and Concept Mentions in Text
ADAPTIVE SEMANTIC ANNOTATION OF ENTITY AND CONCEPT MENTIONS IN TEXT

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

By

PABLO N. MENDES
M.Sc., Federal University of Rio de Janeiro, 2005
B.Sc., Federal University of Juiz de Fora, 2003

2013
Wright State University

WRIGHT STATE UNIVERSITY GRADUATE SCHOOL

June 1, 2014

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Pablo N. Mendes ENTITLED Adaptive Semantic Annotation of Entity and Concept Mentions in Text BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.

Amit P. Sheth, Ph.D.
Thesis Director

Arthur Goshtasby, Ph.D.
Director, Computer Science & Engineering Graduate Program

R. William Ayres, Ph.D.
Interim Dean, Graduate School

Committee on Final Examination:
Amit P. Sheth, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.
Shaojun Wang, Ph.D.
Sören Auer, Ph.D.

ABSTRACT

Mendes, Pablo N. Ph.D., Department of Computer Science and Engineering, Wright State University, 2013. Adaptive Semantic Annotation of Entity and Concept Mentions in Text.

Recent years have seen an increase in interest in knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary to exchange and organize information in different formats and for different purposes. Therefore, there has been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories, so that the content of those documents is described in a vocabulary that is unambiguously understood across applications.

Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts mentioned in a particular passage and resolving any ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, it is often unclear whether an existing solution is applicable, leading those developers to trial-and-error and ad hoc usage of multiple systems in an attempt to achieve their objective.

In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework, enabling the elucidation of the strengths and limitations of each technique and supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that have the ability to support document tagging for different use cases.
I describe such an implementation, DBpedia Spotlight, along with extensions we made to the DBpedia knowledge base to support it. I report evaluations of this tool on several well-known data sets, and demonstrate applications to diverse use cases for further validation.

Contents

1 Introduction 1
  1.1 Historical Context 1
  1.2 Use Cases for Semantic Annotation 3
  1.3 Thesis statement 5
  1.4 Organization 6
  1.5 Notation 6
2 Background and Related Work 7
  2.1 Methodology for the Literature Review 7
  2.2 A Review of Information Extraction Tasks 8
  2.3 State of the art 17
  2.4 Cross-task Benchmarks 18
  2.5 Conclusion 20
3 A Conceptual Framework for Semantic Annotation 21
  3.1 Definitions 21
    3.1.1 Actors 22
    3.1.2 Objective 25
    3.1.3 Textual Content 26
    3.1.4 Knowledge Base 27
    3.1.5 System 29
    3.1.6 Annotations 34
  3.2 Conclusion 35
4 The DBpedia Knowledge Base 36
  4.1 Extracting a Knowledge Graph from Wikipedia 36
    4.1.1 Extracting Triples 37
    4.1.2 The DBpedia Ontology 38
    4.1.3 Cross-Language Data Fusion 39
    4.1.4 RDF Links to other Data Sets 42
  4.2 Supporting Natural Language Processing 43
    4.2.1 The Lexicalization Data Set 43
    4.2.2 The Thematic Concepts Data Set 45
    4.2.3 The Grammatical Gender Data Set 46
    4.2.4 Occurrence Statistics Data Set 47
    4.2.5 The Topic Signatures Data Set 47
  4.3 Conclusion 48
5 DBpedia Spotlight: A System for Adaptive Knowledge Base Tagging of DBpedia Entities and Concepts 49
  5.1 Implementation 50
    5.1.1 Phrase Recognition 52
    5.1.2 Candidate Selection 56
    5.1.3 Disambiguation 56
    5.1.4 Relatedness 59
    5.1.5 Tagging 59
  5.2 Using DBpedia Spotlight 61
    5.2.1 Web Application 62
    5.2.2 Web Service 62
    5.2.3 Installation 64
  5.3 Continuous Evolution 65
    5.3.1 Live updates 65
    5.3.2 Feedback incorporation 66
    5.3.3 Round-trip semantics 66
  5.4 Conclusion 68
6 Core Evaluations 69
  6.1 Evaluation Corpora 69
    6.1.1 Wikipedia 69
    6.1.2 CSAW 69
  6.2 Spotting Evaluation Results 70
    6.2.1 Error Analysis 72
  6.3 Disambiguation Evaluation Results 72
    6.3.1 TAC-KBP 74
  6.4 Annotation Evaluation Results 76
    6.4.1 News Articles 76
  6.5 A Framework for Evaluating Difficulty to Disambiguate 79
    6.5.1 Comparison with annotation-as-a-service systems 81
  6.6 Conclusions 84
7 Case Studies 85
  7.1 Named Entity Recognition in Tweets 85
  7.2 Managing Information Overload 88
  7.3 Understanding Website Similarity 92
    7.3.1 Entity-aware Click Graph 93
    7.3.2 Website Similarity Graph 95
    7.3.3 Results 95
  7.4 Automated Audio Tagging 96
  7.5 Educational Material - Emergency Management 97
8 Conclusion 100
  8.1 Limitations 101
Bibliography 102

List of Figures

1.1 Example SPO triples describing the Brazilian city of Rio de Janeiro (identified by the URI Rio_de_Janeiro). It describes Rio's total area in square miles through a property (areaTotalSqMi) and relates it to another resource (Brazil) through the property country. Namespaces have been omitted for readability. 2
2.1 Key terms used in the literature review. 8
3.1 Workflow of a Knowledge Base Tagging (KBT) task. 22
3.2 Interactions between actors and a KBT system. 23
3.3 Typical internal workflow of a KBT system. 30
4.1 A snippet of the triples describing dbpedia:Rio_de_Janeiro. 37
4.2 Linking Open Data Cloud diagram [Jentzsch et al., 2011]. 42
4.3 SPARQL query demonstrating how to retrieve entities and concepts under a certain category. 45
4.4 SPARQL query demonstrating how to retrieve pages linking to topical concepts. 46
4.5 A snippet of the Grammatical Gender Data Set. 46
4.6 A snippet of the Topic Signatures Data Set. 48
5.1 Model objects in DBpedia Spotlight's core model. 51
5.2 Example phrases recognized externally and encoded as SpotXml. 55
5.3 Results for a request to /related for the top-ranked URIs by relatedness to dbpedia:Adidas. Each key-value pair represents a URI and the relatedness score (TF*IDF) between that URI and Adidas. 59
5.4 DBpedia Spotlight Web Application. 62
5.5 Example call to the Web Service using cURL. 63
5.6 Example XML fragment resulting from the annotation service. 64
5.7 Round-trip Semantics. 67
6.1 Comparison of B³ and B³+ F1 scores for NIL and Non-NIL