Information Extraction Meets the Semantic Web: a Survey
Total Page:16
File Type:pdf, Size:1020Kb
Undefined 0 (2016) 1 1 IOS Press Information Extraction meets the Semantic Web: A Survey Editor(s): Andreas Hotho, Julius-Maximilians-Universität Würzburg, Würzburg, Germany Solicited review(s): Dat Ba Nguyen, Vietnam National University, Hanoi, Vietnam; Simon Scerri, University of Bonn, Germany; 2 Anonymous Reviewers Open review(s): Michelle Cheatham, Wright State University, Dayton, Ohio, USA Jose L. Martinez-Rodriguez a, Aidan Hogan b and Ivan Lopez-Arevalo a a Cinvestav Tamaulipas, Ciudad Victoria, Mexico E-mail: {lmartinez,ilopez}@tamps.cinvestav.mx b IMFD Chile; Department of Computer Science, University of Chile, Chile E-mail: [email protected] Abstract. We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Se- mantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Seman- tic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc. in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables. Keywords: Information Extraction, Entity Linking, Keyword Extraction, Topic Modeling, Relation Extraction, Semantic Web 1. Introduction able on the Web – information that is constantly chang- ing – for it to be feasible to apply manual annotation to The Semantic Web pursues a vision of the Web even a significant subset of what might be of relevance. where increased availability of structured content en- While the amount of structured data available on ables higher levels of automation. Berners-Lee [20] the Web has grown significantly in the past years, described this goal as being to “enrich human read- there is still a significant gap between the coverage able web data with machine readable annotations, al- of structured and unstructured data available on the lowing the Web’s evolution as the biggest database in Web [248]. Mika referred to this as the semantic the world”. However, making annotations on informa- gap [205], whereby the demand for structured data on tion from the Web is a non-trivial task for human users, the Web outstrips its supply. For example, in an anal- particularly if some formal agreement is required to ysis of the 2013 Common Crawl dataset, Meusel et ensure that annotations are consistent across sources. al. [201] found that of the 2.2 billion webpages con- Likewise, there is simply too much information avail- sidered, 26.3% contained some structured metadata. 0000-0000/16/$00.00 © 2016 – IOS Press and the authors. All rights reserved 2 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web Thus, despite initiatives like Linking Open Data [274], language that is founded on one of the core Semantic Schema.org [200,204] (promoted by Google, Mi- Web standards: RDF/RDFS/OWL/SKOS/SPARQL.2 crosoft, Yahoo, and Yandex) and the Open Graph Pro- By Information Extraction methods, we focus on the tocol [127] (promoted by Facebook), this semantic gap extraction and/or linking of three main elements from is still observable on the Web today [205,201]. an (unstructured or semi-structured) input source. As a result, methods to automatically extract or en- hance the structure of various corpora have been a core 1. Entities: anything with named identity, typically topic in the context of the Semantic Web. Such pro- an individual (e.g., Barack Obama, 1961). cesses are often based on Information Extraction meth- 2. Concepts: a conceptual grouping of elements. We ods, which in turn are rooted in techniques from areas consider two types of concepts: such as Natural Language Processing, Machine Learn- – Classes: a named set of individuals (e.g., ing and Information Retrieval. The combination of U.S. President(s)); techniques from the Semantic Web and from Informa- – Topics: categories to which individuals or tion Extraction can be seen from two perspectives: on documents relate (e.g, U.S. Politics). the one hand, Information Extraction techniques can 3. Relations: an n-ary tuple of entities (n ≥ 2) with a be applied to populate the Semantic Web, while on the predicate term denoting the type of relation (e.g., other hand, Semantic Web techniques can be applied marry(Barack Obama;Michele Obama;Chicago). to guide the Information Extraction process. In some cases, both aspects are considered together, where an More formally, we can consider entities as atomic el- existing Semantic Web ontology or knowledge-base is ements from the domain, concepts as unary predi- used to guide the extraction, which further populates cates, and relations as n-ary (n ≥ 2) predicates. We the given ontology and/or knowledge-base (KB).1 take a rather liberal interpretation of concepts to in- In the past years, we have seen a wealth of research clude both classes based on set-theoretic subsumption dedicated to Information Extraction in a Semantic Web of instances (e.g., OWL classes [135]), as well as top- setting. While many such papers come from within ics that form categories over which broader/narrower the Semantic Web community, many recent works relations can be defined (e.g., SKOS concepts [206]). have come from other communities, where, in partic- This is rather a practical decision that will allow us to ular, general-knowledge Semantic Web KBs – such draw together a collective summary of works in the in- as DBpedia [170], Freebase [26] and YAGO2 [138]– terrelated areas of Terminology Extraction, Keyword have been broadly adopted as references for enhanc- Extraction, Topic Modeling, etc., under one heading. ing Information Extraction tasks. Given the wide va- Returning to “extracting and/or linking”, we con- riety of works emerging in this particular intersection sider the extraction process as identifying mentions re- from various communities (sometimes under different ferring to such entities/concepts/relations in the un- nomenclatures), we see that a comprehensive survey structured or semi-structured input, while we consider is needed to draw together the techniques proposed in the linking process as associating a disambiguated such works. Our goal is then to provide such a survey. identifier in a Semantic Web ontology/KB for a men- tion, possibly creating one if not already present and Survey Scope: This survey provides an overview of using it to disambiguate and link further mentions. published works that directly involve both Information Extraction methods and Semantic Web technologies. Information Extraction Tasks: The survey deals with Given that both are very broad areas, we must be rather various Information Extraction tasks. We now give an explicit in our inclusion criteria. introductory summary of the main tasks considered With respect to Semantic Web technologies, to be (though we note that the survey will delve into each included in the scope of a survey, a work must make task in much more depth later): non-trivial use of an ontology, knowledge-base, tool or Named Entity Recognition: demarcate the locations of mentions of entities in an input text: 1Herein we adopt the convention that the term “ontology” refers – aka. Entity Recognition, Entity Extraction; primarily to terminological knowledge, meaning that it describes classes and properties of the domain, such as person, knows, coun- try, etc. On the other hand, we use the term “KB” to refer to primar- 2Works that simply mention general terms such as “semantic” or ily “assertional knowledge”, which describes specific entities (aka. “ontology” may be excluded by this criteria if they do not also di- individuals) of the domain, such as Barack Obama, China, etc. rectly use or depend upon a Semantic Web standard. J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web 3 – e.g., in the sentence “Barack Obama was Topic Labeling: For clusters of words identified as born in Hawaii”, mark the underlined abstract topics, extract a single term or phrase that phrases as entity mentions. best characterizes the topic; Entity Linking: associate mentions of entities with – aka. Topic Identification, esp. when linked an appropriate disambiguated KB identifier: with an ontology/KB identifier; often used – involves, or is sometimes synonymous with, for the purposes of Text Classification; Entity Disambiguation;3 often used for the – e.g., identify that the topic f “cancer”, purposes of Semantic Annotation; “breast”, “doctor”, “chemotherapy” g is – e.g., associate “Hawaii” with the DBpedia best characterized with the term “cancer” identifier dbr:Hawaii for the U.S. state (potentially linked to dbr:Cancer for the (rather than the identifier for various songs disease and not, e.g., the astrological sign). 4 or books by the same name). Relation Extraction: Extract potentially n-ary rela- Terminology Extraction: extract the main phrases tions (for n ≥ 2) from an unstructured (i.e., text) that denote concepts relevant to a given domain or semi-structured (e.g., HTML table) source; described by a corpus, sometimes inducing hier- – a goal of the area of Open Information Ex- archical relations between concepts; traction; – aka.