Undefined 0 (2016) 1. IOS Press
Information Extraction meets the Semantic Web: A Survey
Editor(s): Andreas Hotho, Julius-Maximilians-Universität Würzburg, Würzburg, Germany Solicited review(s): Dat Ba Nguyen, Vietnam National University, Hanoi, Vietnam; Simon Scerri, University of Bonn, Germany; 2 Anonymous Reviewers Open review(s): Michelle Cheatham, Wright State University, Dayton, Ohio, USA
Jose L. Martinez-Rodriguez (a), Aidan Hogan (b) and Ivan Lopez-Arevalo (a)
(a) Cinvestav Tamaulipas, Ciudad Victoria, Mexico. E-mail: {lmartinez,ilopez}@tamps.cinvestav.mx
(b) IMFD Chile; Department of Computer Science, University of Chile, Chile. E-mail: [email protected]
Abstract. We provide a comprehensive survey of the research literature that applies Information Extraction techniques in a Semantic Web setting. Works in the intersection of these two areas can be seen from two overlapping perspectives: using Semantic Web resources (languages/ontologies/knowledge-bases/tools) to improve Information Extraction, and/or using Information Extraction to populate the Semantic Web. In more detail, we focus on the extraction and linking of three elements: entities, concepts and relations. Extraction involves identifying (textual) mentions referring to such elements in a given unstructured or semi-structured input source. Linking involves associating each such mention with an appropriate disambiguated identifier referring to the same element in a Semantic Web knowledge-base (or ontology), in some cases creating a new identifier where necessary. With respect to entities, works involving (Named) Entity Recognition, Entity Disambiguation, Entity Linking, etc., in the context of the Semantic Web are considered. With respect to concepts, works involving Terminology Extraction, Keyword Extraction, Topic Modeling, Topic Labeling, etc., in the context of the Semantic Web are considered. Finally, with respect to relations, works involving Relation Extraction in the context of the Semantic Web are considered. The focus of the majority of the survey is on works applied to unstructured sources (text in natural language); however, we also provide an overview of works that develop custom techniques adapted for semi-structured inputs, namely markup documents and web tables.
Keywords: Information Extraction, Entity Linking, Keyword Extraction, Topic Modeling, Relation Extraction, Semantic Web
1. Introduction

The Semantic Web pursues a vision of the Web where increased availability of structured content enables higher levels of automation. Berners-Lee [20] described this goal as being to "enrich human readable web data with machine readable annotations, allowing the Web's evolution as the biggest database in the world". However, making annotations on information from the Web is a non-trivial task for human users, particularly if some formal agreement is required to ensure that annotations are consistent across sources. Likewise, there is simply too much information available on the Web – information that is constantly changing – for it to be feasible to apply manual annotation to even a significant subset of what might be of relevance.

While the amount of structured data available on the Web has grown significantly in the past years, there is still a significant gap between the coverage of structured and unstructured data available on the Web [248]. Mika referred to this as the semantic gap [205], whereby the demand for structured data on the Web outstrips its supply. For example, in an analysis of the 2013 Common Crawl dataset, Meusel et al. [201] found that of the 2.2 billion webpages considered, 26.3% contained some structured metadata.
Thus, despite initiatives like Linking Open Data [274], Schema.org [200,204] (promoted by Google, Microsoft, Yahoo, and Yandex) and the Open Graph Protocol [127] (promoted by Facebook), this semantic gap is still observable on the Web today [205,201].

As a result, methods to automatically extract or enhance the structure of various corpora have been a core topic in the context of the Semantic Web. Such processes are often based on Information Extraction methods, which in turn are rooted in techniques from areas such as Natural Language Processing, Machine Learning and Information Retrieval. The combination of techniques from the Semantic Web and from Information Extraction can be seen from two perspectives: on the one hand, Information Extraction techniques can be applied to populate the Semantic Web, while on the other hand, Semantic Web techniques can be applied to guide the Information Extraction process. In some cases, both aspects are considered together, where an existing Semantic Web ontology or knowledge-base is used to guide the extraction, which further populates the given ontology and/or knowledge-base (KB).¹

¹ Herein we adopt the convention that the term "ontology" refers primarily to terminological knowledge, meaning that it describes classes and properties of the domain, such as person, knows, country, etc. On the other hand, we use the term "KB" to refer to primarily "assertional knowledge", which describes specific entities (aka. individuals) of the domain, such as Barack Obama, China, etc.

In the past years, we have seen a wealth of research dedicated to Information Extraction in a Semantic Web setting. While many such papers come from within the Semantic Web community, many recent works have come from other communities, where, in particular, general-knowledge Semantic Web KBs – such as DBpedia [170], Freebase [26] and YAGO2 [138] – have been broadly adopted as references for enhancing Information Extraction tasks. Given the wide variety of works emerging in this particular intersection from various communities (sometimes under different nomenclatures), we see that a comprehensive survey is needed to draw together the techniques proposed in such works. Our goal is then to provide such a survey.

Survey Scope: This survey provides an overview of published works that directly involve both Information Extraction methods and Semantic Web technologies. Given that both are very broad areas, we must be rather explicit in our inclusion criteria.

With respect to Semantic Web technologies, to be included in the scope of the survey, a work must make non-trivial use of an ontology, knowledge-base, tool or language that is founded on one of the core Semantic Web standards: RDF/RDFS/OWL/SKOS/SPARQL.²

² Works that simply mention general terms such as "semantic" or "ontology" may be excluded by this criterion if they do not also directly use or depend upon a Semantic Web standard.

By Information Extraction methods, we focus on the extraction and/or linking of three main elements from an (unstructured or semi-structured) input source:

1. Entities: anything with named identity, typically an individual (e.g., Barack Obama, 1961).
2. Concepts: a conceptual grouping of elements. We consider two types of concepts:
– Classes: a named set of individuals (e.g., U.S. President(s));
– Topics: categories to which individuals or documents relate (e.g., U.S. Politics).
3. Relations: an n-ary tuple of entities (n ≥ 2) with a predicate term denoting the type of relation (e.g., marry(Barack Obama, Michelle Obama, Chicago)).

More formally, we can consider entities as atomic elements from the domain, concepts as unary predicates, and relations as n-ary (n ≥ 2) predicates. We take a rather liberal interpretation of concepts to include both classes based on set-theoretic subsumption of instances (e.g., OWL classes [135]), as well as topics that form categories over which broader/narrower relations can be defined (e.g., SKOS concepts [206]). This is rather a practical decision that will allow us to draw together a collective summary of works in the interrelated areas of Terminology Extraction, Keyword Extraction, Topic Modeling, etc., under one heading.

Returning to "extracting and/or linking", we consider the extraction process as identifying mentions referring to such entities/concepts/relations in the unstructured or semi-structured input, while we consider the linking process as associating a disambiguated identifier in a Semantic Web ontology/KB with a mention, possibly creating one if not already present and using it to disambiguate and link further mentions.

Information Extraction Tasks: The survey deals with various Information Extraction tasks. We now give an introductory summary of the main tasks considered (though we note that the survey will delve into each task in much more depth later):

Named Entity Recognition: demarcate the locations of mentions of entities in an input text;
– aka. Entity Recognition, Entity Extraction;
– e.g., in the sentence "Barack Obama was born in Hawaii", mark the underlined phrases as entity mentions.

Entity Linking: associate mentions of entities with an appropriate disambiguated KB identifier;
– involves, or is sometimes synonymous with, Entity Disambiguation;³ often used for the purposes of Semantic Annotation;
– e.g., associate "Hawaii" with the DBpedia identifier dbr:Hawaii for the U.S. state (rather than the identifier for various songs or books by the same name).⁴

³ In some cases Entity Linking is considered to include both recognition and disambiguation; in other cases, it is considered synonymous with disambiguation applied after recognition.
⁴ We use well-known IRI prefixes as consistent with the lookup service hosted at http://prefix.cc. All URLs in this paper were last accessed on 2018/05/30.

Terminology Extraction: extract the main phrases that denote concepts relevant to a given domain described by a corpus, sometimes inducing hierarchical relations between concepts;
– aka. Term Extraction, often used for the purposes of Ontology Learning;
– e.g., identify from a text on Oncology that "breast cancer" and "melanoma" are important concepts in the domain;
– optionally identify that both of the above concepts are specializations of "cancer";
– terms may be linked to a KB/ontology.

Keyphrase Extraction: extract the main phrases that categorize the subject/domain of a text (unlike Terminology Extraction, the focus is often on describing the document, not the domain);
– aka. Keyword Extraction, which is often generically applied to cover extraction of multi-word phrases; often used for the purposes of Semantic Annotation;
– e.g., identify that the keyphrases "breast cancer" and "mammogram" help to summarize the subject of a particular document;
– keyphrases may be linked to a KB/ontology.

Topic Modeling: cluster words/phrases frequently co-occurring together in the same context; these clusters are then interpreted as being associated with abstract topics to which a text relates;
– aka. Topic Extraction, Topic Classification;
– e.g., identify that words such as "cancer", "breast", "doctor", "chemotherapy" tend to co-occur frequently and thus conclude that a document containing many such occurrences is about a particular abstract topic.

Topic Labeling: for clusters of words identified as abstract topics, extract a single term or phrase that best characterizes the topic;
– aka. Topic Identification, esp. when linked with an ontology/KB identifier; often used for the purposes of Text Classification;
– e.g., identify that the topic { "cancer", "breast", "doctor", "chemotherapy" } is best characterized by the term "cancer" (potentially linked to dbr:Cancer for the disease and not, e.g., the astrological sign).

Relation Extraction: extract potentially n-ary relations (for n ≥ 2) from an unstructured (i.e., text) or semi-structured (e.g., HTML table) source;
– a goal of the area of Open Information Extraction;
– e.g., in the sentence "Barack Obama was born in Hawaii", extract the binary relation wasBornIn(Barack Obama, Hawaii);
– binary relations may be represented as RDF triples after linking entities and linking the predicate to an appropriate property (e.g., mapping wasBornIn to the DBpedia property dbo:birthPlace);
– n-ary (n ≥ 3) relations are often represented with a variant of reification [133,271].

Note that we will use a more simplified nomenclature – {Entity, Concept, Relation} × {Extraction, Linking} – as previously described to structure our survey with the goal of grouping related works together; thus, works on Terminology Extraction, Keyphrase Extraction, Topic Modeling and Topic Labeling will be grouped under the heading of Concept Extraction and Linking.

Again, we are only interested in such tasks in the context of the Semantic Web. Our focus is on unstructured (text) inputs, but we will also give an overview of methods for semi-structured inputs (markup documents and tables) towards the end of the survey.

Related Areas, Surveys and Novelty: There are a variety of areas that relate to and overlap with the scope of this survey, and likewise there have been a number of previous surveys in these areas. We now discuss such areas and surveys, how they relate to the current contribution, and outline the novelty of the current survey.

As we will see throughout this survey, Information Extraction (IE) from unstructured sources – i.e., textual corpora expressed primarily in natural language – relies heavily on Natural Language Processing (NLP). A number of resources have been published within the intersection of NLP and the Semantic Web (SW), where we can point, for example, to a recent book published by Maynard et al. [190] in 2016, which likewise covers topics relating to IE. However, while IE tools may often depend on NLP processing techniques, this is not always the case, where many modern approaches to tasks such as Entity Linking do not use a traditional NLP processing pipeline. Furthermore, unlike the introductory textbook by Maynard et al. [190], our goal here is to provide a comprehensive survey of the research works in the area. Note that we also provide a brief primer on the most important NLP techniques in a supplementary appendix, discussed later.

On the other hand, Data Mining involves extracting patterns inherent in a dataset. Example Data Mining tasks include classification, clustering, rule mining, predictive analysis, outlier detection, recommendation, etc. Knowledge Discovery refers to a higher-level process to help users extract knowledge from raw data, where a typical pipeline involves selection of data, pre-processing and transformation of data, a Data Mining phase to extract patterns, and finally evaluation and visualization to aid users gain knowledge from the raw data and provide feedback.
Ontol- brief primer on the most important NLP techniques in ogy Learning also often includes Ontology Popu- a supplementary appendix, discussed later. lation, meaning that instance of concepts and re- On the other hand, Data Mining involves extract- lations are also extracted. Such works fall within ing patterns inherent in a dataset. Example Data Min- our scope. A survey of Ontology Learning was ing tasks include classification, clustering, rule min- provided by Wong et al. [316] in 2012. ing, predictive analysis, outlier detection, recommen- Knowledge Extraction: aims to lift an unstructured dation, etc. Knowledge Discovery refers to a higher- or semi-structured corpus into an output de- level process to help users extract knowledge from scribed using a knowledge representation for- raw data, where a typical pipeline involves selection of malism (such as OWL). Thus Knowledge Ex- data, pre-processing and transformation of data, a Data traction can be seen as Information Extraction Mining phase to extract patterns, and finally evaluation but with a stronger focus on using knowledge and visualization to aid users gain knowledge from the representation techniques to model outputs. In raw data and provide feedback. Some IE techniques 2013, Gangemi [110] provided an introduction may rely on extracting patterns from data, which can and comparison of fourteen tools for Knowledge be seen as a Data Mining step5; however, Information Extraction over unstructured corpora. Extraction need not use Data Mining techniques, and Other related terms such as “Semantic Informa- many Data Mining tasks – such as outlier detection tion Extraction” [108], “Knowledge-Based Informa- – have only a tenuous relation to Information Extrac- tion Extraction” [139], “Knowledge-Graph Comple- tion. A survey of approaches that combine Data Min- tion” [178], and so forth, have also appeared in the ing/Knowledge Discovery with the Semantic Web was literature. 
However, many such titles are used specif- published by Ristoski and Paulheim [261] in 2016. ically within a given community, whereas works in With respect to our survey, both Natural Language the intersection of IE and SW have appeared in many Processing and Data Mining form part of the back- communities. For example, “Knowledge Extraction” ground of our scope, but as discussed, Information Ex- is used predominantly by the SW community and not traction has a rather different focus to both areas, nei- others.6 Hence our survey can be seen as drawing to- ther covering nor being covered by either. gether works in such (sub-)areas under a more general On the other hand, relating more specifically to the scope: works involving IE techniques in a SW setting. intersection of Information Extraction and the Seman- Intended Audience: This survey is written for re- tic Web, we can identify the following (sub-)areas: searchers and practitioners who are already quite fa- Semantic Annotation: aims to annotate documents miliar with the main SW standards and concepts – such with entities, classes, topics or facts, typically as the RDF, RDFS, OWL and SPARQL standards, etc. based on an existing ontology/KB. Some works – but are not necessarily familiar with IE techniques. on Semantic Annotation fall within the scope of Hence we will not introduce SW concepts (such as our survey as they include extraction and linking RDF, OWL, etc.) herein. Otherwise, our goal is to of entities and/or concepts (though not typically make the survey as accessible as possible. For exam- relations). A survey focused on Semantic Anno- ple, in order to make the survey self-contained, in Ap- tation was published by Uren et al. [300] in 2006.
6Here we mean “Knowledge Extraction” in an IE-related context. 5In fact, the title “Information Extraction” pre-dates that of the Other works on generating explanations from neural networks use title “Data Mining” in its modern interpretation. the same term in an unrelated manner. J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web 5 pendixA we provide a detailed primer on some tradi- Scholar for related papers, merging and deduplicating tional NLP and IE processes; the techniques discussed lists of candidate papers (numbering in the thousands in this appendix are, in general, not in the scope of the in total); (II) we initially apply a rough filter for rele- survey, since they do not involve SW resources, but are vance based on the title and type of publication; (III) heavily used by works that fall in scope. We recom- we filter for relevance by abstract; and (IV) finally we mend readers unfamiliar with the IE area to read the filter for relevance by the body of the paper. appendix as a primer prior to proceeding to the main To collect further literature, while reading relevant body of the survey. Knowledge of some core Informa- papers, we also take note of other works referenced in tion Retrieval concepts – such as TF–IDF, PageRank, related works, works that cite more prominent relevant cosine similarity, etc. – and some core Machine Learn- papers, and also check the bibliography of prominent ing concepts – such as logistic regression, SVM, neural authors in the area for other papers that they have writ- networks, etc. – may be necessary to understand finer ten; such works were added in phase III to be later fil- details, but not to understand the main concepts. tered in phase IV. Table2 presents the numbers of pa- pers considered by each phase of the methodology.7 Nomenclature: The area of Information Extraction is associated with a diverse nomenclature that may vary Table 1 in use and connotation from author to author. 
Such Keywords used to search for candidate papers variations may at times be subtle and at other times be E/C/R list keywords relating to Entities, Concepts and Relations; entirely incompatible. Part of this relates to the vari- SW lists keywords relating to the Semantic Web ous areas in which Information Extraction has been ap- Type Keyword set plied and the variety of areas from which it draws in- E "coreference resolution", "entity disambiguation", fluence. We will attempt to use generalized terminol- "entity linking", "entity recognition", ogy and indicate when terminology varies. "entity resolution", "named entity", "semantic annotation" Survey Methodology: Based on the previous discus- C "concept models", "glossary extraction", sion, this survey includes papers that: "group detection", "keyphrase assignment", "keyphrase extraction", "keyphrase recognition", – deal with extraction and/or linking of entities, "keyword assignment", "keyword extraction", concepts and/or relations, "keyword recognition", "latent variable models", "LDA" "LSA", "pLSA", "term extraction", – deal with some Semantic Web standard – namely "term recognition", "terminology mining", RDF, RDFS or OWL – or a resource published or "topic extraction", "topic identification", "topic modeling" otherwise using those standards, – have details published, in English, in a relevant R "OpenIE", "open information extraction", "open knowledge extraction", "relation detection", workshop, conference or journal since 1999, "relation extraction", "semantic relation" – consider extraction from unstructured sources. SW "linked data", "ontology", "OWL", "RDF", "RDFS", For finding in-scope papers, our methodology be- "semantic web", "SPARQL", "web of data" gins with a definition of keyphrases appropriate to the section at hand. 
These keyphrases are divided into We provide further details of our survey online, in- lists of IE-related terms (e.g., “entity extraction”, cluding the lists of papers considered by each phase.8 “entity linking”) and SW-related terms (e.g., We may include out-of-scope papers to the extent “ontology”, “linked data”), where we apply a con- that they serve as important background for the in- junction of their products to create search phrases (e.g., scope papers: for example, it is important for an unini- “entity extraction ontology”). Given the diverse tiated reader to understand some of the core tech- terminology used in different communities, often we niques considered in the traditional Information Ex- need to try many variants of keyphrases to capture traction area and to understand some of the core stan- as many papers as possible. Table1 lists the base dards and resources considered in the core Semantic keyphrases used to search for papers; the final keyword Web area. Furthermore, though not part of the main searches are given by the set (E ∪ C ∪ R)kSW, where “k” denotes concatenation (with a delimiting space). 7Table2 refers to papers considering text as input; a further 20 Our survey methodology consists of four initial papers considering semi-structured inputs are presented later in the phases to search, extract and filter papers. For each de- survey, which will bring the total to 109 selected papers. fined keyphrase, we (I) perform a search on Google 8http://www.tamps.cinvestav.mx/~lmartinez/survey/ 6 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web
Table 2: Number of papers included in the survey (by phase). E/C/R denote counts of highlighted papers in this survey relating to Entities, Concepts, and Relations, resp.; Σ denotes the sum of E + C + R by phase.

Phase                      E      C      R      Σ
Seed collection (I)        2,418  8,666  8,008  19,092
Filter by title (II)       114    167    148    429
Filter by abstract (III)   100    115    102    317
Final list (IV)            25     36     28     89

Survey Structure: The structure of the remainder of this survey is as follows:

Section 2 discusses extraction and linking of entities for unstructured sources.
Section 3 discusses extraction and linking of concepts for unstructured sources.
Section 4 discusses extraction and linking of relations for unstructured sources.
Section 5 discusses techniques adapted specifically for extracting entities, concepts and/or relations from semi-structured sources.
Section 6 concludes the survey with a discussion.

Additionally, Appendix A provides a primer on classical Information Extraction techniques for readers previously unfamiliar with the IE area; we recommend such a reader to review this material before continuing.

2. Entity Extraction & Linking

Entity Extraction & Linking (EEL)⁹ refers to identifying mentions of entities in a text and linking them to a reference KB provided as input.

⁹ We note that naming conventions can vary widely: sometimes Named Entity Linking (NEL) is used; sometimes the acronym (N)ERD is used for (Named) Entity Recognition & Disambiguation; sometimes EEL is used as a synonym for NED; other phrases can also be used, such as Named Entity Extraction (NEE) or Named Entity Resolution, or variations on the idea of semantic annotation or semantic tagging (which we consider applications of EEL).

Entity Extraction can be performed using an off-the-shelf Named Entity Recognition (NER) tool as used in traditional IE scenarios (see Appendix A.1); however, such tools typically extract entities for limited numbers of types, such as persons, organizations, places, etc.; on the other hand, the reference KB may contain entities from hundreds of types. Hence, while some Entity Extraction & Linking tools rely on off-the-shelf NER tools, others define bespoke methods for identifying entity mentions in text, typically using entities' labels in the KB as a dictionary to guide the extraction.

Once entity mentions are extracted from the text, the next phase involves linking – or disambiguating – these mentions by assigning them to KB identifiers; typically each mention in the text is associated with a single KB identifier chosen by the process as the most likely match, or is associated with multiple KB identifiers and an associated weight (aka. support) indicating confidence in the matches, which allows the application to choose which entity links to trust.

Example: In Listing 1, we provide an excerpt of an EEL response given by the online DBpedia Spotlight demo¹⁰ in JSON format. Within the result, the "@URI" attribute is the selected identifier obtained from DBpedia, the "@support" is a degree of confidence in the match, the "@types" list matches classes from the KB, the "@surfaceForm" represents the text of the entity mention, the "@offset" indicates the character position of the mention in the text, the "@similarityScore" indicates the strength of a match with the entity label in the KB, and the "@percentageOfSecondRank" indicates the ratio of the support computed for the first- and second-ranked documents, thus indicating the level of ambiguity. Of course, the exact details of the output of an EEL process will vary from tool to tool, but such a tool will minimally return a KB identifier and the location of the entity mention; a support will also often be returned.

¹⁰ http://dbpedia-spotlight.github.io/demo/

Applications: EEL is used in a variety of applications, such as semantic annotation [41], where entities mentioned in text can be further detailed with reference data from the KB; semantic search [295], where search over textual collections can be enhanced – for example, to disambiguate entities or to find categories of relevant entities – through the structure provided by the KB; question answering [299], where the input text is a user question and the EEL process can identify which entities in the KB the question refers to; focused archival [79], where the goal is to collect and preserve documents relating to particular entities; and detecting emerging entities [136], where entities that do not yet appear in the KB, but may be candidates for adding to the KB, are extracted.¹¹ EEL can also serve as the basis for later IE processes, such as Topic Modeling, Relation Extraction, etc., as discussed later.

¹¹ Emerging entities are also sometimes known as Out-Of-Knowledge-Base (OOKB) entities or Not-In-Lexicon (NIL) entities.
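To make the structure of such an EEL response concrete, the following minimal sketch extracts (mention, offset, KB identifier) triples from a Spotlight-style JSON response, using the field names described above; the embedded snippet is an abridged, hypothetical response in the shape of Listing 1, and the support threshold is our own illustrative choice:

```python
import json

# An abridged response in the shape of Listing 1 (values are illustrative).
response = json.loads("""
{
  "@text": "Bryan Cranston is an American actor.",
  "Resources": [
    {"@URI": "http://dbpedia.org/resource/Bryan_Cranston",
     "@surfaceForm": "Bryan Cranston", "@offset": "0", "@support": "199"},
    {"@URI": "http://dbpedia.org/resource/Actor",
     "@surfaceForm": "actor", "@offset": "30", "@support": "35596"}
  ]
}
""")

def entity_links(resp, min_support=100):
    """Keep only links whose support exceeds a (hypothetical) threshold,
    returning (mention, offset, KB identifier) triples."""
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"])
            for r in resp.get("Resources", [])
            if int(r["@support"]) >= min_support]

for mention, offset, uri in entity_links(response):
    print(f"{mention!r} at {offset} -> {uri}")
```

An application consuming the response could raise `min_support` to trade recall for precision, in line with the discussion of support/confidence above.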
Process: As stated by various authors [62,153,243,246], the EEL process is typically composed of two main steps: recognition, where relevant entity mentions in the text are found; and disambiguation, where entity mentions are mapped to candidate identifiers with a final weighted confidence. Since these steps are (often) loosely coupled, this section surveys the various techniques proposed for the recognition task and thereafter discusses disambiguation.

Listing 1: DBpedia Spotlight EEL example

Input: Bryan Cranston is an American actor. He is known for portraying "Walter White" in the drama series Breaking Bad.

Output:
{
  "@text": "Bryan Cranston is an American actor. He is known for portraying \"Walter White\" in the drama series Breaking Bad.",
  "@confidence": "0.35",
  "@support": "0",
  "@types": "",
  "@sparql": "",
  "@policy": "whitelist",
  "Resources": [
    { "@URI": "http://dbpedia.org/resource/Bryan_Cranston",
      "@support": "199",
      "@types": "DBpedia:Agent,Schema:Person,Http://xmlns.com/foaf/0.1/Person,DBpedia:Person",
      "@surfaceForm": "Bryan Cranston",
      "@offset": "0",
      "@similarityScore": "1.0",
      "@percentageOfSecondRank": "0.0" },
    { "@URI": "http://dbpedia.org/resource/United_States",
      "@support": "560750",
      "@types": "Schema:Place,DBpedia:Place,DBpedia:PopulatedPlace,Schema:Country,DBpedia:Country",
      "@surfaceForm": "American",
      "@offset": "21",
      "@similarityScore": "0.9940788480408",
      "@percentageOfSecondRank": "0.003612999020" },
    { "@URI": "http://dbpedia.org/resource/Actor",
      "@support": "35596",
      "@types": "",
      "@surfaceForm": "actor",
      "@offset": "30",
      "@similarityScore": "0.9999710345342",
      "@percentageOfSecondRank": "2.433621943E-5" },
    { "@URI": "http://dbpedia.org/resource/Walter_White_(Breaking_Bad)",
      "@support": "856",
      "@types": "DBpedia:Agent,Schema:Person,Http://xmlns.com/foaf/0.1/Person,DBpedia:Person,DBpedia:FictionalCharacter",
      "@surfaceForm": "Walter White",
      "@offset": "66",
      "@similarityScore": "0.9999999999753",
      "@percentageOfSecondRank": "2.471061675E-11" },
    { "@URI": "http://dbpedia.org/resource/Drama",
      "@support": "6217",
      "@types": "",
      "@surfaceForm": "drama",
      "@offset": "87",
      "@similarityScore": "0.8446404328140",
      "@percentageOfSecondRank": "0.1565036704039" },
    { "@URI": "http://dbpedia.org/resource/Breaking_Bad",
      "@support": "638",
      "@types": "Schema:CreativeWork,DBpedia:Work,DBpedia:TelevisionShow",
      "@surfaceForm": "Breaking Bad",
      "@offset": "100",
      "@similarityScore": "1.0",
      "@percentageOfSecondRank": "4.6189529850E-23" }
  ]
}

2.1. Recognition

The goal of EEL is to extract and link entity mentions in a text with entity identifiers in a KB; some tools may additionally detect and propose identifiers for emerging entities that are not yet found in the KB [238,247,231]. In both cases, the first step is to mark entity mentions in the text that can be linked (or proposed as an addition) to the KB. Thus traditional NER tools – discussed in Appendix A.1 – can be used. However, in the context of EEL, where a target KB is given as input, there can be key differences between a typical EEL recognition phase and traditional NER:

– In cases where emerging entities are not detected, the KB can provide a full list of target entity labels, which can be stored in a dictionary that is used to find mentions of those entities. While dictionaries can be found in traditional NER scenarios, these often refer to individual tokens that strongly indicate an entity of a given type, such as common first or family names, lists of places and companies, etc. On the other hand, in EEL scenarios, the dictionary can be populated with complete entity labels from the KB for a wider range of types; in scenarios not involving emerging entities, this dictionary will be complete for the entities to recognize. Of course, this can lead to a very large dictionary, depending on the KB used.

– Relating to the previous point, (particularly) in scenarios where a complete dictionary is available, the line between extraction and linking can become blurred, since labels in the dictionary from the KB will often be associated with KB identifiers; hence, dictionary-based detection of entities will also provide initial links to the KB. Such approaches are sometimes known as End-to-End (E2E) approaches [247], where extraction and linking phases become more tightly coupled.
identifiers; hence, dictionary-based detection of entities will also provide initial links to the KB. Such approaches are sometimes known as End-to-End (E2E) approaches [247], where extraction and linking phases become more tightly coupled.

– In traditional NER scenarios, extracted entity mentions are typically associated with a type, usually with respect to a number of trained types such as person, organization, and location. However, in many EEL scenarios, the types are already given by the KB and are in fact often much richer than what traditional NER models support.

In this section, we thus begin by discussing the preparation of a dictionary and methods used for recognizing entities in the context of EEL.

2.1.1. Dictionary

The predominant method for performing EEL relies on using a dictionary – also known as a lexicon or gazetteer – which maps labels of target entities in the KB to their identifiers; for example, a dictionary might map the label "Bryan Cranston" to the DBpedia IRI dbr:Bryan_Cranston. In fact, a single KB entity may have multiple labels (aka. aliases) that map to one identifier, such as "Bryan Cranston", "Bryan Lee Cranston", "Bryan L. Cranston", etc. Furthermore, some labels may be ambiguous, where a single label may map to a set of identifiers; for example, "Boston" may map to dbr:Boston, dbr:Boston_(band), and so forth. Hence a dictionary may map KB labels to identifiers in a many-to-many fashion. Finally, for each KB identifier, a dictionary may contain contextual features to help disambiguate entities in a later stage; for example, context information may tell us that dbr:Boston is typed as dbo:City in the KB, or that known mentions of dbr:Boston in a text often have words like "population" or "metropolitan" nearby.

Thus, with respect to dictionaries, the first important aspect is the selection of entities to consider (or, indeed, the source from which to extract a selection of entities). The second important aspect – particularly given large dictionaries and/or large corpora of text – is the use of optimized indexes that allow for efficient matching of mentions with dictionary labels. The third aspect to consider is the enrichment of each entity in the dictionary with contextual information to improve the precision of matches. We now discuss these three aspects of dictionaries in turn.

Selection of entities: In the context of EEL, an obvious source from which to form the dictionary is the labels of target entities in the KB. In many Information Extraction scenarios, KBs pertaining to general knowledge are employed; the most commonly used are:

DBpedia [170] A KB extracted from Wikipedia and used by ADEL [247], DBpedia Spotlight [199], ExPoSe [238], Kan-Dis [144], NERSO [124], Seznam [94], SDA [42] and THD [82], as well as works by Exner and Nugues [95], Nebhi [226], Giannini et al. [116], amongst others;

Freebase [26] A collaboratively-edited KB – previously hosted by Google but now discontinued in favor of Wikidata [293] – used by JERL [183], Kan-Dis [144], NEMO [68], Neofonie [157], NereL [280], Seznam [94], Tulip [180], as well as works by Zheng et al. [330], amongst others;

Wikidata [309] A collaboratively-edited KB hosted by the Wikimedia Foundation that, although released more recently than other KBs, has been used by HERD [283];

YAGO(2) [138] Another KB extracted from Wikipedia with richer meta-data, used by AIDA [139], AIDA-Light [230], CohELL [121], J-NERD [231], KORE [137] and LINDEN [277], as well as works by Abedini et al. [1], amongst others.

These KBs are tightly coupled with owl:sameAs links establishing KB-level coreference and are also tightly coupled with Wikipedia; this implies that once entities are linked to one such KB, they can be transitively linked to the other KBs mentioned, and vice versa. KBs that are tightly coupled with Wikipedia in this manner are popular choices for EEL since they describe a comprehensive set of entities that cover numerous domains of general interest; furthermore, the text of Wikipedia articles on such entities can form a useful source of contextual information.

On the other hand, many of the entities in these general-interest KBs may be irrelevant for certain application scenarios. Some systems support selecting a subset of entities from the KB to form the dictionary, potentially pertaining to a given domain or a selection of types. For example, DBpedia Spotlight [199] can build a dictionary from the DBpedia entities returned as results for a given SPARQL query. Such a pre-selection of relevant entities can help reduce ambiguity and tailor EEL for a given application.

Conversely, in EEL scenarios targeting niche domains not covered by Wikipedia and its related KBs, custom KBs may be required. For example, for the purposes of supporting multilingual EEL, Babelfy [215] constructs its own KB from a unification of Wikipedia, WordNet, and BabelNet. In the context of Microsoft Research, JERL [183] uses a proprietary KB (Microsoft's Satori) alongside Freebase. Other approaches make minimal assumptions about the KB used, where earlier EEL approaches such as SemTag [80] and KIM [250] only assume that KB entities are associated with labels (in experiments, SemTag [80] uses Stanford's TAP KB, while KIM [250] uses a custom KB called KIMO).

Dictionary matching and indexing: In order to match mentions with the dictionary in an efficient manner – with O(1) or O(log(n)) lookup performance – optimized data structures are required, which depend on the form of matching employed. The need for efficiency is particularly important for some of the KBs previously mentioned, where the number of target entities involved can go into the millions. The size of the input corpora is also an important consideration: while slower (but potentially more accurate) matching algorithms can be tolerated for smaller inputs, such algorithms are impractical for larger input texts.

A major challenge is that desirable matches may not be an exact match, but may rather only be captured by an approximate string-matching algorithm. While one could consider, for example, approximate matching based on regular expressions or edit distances, such measures do not lend themselves naturally to index-based approaches. Instead, for large dictionaries, or large input corpora, it may be necessary to trade recall (i.e., the percentage of correct spots captured) for efficiency by using coarser matching methods. Likewise, it is important to note that KBs such as DBpedia enumerate multiple "alias" labels for entities (extracted from the redirect entries in Wikipedia), which, if included in the dictionary, can help to improve recall while using coarser matching methods.

A popular approach to index the dictionary is to use some variation on a prefix tree (aka. trie), such as used by the Aho–Corasick string-searching algorithm, which can find mentions of an input list of strings within an input text in time linear to the combined size of the inputs and output. The main idea is to represent the dictionary as a prefix tree where nodes refer to letters, and transitions refer to sequences of letters in a dictionary word; further transitions are put from failed matches (dead-ends) to the node representing the longest matching prefix in the dictionary. With the dictionary preloaded into the index, the text can then be streamed through the index to find (prefix) matches. Phrases are typically indexed separately to allow both word-level and phrase-level matching. This algorithm is implemented by GATE [69] and LingPipe [38], with the latter being used by DBpedia Spotlight [199].

The main drawback of tries is that, for the matching process to be performed efficiently, the dictionary index must fit in memory, which may be prohibitive for very large dictionaries. For these reasons, the Lucene/Solr Tagger implements a more general finite state transducer that also reuses suffixes and byte-encodings to reduce space [71]; this index is used by HERD [283] and Tulip [180] to store KB labels.

In other cases, rather than using traditional Information Extraction frameworks, some authors have proposed to implement custom indexing methods. To give some examples, KIM [250] uses a hash-based index over tokens in an entity mention (this implementation was later integrated into GATE: https://gate.ac.uk/sale/tao/splitch13.html); AIDA-Light [230] uses a Locality Sensitive Hashing (LSH) index to find approximate matches in cases where an initial exact-match lookup fails; and so forth.

Of course, the problem of indexing the dictionary is closely related to the problem of inverted indexing in Information Retrieval, where keywords are indexed against the documents that contain them. Such inverted indexes have proven their scalability and efficiency in Web search engines such as Google, Bing, etc., and likewise support simple forms of approximate matching based on, for example, stemming or lemmatization, which pre-normalize document and query keywords. Exploiting this natural link to Information Retrieval, the ADEL [247], AGDISTIS [301], Kan-Dis [144], TagMe [99] and WAT [243] systems use inverted-indexing schemes such as Lucene (http://lucene.apache.org/core/) and Elastic (https://www.elastic.co; note that ElasticSearch is in fact based on Lucene).

To manage the structured data associated with entities, such as identifiers or contextual features, some tools use more relational-style data management systems. For example, AIDA [139] uses the PostgreSQL relational database to retrieve entity candidates, while ADEL [247] and Neofonie [157] use the Couchbase (http://www.couchbase.com) and Redis (https://redis.io/) NoSQL stores, respectively, to manage the labels and meta-data of their dictionaries.
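To make the dictionary-based matching concrete, the following is a minimal word-level prefix-tree (trie) spotter in Python. It is a simplified sketch rather than a faithful Aho–Corasick implementation (no failure links; it simply restarts scanning after each longest match), and the labels and dbr: IRIs in the toy dictionary are illustrative only:

```python
# Toy label dictionary mapping (possibly ambiguous) labels to KB
# identifiers; a real system would load millions of labels from a KB
# such as DBpedia (the labels and dbr: IRIs here are illustrative).
LABELS = {
    "Bryan Cranston": {"dbr:Bryan_Cranston"},
    "Breaking Bad": {"dbr:Breaking_Bad"},
    "Boston": {"dbr:Boston", "dbr:Boston_(band)"},
}

def build_trie(labels):
    """Index labels as a word-level prefix tree: each node maps a word to
    a child node; the "$" key of a node holds the IRIs of a complete label."""
    root = {}
    for label, iris in labels.items():
        node = root
        for word in label.split():
            node = node.setdefault(word, {})
        node["$"] = iris
    return root

def spot(text, trie):
    """Stream tokens through the trie, emitting the longest label match
    starting at each position, together with its candidate IRIs."""
    tokens = text.replace(".", " ").split()
    matches, i = [], 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$" in node:
                best = (i, j, node["$"])
        if best:
            matches.append((" ".join(tokens[best[0]:best[1]]), best[2]))
            i = best[1]
        else:
            i += 1
    return matches

print(spot("Bryan Cranston stars in Breaking Bad.", build_trie(LABELS)))
```

Note that an ambiguous label such as "Boston" above would yield several candidate identifiers, to be resolved later in the disambiguation phase.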
Contextual features: Rather than being a flat map of entity labels to (sets of) KB identifiers, dictionaries often include contextual features to later help disambiguate candidate links. Such contextual features may be categorized as being structured or unstructured.

Structured contextual features are those that can be extracted directly from a structured or semi-structured source. In the context of EEL, such features are often extracted from the reference KB itself. For example, each entity in the dictionary can be associated with the (labels of the) types of that entity, but also perhaps the labels of the properties that are defined for it, or a count of the number of triples it is associated with, or the entities it is related to, or its centrality (and thus "importance") in the graph-structure of the KB, and so forth.

On the other hand, unstructured contextual features are those that must instead be extracted from textual corpora. In most cases, this will involve extracting statistics and patterns from an external reference corpus that potentially has already had its entities labeled (and linked with the KB). Such features may capture patterns in text surrounding the mentions of an entity, entities that are frequently mentioned close together, patterns in the anchor-text of links to a page about that entity, in how many documents a particular entity is mentioned, how many times it tends to be mentioned in a particular document, and so forth; clearly such information will not be available from the KB itself.

A very common choice of text corpora for extracting both structured and unstructured contextual features is Wikipedia, whose use in this setting was – to the best of our knowledge – first proposed by Bunescu and Pasca [33], then later followed by many other subsequent works [67,99,42,255,39,40,243,246]. The widespread use of Wikipedia can be explained by the unique advantages it has for such tasks:

– The text in Wikipedia is primarily factual and available in a variety of languages.

– Wikipedia has broad coverage, with documents about entities in a variety of domains.

– Articles in Wikipedia can be directly linked to the entities they describe in various KBs, including DBpedia, Freebase, Wikidata, YAGO(2), etc.

– Mentions of entities in Wikipedia often provide a link to the article about that entity, thus providing labeled examples of entity mentions and associated examples of anchor text in various contexts.

– Aside from the usual textual features such as term frequencies and co-occurrences, a variety of richer features are available from Wikipedia that may not be available in other textual corpora, including disambiguation pages, redirections of aliases, category information, info-boxes, article edit history, and so forth (information from info-boxes, disambiguation pages, redirects and categories is also represented in a structured format in DBpedia).

We will further discuss how contextual features – stored as part of the dictionary – can be used for disambiguation later in this section.

2.1.2. Spotting

We now assume a dictionary that maps labels (e.g., "Bryan Cranston", "Bryan Lee Cranston", etc.) to a (set of) KB identifier(s) for the entity in question (e.g., dbr:Bryan_Cranston) and potentially some contextual information (e.g., often co-occurs with dbr:Breaking_Bad, anchor text often uses the term "Heisenberg", etc.). In the next step, we identify entity mentions in the input text. We refer to this process as spotting, where we survey key approaches.

Token-based: Given that entity mentions may consist of multiple sequential words – aka. n-grams – the brute-force option would be to send all n-grams in the input text to the dictionary, for n up to, say, the maximum number of words found in a dictionary entry, or a fixed parameter. We refer generically to these n-grams as tokens and to these methods for extracting n-grams as tokenization. Sometimes these methods are referred to as window-based spotting or recognition techniques.

A number of systems use such a form of tokenization. SemTag uses the TAP ontology for seeking entity mentions that match tokens from the input text. In AIDA-Light [230], AGDISTIS [301], Lupedia [203], and NERSO [124], recognition uses sliding windows over the text for varying-length n-grams.

Although relatively straightforward, a fundamental weakness with token-based methods relates to performance: given a large text, the dictionary-lookup implementation will have to be very efficient to deal with the number of tokens a typical such process will generate, many of which will be irrelevant. While some basic features, such as capitalization, can also be taken into account to filter (some) tokens, still, not all mentions may have capitalization, and many irrelevant or incoherent entities can still be retrieved; for example, by decomposing the text "New York City", the second bi-gram may produce York City in England as a candidate, though (probably) irrelevant to the mention. Such entities are known as overlapping entities, where post-processing must be applied (discussed later).
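The brute-force token-based (window-based) spotting just described can be sketched as follows; the labels and IRIs are toy examples, and the sketch deliberately returns overlapping candidates (e.g., "York City" inside "New York City") to illustrate why post-processing is needed:

```python
# Illustrative n-gram ("window-based") spotting: every n-gram up to the
# length of the longest dictionary label is looked up, so overlapping
# candidates are retrieved and must be resolved in a later step.
# The labels and dbr: IRIs here are toy examples.
DICT = {
    "New York City": {"dbr:New_York_City"},
    "York City": {"dbr:York_City"},  # e.g., the football club in England
}

def ngram_spot(text, dictionary):
    """Return (start position, surface form, candidate IRIs) for every
    n-gram of the input text that matches a dictionary label."""
    tokens = text.split()
    max_n = max(len(label.split()) for label in dictionary)
    spots = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in dictionary:
                spots.append((i, gram, dictionary[gram]))
    return spots

print(ngram_spot("I visited New York City", DICT))
```

Both matches are reported; a post-processing step (or the disambiguation phase) must then decide whether to keep only the maximal match or also the nested one.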
POS-based: A natural way to try to improve upon chunks (see Listing9 for an example NP/noun phrase lexical tokenization methods in End-to-End systems is annotation) are particularly relevant. As an example, to try use some initial understanding of the grammati- the THD system [82] uses GATE’s rule-based Java An- cal role of words in the text, where POS-tags are used notation Patterns Engine (JAPE) [69,294], consisting in order to be more selective with respect to what to- of regular-expression–like patterns over sequences of kens are sent to be matched against the dictionary. POS tags; more specifically, to extract entity mentions, A first idea is to use POS-tags to quickly filter in- THD uses the JAPE pattern NNP+, which will cap- dividual words that are likely to be irrelevant, where ture sequences of one-or-more proper nouns. A sim- traditional NLP/IE libraries can be used in a prepro- ilar approach is taken by ExtraLink [34], which uses cessing step. For example, ADEL [247], AIDA [139], SProUT [86]’s XTDL rules – composed of regular- Babelfy [215] and WAT [243] use the Stanford POS- expressions over sequences of tokens typed with POS tagger to focus on extracting entity mentions from tags or dictionaries – to extract entity mentions. words tagged as NNP (proper noun, singular) and As discussed in AppendixA, machine learning NNPS (proper noun, plural). DBpedia Spotlight [199] methods have become increasingly popular in recent rather relies on LingPipe POS-tagging, where verbs, years for parsing and NER. Hoffert et al. [136] propose adjectives, adverbs, and prepositions from the input combining AIDA and YAGO2 with Stanford NER – text are disregarded from the process. using a pre-trained Conditional Random Fields (CRF) On the other hand, entity mentions may involve classifier – to identify emerging entities. 
Likewise, words that are not nouns and may be disregarded by the ADEL [247] and UDFS [76] also use Stanford NER, system; this is particularly common for entity types not while JERL [183] uses a custom unified CRF model usually considered by traditional NER tools, includ- that simultaneously performs extraction and linking. 18 ing titles of creative works like “Breaking Bad”. On the other hand, WAT [243] relies on OpenNLP’s Heuristics such as analysis of capitalization can be NER tool based on a Maximum Entropy model. Go- used in certain cases to prevent filtering useful words; ing one step further, J-NERD [231] uses the depen- however, in other cases where words are not capi- dency parse-tree (extracted using a Stanford parser), talized, the process will likely fail to recognize such where dependencies between nouns are used to create mentions unless further steps are taken. Along those a tree-based model for each sentence, which are then lines, to improve recall, Babelfy [215] first uses a POS- combined into a global model across sentences, which tagger to identify nouns that match substrings of entity in turn is fed into a subsequent approximate inference labels in the dictionary and then checks the surround- process based on Gibbs sampling. ing text of the noun to try to expand the entity mention One limitation of using machine-learning tech- captured (using a maximum window of five words). niques in this manner is that they must be trained on Parser-based: Rather than developing custom meth- a specific corpus. While Stanford NER and OpenNLP ods, one could consider using more traditional NER provide a set of pre-trained models, these tend to techniques to identify entity mentions in the text. Such only cover the traditional NER types of person, or- an approach could also be used, for example, to iden- ganization, location and perhaps one or two more (or tify emerging entities not mentioned in the KB. How- a generic miscellaneous type). 
On the other hand, a ever, while POS-tagging is generally quite efficient, KB such as DBpedia may contain thousands of entity applying a full constituency or dependency parse (aka. types, where off-the-shelf models would only cover deep parsing methods) might be too expensive for a fraction thereof. Custom models can, however, be large texts. On the other hand, recognizing entity men- trained using these frameworks given appropriately la- tions often does not require full parse trees. beled data, where for example ADEL [247] addition- As a trade-off, in traditional NER, shallow-parsing ally trains models to recognize professions, or where methods are often applied: such methods annotate UDFS [76] trains for ten types on a Twitter dataset, an initial grouping – or chunking [190] – of indi- etc. However, richer types require richly-typed labeled vidual words, materializing a shallow tier of the full data to train on. One option is to use sub-class hierar- parse-tree [86,69]. In the context of NER, noun-phrase chies to select higher-level types from the KB to train with [231]. Furthermore, as previously discussed, in 18See Listing8 where “ Breaking” is tagged VGB (verb gerund/- EEL scenarios, the types of entities are often given by past participle) and “Bad” as JJ (adjective). the KB and need not be given by the NER tool: hence, 12 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web other “non-standard” types of entities can be labeled – mentions being pruned as irrelevant to the KB (or “miscellaneous” to train for generic recognition. proposed as emerging entities), On the other hand, a benefit of using parsers based – candidates being pruned as irrelevant to a men- on machine learning is that they can significantly re- tion, and/or duce the amount of lookups required on the dictionary – candidates being assigned a score – called a sup- since, unlike token-based methods, initial entity men- port – for a particular mention. 
tions can be detected independently of the KB dictio- In some systems, phases of pruning and scoring may nary. Likewise, such methods can be used to detect interleave, while in others, scoring is applied first and emerging entities not yet featured in the KB. pruning is applied (strictly) thereafter. A wide variety of approaches to disambiguation can Hybrid: The techniques described previously are be found in the EEL literature. Our goal, in this survey, sometimes complementary, where a number of sys- is thus to organize and discuss the main approaches tems thus apply hybrid approaches combining vari- used thus far. Along these lines, we will first discuss ous such techniques. One such system is ADEL [247], some of the low-level features that can be used to help which uses a mix of three high-level recognition tech- with the disambiguation process. Thereafter we dis- niques: persons, organizations and locations are ex- cuss how these features can be combined to select a fi- tracted using Stanford NER; mentions based on proper nal set of mentions and candidates and/or to compute nouns are extracted using Stanford POS; and more a support for each candidate identifier of a mention. challenging mentions not based on proper nouns are extracted using an (unspecified) dictionary approach; 2.2.1. Features for Disambiguation entity mentions produced by all three approaches are In order to perform disambiguation of candidate KB fed into a unified disambiguation and pruning phase. identifiers for an entity mention, one may consider in- A similar approach is taken by the FOX (Federated formation relating to the mention itself, to the key- knOwledge eXtraction Framework)[287], which uses words surrounding the mention, to the candidates for ensemble learning to combine the results of four NER surrounding mentions, and so forth. 
In fact, a range of tools – namely Stanford NER, Illinois NET, Ottawa features have been proposed in the literature to support BalIE, and OpenNLP – where the resulting entity men- the disambiguation process. To structure the discussion tions are then passed through the AGDISTIS [301] tool of such features, we organize them into the following to subsequently link them to DBpedia. five high-level categories: Mention-based (M): Such features rely on informa- 2.2. Disambiguation tion about the entity mention itself, such as its text, capitalization, the recognition score for the We assume that a list of candidate identifiers has mention, the presence of overlapping mentions, now been retrieved from the KB for each mention or the presence of abbreviated mentions. of interest using the techniques previously described. Keyword-based (K): Such features rely on collecting However, some KB labels in the dictionary may be contextual keywords for candidates and/or men- ambiguous and may refer to multiple candidate iden- tions from reference sources of text (often using tifiers. Likewise, the mentions in the text may not ex- Wikipedia). Keyword-based similarity measures actly match any single label in the dictionary. Thus can then be applied over pairs or sets of contexts. an individual mention may be associated with mul- Graph-based (G): Such features rely on constructing tiple initial candidates from the KB, where a distin- a (weighted) graph representing mentions and/or guishing feature of EEL systems is the disambiguation candidates and then applying analyses over the phase, whose goal is to decide which KB identifiers graph, such as to determine cocitation measures, best match which mentions in the text. To achieve this, dense-subgraphs, distances, or centrality. 
the disambiguation phase will typically involve vari- Category-based (C): Such features rely on categori- ous forms of filtering and scoring of the initial candi- cal information that captures the high-level do- date identifiers, considering both the candidates for in- main of mentions, candidates and/or the input text dividual entity mentions, as well as (collectively) con- itself, where Wikipedia categories are often used. sidering candidates proposed for entity mentions in a Linguistic-based (L): Such features rely on the gram- region of the text. Disambiguation may thus result in: matical role of words, or on the grammatical rela- J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web 13
tion between words or chunks in the text (as pro- ity.19 A natural limitation of such a feature is that it will duced by traditional NLP tools). score different candidates with the same labels with precisely the same score; hence such features are typi- These categories reflect the type of information from cally combined with other disambiguation features. which the features are extracted and will be used to Whenever the recognition phase produces a type for structure this section, allowing us to introduce in- entity mentions independently of the types available creasingly more complex types of sources from which in the KB – as typically happens when a traditional to compute features. However, we can also consider NER tool is used – this NER-based type can be com- an orthogonal conceptualization of features based on pared with the type of each candidate in the KB. Given what they say about mentions or candidates: that relatively few types are produced by NER tools (without using the KB) – where the most widely ac- Mention-only (mo): A feature about the mention in- cepted types are person, organization and location – dependent of other mentions or candidates. these types can be mapped manually to classes in the Mention–mention (mm): A feature between two or KB, where class inference techniques can be applied more mentions independent of their candidates. to also capture candidates that are instances of more Candidate-only (co): A feature about a candidate in- specific classes. We note that both ADEL [247] and J- dependent of other candidates or mentions. NERD [231] incorporate such a feature (both recently Mention–candidate (mc): A feature about the candi- proposed approaches). While this can be a useful fea- date of a mention independent of other mentions. ture for disambiguating some entities, the KB will of- Candidate–candidate (cc): A feature between two or ten contain types not covered by the NER tool (at least more candidates independent of their mentions. 
using off-the-shelf pre-trained models). Various (v): A feature that may involve multiple of The recognition process itself may produce a score the above, or map mentions and/or candidates to for a mention indicating a confidence that it is refer- a higher-level (or latent) feature, such as domain. ring to a (named) entity; this can additionally be used as a feature in the disambiguation phase, where, for We will now discuss these features in more detail in example, a mention for which only weakly-related KB order of the type of information they consider. candidates are found is more likely to be rejected if its recognition score is also low. A simple such feature Mention-based: With respect to disambiguation, im- may capture capitalization, where HERD [283] and portant initial information can be gleaned from the Tulip [180] mark lower-case mentions as “tentative” mentions themselves, both in terms of the text of the in the disambiguation phase, indicating that they need mention, the type selected by the NER tool (where stronger evidence during disambiguation not to be available), and the relation of the mention to other pruned. Another popular feature, called keyphraseness neighboring mentions in a specific region of text. by Mihalcea and Csomai [202], measures the number To begin, the strings of mentions can be used for dis- or ratio of times the mention appears in the anchor text ambiguation. While recognition often relies on match- of a link in a contextual corpus such as Wikipedia; this ing a mention to a dictionary, this process is typically feature is considered by AIDA [139], DBpedia Spot- implemented using various forms of indexes that al- light [199], NERFGUN [125], HERD [283], etc. low for efficiently matching substrings (such as pre- We already mentioned how spotting may result in fixes, suffixes or tokens) or full strings. 
However, once a smaller set of initial candidates has been identified, more fine-grained string-matching can be applied between the respective mention and candidate labels. For example, given a mention "Bryan L. Cranston" and two candidates with labels "Bryan L. Reuss" (as the longest prefix match) and "Bryan Cranston" (as a keyword match), one could apply an edit-distance measure to refine these candidates. Along these lines, for example, ADEL [247] and NERFGUN [125] use Levenshtein edit-distance, while DoSeR [336] and AIDA-Light [230] use a trigram-based Jaccard similarity.19

19 More specifically, each input string is decomposed into a set of 3-character substrings, where the Jaccard coefficient (the cardinality of the intersection over the union) of both sets is computed.

A separate issue arises when overlapping entity mentions are recognized, where, for example, the mention "York City" may overlap with the mention "New York City". A natural approach to resolve such overlaps – used, for example, by ADEL [247], AGDISTIS [301], HERD [283], KORE [137] and Tulip [180] – is to try to expand entity mentions to a maximal possible match. While this seems practical, some of the "nested" entity mentions may be worth keeping. Consider "New York City Police Department"; while this is a maximal (aka. external) entity mention referring to an organization, it may also be valuable to maintain the nested "New York City" mention. As such, in the traditional NER literature, Finkel and Manning [103] argued for Nested Named Entity Recognition, which preserves relevant overlapping entities. However, we are not aware of any work on EEL that directly considers relevant nested entities, though systems such as Babelfy [215] explicitly allow overlapping entities. In certain (probably rare) cases, ambiguous overlaps without a maximal entity may occur, such as for "The third Man Ray exhibition", where "The Third Man" may refer to a popular 1949 movie, while "Man Ray" may refer to an American artist; neither are nested nor external entities. Though there are works on NER that model such (again, probably rare) cases [182], we are not aware of EEL approaches that explicitly consider these cases.

We can further consider abbreviated forms of mentions, where a "complete mention" is used to introduce an entity, which is thereafter referred to using a shorter mention. For example, a text may mention "Jimmy Wales" in the introduction, but in subsequent mentions, the same entity may be referred to as simply "Wales"; clearly, without considering the presence of the longer entity mention, the shorter mention could be erroneously linked to the country. In fact, this is a particular form of coreference, where short mentions, rather than pronouns, are used to refer to an entity in subsequent mentions. A number of approaches – such as those proposed by Cucerzan [67] or Durrett and Klein [89] for linking to Wikipedia, as well as systems such as ADEL [247], AGDISTIS [301], KORE [137], MAG [216] and Seznam [94] linking to RDF KBs – try to map short mentions to longer mentions appearing earlier in the text. On the other hand, counterexamples appear to be quite common, where, for example, a text on Enzo Ferrari may simultaneously use "Ferrari" as a mention for the person and for the car company he founded; automatically disambiguating individual mentions may then prove difficult in such cases. Hence, this feature will often be combined with context features, described in the following.

Keyword-based: A variety of keyword-based techniques from the area of Information Retrieval (IR) are relevant not only to the recognition process, but also to the disambiguation process. While recognition can be done efficiently at large scale using, for example, inverted indexes, relevance measures can be used to help score and rank candidates. A natural idea is to consider a mention as a keyword query posed against a textual document created to describe each KB entity, where IR measures of relevance can be used to score candidates. A typical IR measure used to determine the relevance of a document to a given keyword query is TF–IDF, where the core intuition is to consider documents that contain more mentions (term frequency: TF) of relatively rare keywords (inverse document frequency: IDF) in the keyword query to be more relevant to that query. Another typical measure is cosine similarity, where documents (and keyword queries) are represented as vectors in a normalized numeric space (known as a Vector Space Model (VSM), which may use, for example, numeric TF–IDF values), where the similarity of two vectors can be computed by measuring the cosine of the angle between them.

Systems relying on IR-based measures for disambiguation include: DBpedia Spotlight [199], which defines a variant called TF–ICF, where ICF denotes inverse candidate frequency, considering the ratio of candidates that mention the term; THD [82], which uses the Lucene-based search API of Wikipedia, implementing measures similar to TF–IDF; SDA [42], which builds a textual context for each KB entity from Wikipedia based on article titles, content, anchor text, etc., where candidates are ranked based on cosine similarity; NERFGUN [125], which compares mentions against the abstracts of Wikipedia articles referring to KB entities using cosine similarity;20 etc.

20 The term "abstracts of Wikipedia articles" refers to the first paragraph of the Wikipedia article, which is seen as providing a textual overview of the entity in question [170,125].

Other approaches consider an extended textual context not only for the KB entities, but also for the mentions. For example, considering the input sentence "Santiago frequently experiences strong earthquakes.", although Santiago is an ambiguous label that may refer to (e.g.) several cities, the term "earthquake" will most frequently appear in connection with Santiago de Chile, which can be determined by comparing the keywords in the context of the mention with keywords in the contexts of the candidates. Such an approach is used by SemTag [80], which performs Entity Linking with respect to the TAP KB; however, rather than build a context from an external source like Wikipedia, the system instead extracts a context from the text surrounding human-labeled instances of linked entity mentions in a reference text.
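The TF–IDF and cosine-similarity scoring described above can be sketched as follows; the tokenised contexts for the mention and its two candidates are invented purely for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector for each tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Mention context vs. hypothetical textual contexts for two candidates:
docs = [
    "santiago frequently experiences strong earthquakes".split(),  # mention
    "santiago chile capital earthquakes andes".split(),            # Santiago de Chile
    "santiago cuba caribbean music".split(),                       # Santiago de Cuba
]
vec = tfidf_vectors(docs)
scores = {name: cosine(vec[0], vec[i])
          for i, name in [(1, "Santiago de Chile"), (2, "Santiago de Cuba")]}
```

Here the shared term "earthquakes" (which is rare, so carries a high IDF weight) pulls the mention context towards the first candidate, while the term "santiago" appears in every document and so contributes no discriminating weight. In the surveyed systems, the candidate documents would instead be built from sources such as Wikipedia abstracts or anchor texts.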
Other more modern approaches adopt a similar distributional approach – where words are considered similar by merit of appearing frequently in similar contexts – but using more modern techniques. Amongst these, CohEEL [121] builds a statistical language model for each KB entity according to the frequency of terms appearing in its associated Wikipedia article; this model is used during disambiguation to estimate the probability of the observed keywords surrounding a mention being generated if the mention referred to a particular KB entity. A related approach is used in the DoSeR system [336], where word embeddings are used for disambiguation: in such an approach, words are represented as vectors in a fixed-dimensional numeric space where words that often co-occur with similar words will have similar vectors, allowing, for example, to predict words according to their context; the DoSeR system then computes word embeddings for KB entities using known entity links to model the context in which those entities are mentioned in the text, which can subsequently be used to predict further mentions of such entities based on the mention's context.

Another related approach is to consider collective assignment: rather than disambiguating one mention at a time by considering mention–candidate similarity, the selection of a candidate for one mention can affect the scoring of candidates for another mention. For example, considering the sentence "Santiago is the second largest city of Cuba", even though Santiago de Chile has the highest prior probability of being the entity referred to by "Santiago" (being a larger city mentioned more often), one may find that Santiago de Cuba has a stronger relation to Cuba (mentioned nearby) than all other candidates for the mention "Santiago" – or, in other words, Santiago de Cuba is the most coherent with Cuba – which can override the higher prior probability of Santiago de Chile. While this is similar to the aforementioned distributional approaches, a distinguishing feature of collective assignment is to consider not only surrounding keywords, but also candidates for surrounding entity mentions. A seminal such approach – for EEL with respect to Wikipedia – was proposed by Cucerzan [67], where a cosine-similarity measure is applied not only between the contexts of mentions and their associated candidates, but also between candidates for neighboring entity mentions; disambiguation then attempts to simultaneously maximize the similarity of mentions to candidates as well as the similarity amongst the candidates chosen for other nearby entity mentions.

This idea of collective assignment would become influential in later works linking entities to RDF-based KBs. For example, the KORE [137] system extended AIDA [139] with a measure called keyphrase overlap relatedness,21 where mentions and candidates are associated with a keyword context, and where the relatedness of two contexts is based on the Jaccard similarity of their sets of keywords; this measure is then used to perform a collective assignment. To avoid computing pair-wise similarity over potentially large sets of candidates, the authors propose to use locality-sensitive hashing, where the idea is to hash contexts into a space such that similar contexts will be hashed into the same region (aka. bucket), allowing relatedness to be computed for the mentions and candidates in each bucket. Collective assignment based on comparing the textual contexts of candidates would become popular in many subsequent systems, including AIDA-Light [230], JERL [183], J-NERD [231], Kan-Dis [144], and so forth. Collective assignment is also the key principle underlying many of the graph-based techniques discussed in the following.

21 ... not to be confused with overlapping mentions.

Graph-based: During disambiguation, useful information can be gained from the graph of connections between entities in a contextual source such as Wikipedia, or in the target KB itself. First, graphs can be used to determine the prior probability of a particular entity; for example, considering the sentence "Santiago is named after St. James.", the context does not directly help to disambiguate the entity, but applying link analysis, it could be determined that (with respect to a given reference corpus) the candidate entity most commonly spoken about using the mention "Santiago" is Santiago de Chile. Second, as per the previous example for Cucerzan's [67] keyword-based disambiguation – "Santiago is the second largest city of Cuba" – one may find that a collective assignment can override the higher prior probability of an isolated candidate to instead maintain a high relatedness – or coherence – of candidates selected in a particular part of text, where similarity graphs can be used to determine the coherence of candidates.

A variety of entity disambiguation approaches rely on the graph structure of Wikipedia, where a seminal approach was proposed by Medelyan et al. [197] and later refined by Milne and Witten [208]. The graph structure of Wikipedia is used to perform disambiguation based on two main concepts: commonness and relatedness.
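The bucketing idea behind locality-sensitive hashing can be illustrated with a minimal MinHash-with-banding sketch; this is a generic toy scheme, not KORE's actual implementation, and the keyword contexts are hypothetical:

```python
import hashlib

def _h(seed, word):
    """Deterministic integer hash of a (seed, word) pair."""
    return int(hashlib.md5(f"{seed}:{word}".encode()).hexdigest(), 16)

def minhash_signature(keywords, num_hashes=8):
    """One min-hash per seeded hash function: the probability that two sets
    agree on any signature component equals their Jaccard similarity."""
    return tuple(min(_h(seed, w) for w in keywords) for seed in range(num_hashes))

def lsh_buckets(contexts, bands=4):
    """Split each signature into bands; contexts colliding on any band share a
    bucket, so pairwise relatedness is only computed within buckets."""
    buckets = {}
    for name, keywords in contexts.items():
        sig = minhash_signature(keywords)
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(name)
    return buckets

# Hypothetical keyword contexts: the first two are identical, so they are
# guaranteed to collide in every band; the third is unrelated.
contexts = {
    "cand_a": {"city", "chile", "earthquake", "andes"},
    "cand_b": {"city", "chile", "earthquake", "andes"},
    "cand_c": {"neuroscience", "histology", "nobel"},
}
buckets = lsh_buckets(contexts)
```

Highly similar contexts agree on most min-hash components and thus tend to collide on at least one band, whereas dissimilar contexts rarely do, which is what lets a system avoid the quadratic pairwise comparison.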
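The link-analysis prior just described – estimating how often a mention text is used to refer to each candidate – can be sketched by counting anchor links in a reference corpus; the anchor statistics below are invented for illustration:

```python
from collections import Counter

# Hypothetical (mention text, target entity) anchor pairs, as might be
# harvested from the hyperlinks of a reference corpus such as Wikipedia.
anchors = (
    [("Santiago", "Santiago_de_Chile")] * 7
    + [("Santiago", "Santiago_de_Cuba")] * 2
    + [("Santiago", "Santiago_de_Compostela")]
)

def prior(mention):
    """P(entity | mention): the fraction of anchors bearing this text
    that link to each candidate entity."""
    counts = Counter(entity for text, entity in anchors if text == mention)
    total = sum(counts.values())
    return {entity: count / total for entity, count in counts.items()}
```

With these toy counts, `prior("Santiago")` assigns 0.7 to Santiago de Chile, mirroring the observation that it is the candidate most commonly referred to by that mention in the reference corpus.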
Commonness is measured as the (prior) probability that a given entity mention is used in anchor text to point to the Wikipedia article about a given candidate entity; as an example, one could consider that the plurality of anchor texts in Wikipedia containing the (ambiguous) mention "Santiago" would link to the article on Santiago de Chile; thus this entity has a higher commonness than other candidates. On the other hand, relatedness is a cocitation measure of coherence based on how many articles in Wikipedia link to the articles of both candidates: how many inlinking documents they share relative to their total inlinks. Thereafter, unambiguous candidates help to disambiguate ambiguous candidates for neighboring entity mentions based on relatedness, which is weighted against commonness to compute a support for all candidates; Medelyan et al. [197] argue that the relative balance between relatedness and commonness depends on the context, where, for example, if "Cuba" is mentioned close to "Santiago", and "Cuba" has high commonness and low ambiguity, then this should override the commonness of Santiago de Chile, since the context clearly relates to Cuba, not Chile.

Further approaches then built upon and refined Milne and Witten's notions of commonness and relatedness. For example, Kulkarni et al. [167] propose a collective assignment method based on two types of score: a compatibility score defined between a mention and a candidate, computed using a selection of standard keyword-based approaches; and Milne and Witten's notion of relatedness defined between pairs of candidates. The goal then is to find the selection of candidates (one per mention) that maximizes the sum of the compatibility scores and all pairwise relatedness scores amongst selected candidates. While this optimization problem is NP-hard, the authors propose to use approximations based on integer linear programming and hill-climbing algorithms.

Another approach using the notions of commonness and relatedness is that of TagMe [99]; however, rather than relying on the relatedness of unambiguous entities to disambiguate a context, TagMe instead proposes a more complex voting scheme, where the candidates for each entity can vote for the candidates of surrounding entities based on relatedness; candidates with higher commonness have stronger votes. Candidates with a commonness below a fixed threshold are pruned, where two algorithms are then used to decide final candidates: Disambiguation by Classifier (DC), which uses commonness and relatedness as features to classify correct candidates; and Disambiguation by Threshold (DT), which selects the top-ε candidates by relatedness and then chooses the remaining candidate with the highest commonness (experimentally, the authors deem ε = 0.3 to offer the best results).

While the aforementioned tools link entity mentions to Wikipedia, other approaches linking to RDF-based KBs have followed adaptations of such ideas. One such tool is AIDA [139], which performs two main steps: collective mapping and graph reduction. In the collective mapping step, the tool creates a weighted graph that includes mentions and initial candidates as nodes: first, mentions are connected to their candidates by a weighted edge denoting their similarity as determined from a keyword-based disambiguation approach; second, entity candidates are connected by a weighted edge denoting their relatedness based on (1) the same notion of relatedness introduced by Milne and Witten [208], combined with (2) the distance between the two entities in the YAGO KB. The resulting graph is referred to as the mention–entity graph, whose edges are weighted in a similar manner to the measures considered by Kulkarni et al. [167]. In the subsequent graph reduction phase, the candidate nodes with the lowest weighted degree in this graph are pruned iteratively while preserving at least one candidate entity for each mention, resulting in an approximation of the densest possible (disambiguated) subgraph.

The concept of computing a dense subgraph of the mention–entity graph was reused in later systems. For example, the AIDA-Light [230] system (a variant of AIDA with a focus on efficiency) uses keyword-based features to determine the weights on mention–entity and entity–entity edges in the mention–entity graph, from which a subgraph is then computed. As another variant on the dense-subgraph idea, Babelfy [215] constructs a mention–entity graph, but where edges between entity candidates are assigned based on semantic signatures computed using the Random Walk with Restart algorithm over a weighted version of a custom semantic network (BabelNet); thereafter, an approximation of the densest subgraph is extracted by iteratively removing the least coherent vertices – considering the fraction of mentions connected to a candidate and its degree – until the number of candidates for each mention is below a specified threshold.

Rather than trying to compute a dense subgraph of the mention–entity graph, other approaches instead use standard centrality measures to score nodes in various forms of graph induced by the candidate entities. NERSO [124] constructs a directed graph of entity candidates retrieved from DBpedia based on the links between their articles on Wikipedia; over this graph, the system applies a variant of a closeness centrality measure, which, for a given node, is defined as the inverse of the average length of the shortest path to all other reachable nodes; for each mention, the centrality, degree and type of node is then combined into a final support for each candidate. On the other hand, the WAT system [243] extends TagMe [99] with various features, including a score based on the PageRank22 of nodes in the mention–entity graph, which loosely acts as a context-specific version of the commonness feature. ADEL [247] likewise considers a feature based on the PageRank of entities in the DBpedia KB, while HERD [283], DoSeR [336] and Seznam [94] use PageRank over variants of a mention–entity graph. Using another popular centrality-based measure, AGDISTIS [301] first creates a graph by expanding the neighborhood of the nodes corresponding to candidates in the KB up to a fixed width; the approach then applies Kleinberg's HITS algorithm, using the authority score to select disambiguated entities.

22 PageRank is itself a variant of eigenvector centrality, which can be conceptualized as the probability of being at a node after an arbitrarily long random walk starting from a random node.

In a variant of the centrality theme, Kan-Dis [144] uses two graph-based measures. The first measure is a baseline variant of Katz's centrality applied over the candidates in the KB's graph [237], where a parametrized sum over the k shortest paths between two nodes is taken as a measure of their relatedness, such that two nodes are more similar the shorter the k shortest paths between them are. The second measure is then a weighted version of the baseline, where edges on paths are weighted based on the number of similar edges from each node, such that, for example, a path between two nodes through a country via the relation "resident" will have less effect on the overall relatedness of those nodes than a more "exclusive" path through a music band via the relation "member".

Other systems apply variations on this theme of graph-based disambiguation. KIM [250] selects the candidate related to the most previously-selected candidates by some relation in the KB; DoSeR [336] likewise considers entities as related if they are directly connected in the KB and considers the degree of nodes in the KB as a measure of commonness; and so forth.

Category-based: Rather than trying to measure the coherence of pairs of candidates through keyword contexts, cocitations, or their distance in the KB, some works propose to map candidates to higher-level category information and use such categories to determine the coherence of candidates. Most often, the category information from Wikipedia is used.

The earliest approaches to use such category information were those linking mentions to Wikipedia identifiers. For example, in cases where the keyword-based contexts of candidates contained insufficient information to derive reliable similarity measures, Bunescu and Pasca [33] propose to additionally use terms from the article categories to extend these contexts and learn correlations between keywords appearing in the mention context and categories found in the candidate context. A similar such idea – using category information from Wikipedia to enrich the contexts of candidates – was also used by Cucerzan [67].

A number of approaches also use categorical information to link entities to RDF KBs. An early such proposal was the LINDEN [277] approach, which was based on constructing a graph containing nodes representing candidates in the KB, their contexts, and their categories; edges are then added connecting candidates to their contexts and categories, while categories are connected by their taxonomic relations. Contextual and categorical information was taken from Wikipedia. A cocitation-based notion of candidate–candidate relatedness similar to that of Medelyan et al. [197] is then combined with another candidate–candidate relatedness measure based on the probability of an entity in the KB falling under the most-specific shared ancestor of the categories of both entities.

As previously discussed, AIDA-Light [230] determines mention–candidate and candidate–candidate similarities using a keyword-based approach, where the similarities are used to construct a weighted mention–entity graph; this graph is also enhanced with categorical information from YAGO (itself derived from Wikipedia and WordNet), where category nodes are added to the graph and connected to the candidates in those categories; additionally, weighted edges between candidates can be computed based on their distance in the categorical hierarchy. J-NERD [231] likewise uses similar features based on latent topics computed from Wikipedia's categories.

Linguistic-based: Some more recent approaches propose to apply joint inference to combine disambiguation with other forms of linguistic analysis. Conceptually the idea is similar to that of using keyword contexts, but with a deeper analysis that also considers further linguistic information about the terms forming the context of a mention or a candidate.

We have already seen examples of how the recognition task can sometimes gain useful information from the disambiguation task. For example, in the sentence "Nurse Ratched is a character in Ken Kesey's novel One Flew over the Cuckoo's Nest", the latter mention – "One Flew over the Cuckoo's Nest" – is a challenging example for recognition due to its length, broken capitalization, use of non-noun terms, and so forth; however, once disambiguated, the related entities could help to recognize the right boundaries for the mention. As another example, in the sentence "Bill de Blasio is the mayor of New York City", disambiguating the latter entity may help recognize the former and vice versa (e.g., avoiding demarcating "Bill" or "York City" as mentions).

Recognizing this interdependence of recognition and disambiguation, one of the first approaches proposed to perform these tasks jointly was NereL [280], which applies a first high-recall NER pass that both underestimates and overestimates (potentially overlapping) mention boundaries, where features of these candidate mentions are combined with features for the candidate identifiers for the purposes of a joint inference step. A more complex unified model was later proposed by Durrett and Klein [89], which captured features not only for recognition (POS tags, capitalization, etc.) and disambiguation (string matching, PageRank, etc.), but also for coreference (type of mention, mention length, context, etc.), over which joint inference is applied. JERL [183] also uses a unified model for representing the NER and NED tasks, where word-level features (such as POS tags, dictionary hits, etc.) are combined with disambiguation features (such as commonness, coherence, categories, etc.), subsequently allowing for joint inference over both. J-NERD [231] likewise uses features based on Stanford's POS tagger and dependency parser, dictionary hits, coherence, categories, etc., to represent recognition and disambiguation in a unified model for joint inference.

Aside from joint recognition and disambiguation, other types of unified models have also been proposed. Babelfy [215] applies a joint approach to model and perform Named Entity Disambiguation and Word Sense Disambiguation in a unified manner. As an example, in the sentence "Boston is a rock group", the word "rock" can have various senses, where knowing that in this context it is used in the sense of a music genre will help disambiguate "Boston" as referring to the music group and not the city; on the other hand, disambiguating the entity "Boston" can help disambiguate the word sense of "rock", and thus we have an interdependence between the two tasks. Babelfy thus combines candidates for both word senses and entity mentions into a single semantic interpretation graph, from which (as previously mentioned) a dense (and thus coherent) subgraph is extracted. Another approach applying joint Named Entity Disambiguation and Word Sense Disambiguation is Kan-Dis [144], where nouns in the text are extracted and their senses modeled as a graph – weighted by the notion of semantic relatedness described previously – from which a dense subgraph is extracted.

Summary of features: Given the breadth of features covered, we provide a short recap of the main features for reference:

Mention-based: Given the initial set of mentions identified and the labels of their corresponding candidates, we can consider:
– A mention-only feature produced by the NER tool to indicate the confidence in a particular mention;
– Mention–candidate features based on the string similarity between mention and candidate labels, or matches between mention (NER) and candidate (KB) types;
– Mention–mention features based on overlapping mentions, or the use of abbreviated references from a previous mention.

Keyword-based: Considering various types of textual contexts extracted for mentions (e.g., varying-length windows of keywords surrounding the mention) and candidates (e.g., Wikipedia anchor texts, article texts, etc.), we can compute:
– Mention–candidate features considering various keyword-based similarity measures over their contexts (e.g., TF–IDF with cosine similarity, Jaccard similarity, word embeddings, and so forth);
– Candidate–candidate features based on the same types of similarity measures over candidate contexts.

Graph-based: Considering the graph structure of a reference source such as Wikipedia, or the target KB, we can consider:
– Candidate-only features, such as prior probability based on centrality, etc.;
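Kulkarni et al.'s objective – maximising the sum of mention–candidate compatibility scores plus pairwise relatedness of the chosen candidates – can be illustrated by exhaustive search over a toy instance; all scores are invented, and real systems approximate the NP-hard problem with ILP or hill climbing, as noted above:

```python
from itertools import product

def best_assignment(candidates, compat, related):
    """Exhaustively maximise compatibility plus pairwise relatedness
    (toy-sized inputs only; the general problem is NP-hard)."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for choice in product(*(candidates[m] for m in mentions)):
        score = sum(compat[m, e] for m, e in zip(mentions, choice))
        for i in range(len(choice)):
            for j in range(i + 1, len(choice)):
                score += related.get(frozenset({choice[i], choice[j]}), 0.0)
        if score > best_score:
            best, best_score = dict(zip(mentions, choice)), score
    return best

candidates = {"Santiago": ["Santiago_de_Chile", "Santiago_de_Cuba"],
              "Cuba": ["Cuba_(country)"]}
compat = {("Santiago", "Santiago_de_Chile"): 0.6,   # higher prior
          ("Santiago", "Santiago_de_Cuba"): 0.4,
          ("Cuba", "Cuba_(country)"): 0.9}
related = {frozenset({"Santiago_de_Cuba", "Cuba_(country)"}): 0.5,
           frozenset({"Santiago_de_Chile", "Cuba_(country)"}): 0.1}
assignment = best_assignment(candidates, compat, related)
```

With these numbers, the high relatedness between Santiago de Cuba and Cuba (0.4 + 0.9 + 0.5 = 1.8) overrides the higher prior of Santiago de Chile (0.6 + 0.9 + 0.1 = 1.6), mirroring the "Santiago is the second largest city of Cuba" example above.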
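The PageRank feature used by systems such as WAT, ADEL, HERD, DoSeR and Seznam can be sketched with plain power iteration; the candidate link graph below is hypothetical:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list directed graph."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in graph}
        for node, out in graph.items():
            targets = out if out else list(graph)  # dangling node: jump anywhere
            share = damping * rank[node] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Hypothetical link graph between candidate entities:
graph = {"Santiago_de_Chile": ["Chile"],
         "Chile": ["Santiago_de_Chile"],
         "Santiago_de_Cuba": ["Chile"]}
ranks = pagerank(graph)
```

Nodes that receive many (well-ranked) inlinks accumulate a higher score, matching the footnote's characterisation of PageRank as the probability of landing on a node after an arbitrarily long random walk.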
– Mention–candidate features, based on how many links use the mention's text to link to a document about the candidate;
– Candidate–candidate coherence features, such as cocitation, distance, density of subgraphs, topical coherence, etc.

Category-based: Considering the graph structure of a reference source such as Wikipedia, or the target KB, we can consider:
– Candidate–category features based on membership of the candidate to the category;
– Text–category coherence features based on categories of candidates;
– Candidate–candidate features based on taxonomic similarity of associated categories.

Linguistic-based: Considering POS tags, word senses, coreferences, parse trees of the input text, etc., we can consider:
– Mention-only features based on POS or other NER features;
– Mention–mention features based on dependency analysis, or the coherence of candidates associated with them;
– Mention–candidate features based on coherence of sense-aware contexts;
– Candidate–candidate features based on connection through semantic networks.

This list of useful features for disambiguation is by no means complete and has continuously expanded as further Entity Linking papers have been published. Furthermore, EEL systems may use features not covered here, typically exploiting specific information available in a particular KB, a particular reference source, or a particular input source. As some brief examples, we can mention that NEMO [68] uses geo-coordinate information extracted from Freebase to determine a geographical coherence over candidates, Yerva et al. [320] consider features computed from user profiles on Twitter and other social networks, ZenCrowd [75] considers features drawn from crowdsourcing, etc.

2.2.2. Scoring and pruning
As we have seen, a wide range of features have been proposed for the purposes of the disambiguation task. A general question then is: how can such features be weighted and combined into a final selection of candidates, or a final support for each candidate?

The most straightforward option is to consider a high-level feature used to score candidates (potentially using other features on a lower level), where, for example, AGDISTIS [301] relies on final HITS authority scores, DBpedia Spotlight [199] on TF–ICF scores, NERSO [124] on closeness centrality and degree, THD [82] on Wikipedia search rankings, etc.

Another option is to parameterize weights or thresholds for features and find the best values for them individually over a labeled dataset, which is used, for example, by Babelfy [215] to tune the parameters of its Random-Walk-with-Restart algorithm and the number of candidates to be pruned by its densest-subgraph approximation, or by AIDA [139] to configure thresholds and weights for prior probabilities and coherence.

An alternative method is to allow users to configure such parameters themselves, where AIDA [139] and DBpedia Spotlight [199] offer users the ability to configure parameters and thresholds for prior probabilities, coherence measures, the tolerable level of ambiguity, and so forth. In this manner, a human expert can configure the system for a particular application, for example, tuning to trade precision for recall, or vice versa.

Yet another option is to define a general objective function that then turns the problem of selecting the best candidates into an optimization problem, allowing the final candidate assignment to be (approximately) inferred. One such method is Kulkarni et al.'s [167] collective assignment approach, which uses integer linear programming and hill-climbing methods to compute a candidate assignment that (approximately) maximizes mention–candidate and candidate–candidate similarity weights. Another such method is JERL [183], which models entity recognition and disambiguation in a joint model over which dynamic programming methods are applied to infer final candidates. Systems optimizing for dense entity–mention subgraphs – such as AIDA [139], Babelfy [215] or Kan-Dis [143] – follow similar techniques.

A final approach is to use classifiers to learn appropriate weights and parameters for different features based on labeled data. Amongst such approaches, we can mention that ADEL [247] performs experiments with k-NN, Random Forest, Naive Bayes and SVM classifiers, finding k-NN to perform best; AIDA [139], LINDEN [277] and WAT [243] use SVM variants to learn feature weights; HERD [283] uses logistic regression to assign weights to features; and so forth. All such methods rely on labeled data to train the classifiers; we will discuss such datasets later when discussing the evaluation of EEL systems.

Such methods for scoring and classifying results can be used to compute a final set of mentions and their candidates, either selecting a single candidate for each mention or associating multiple candidates with a support by which they can be ranked.

2.3. System summary and comparison

Table 3 provides an overview of how the EEL techniques discussed in this section are used by highlighted systems that: deal with a resource (e.g., a KB) using one of the Semantic Web standards; deal with EEL over plain text; have a publication offering system details; and are standalone systems. Based on these criteria, we exclude systems discussed previously that deal only with Wikipedia (since they do not directly relate to the Semantic Web). In this table, Year indicates the year of publication, KB denotes the primary Knowledge Base used for evaluation, Matching corresponds to the manner in which raw mentions are detected and matched with KB entities (Keyword refers to keyword search, Substring refers to methods such as prefix matching, LSH refers to Locality Sensitive Hashing), Indexing refers to the manner in which KB metadata are indexed, Context refers to external sources used to enrich the description of entities, Recognition indicates how raw mentions are extracted from the text, and Disambiguation indicates the high-level strategies used to pair each mention with its most suitable KB identifier (if any). Regarding the latter dimension, Table 4 further details the features used by each system based on the categorization given in Section 2.2.1.

With respect to the EEL task, given the breadth of approaches now available, a challenging question is then: which EEL approach should I choose for application X? Different options are associated with different strengths and weaknesses, where we can highlight the following key considerations in terms of application requirements:

– KB selection: While some tools are general and accept or can be easily adapted to work with arbitrary KBs, other tools are more tightly coupled with a particular KB, relying on features inherent to that KB or a contextual source such as Wikipedia. Hence the selection of a particular target KB may already suggest the suitability of some tools over others. For example, ADEL and DBpedia Spotlight rely on the structure provided by DBpedia; AIDA and KORE on YAGO2; while ExtraLink, KIM, and SemTag are focused on custom ontologies.

– Domain selection: When working within a specific topical domain, the amount of entities to consider will often be limited. However, certain domains may involve types of entity mentions that are atypical; for example, while types such as persons, organizations and locations are well recognized, considering the medical domain as an example, diseases or (branded) drugs may not be well recognized and may require special training or configuration. Examples of domain-specific EEL approaches include Sieve [87] (using the SNOMED-CT ontology), and that proposed by Zheng et al. [329] (based on a KB constructed from BioPortal ontologies23).

23 https://bioportal.bioontology.org

– Text characteristics: Aside from the domain (be it specific or open), the nature of the text input can better suit one type of system over another. For example, even considering a fixed medical domain, Tweets mentioning illnesses will offer unique EEL challenges (short context, slang, lax capitalization, etc.) versus news articles, webpages or encyclopedic articles about diseases, where again, certain tools may be better suited for certain input text characteristics. For example, TagMe [99] focuses on EEL over short texts, while approaches such as UDFS [76] and those proposed by Yerva et al. [320] and Yosef et al. [322] focus more specifically on Tweets.

– Language: Language can be an important factor in the selection of an EEL system, where certain tools may rely on resources (stemmers, lemmatizers, POS taggers, parsers, training datasets, etc.) that assume a particular language. Likewise, tools that do not use any language-specific resources may still rely to varying extents on features (such as capitalization, distinctive proper nouns, etc.) that will be present to varying extents in different languages. While many EEL tools are designed or evaluated primarily around the English language, others offer explicit support for multiple languages [268]; amongst these multilingual systems, we can mention Babelfy [215], DBpedia Spotlight [72] and MAG [216].

– Emerging entities: As data change over time, new entities are constantly generated. An application may thus need to detect emerging entities, which is only supported by some approaches; for example, approaches by Hoffart et al. [136] and Guo et
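The strategy of parameterising feature weights and tuning them over a labeled dataset can be sketched as a small grid search; the feature names, values and grid below are all hypothetical:

```python
from itertools import product

def score(features, weights):
    """Final support for a candidate: a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

def tune(labeled, grid):
    """Grid-search the weight combination that ranks the gold candidate
    first for the most labeled mentions."""
    def correct(weights):
        return sum(max(cands, key=lambda c: score(cands[c], weights)) == gold
                   for cands, gold in labeled)
    settings = [dict(zip(grid, values)) for values in product(*grid.values())]
    return max(settings, key=correct)

# One labeled mention: the gold entity has a low prior but high coherence,
# so tuning must not weight the prior feature too heavily.
labeled = [({"Santiago_de_Chile": {"prior": 0.9, "coherence": 0.1},
             "Santiago_de_Cuba": {"prior": 0.3, "coherence": 0.9}},
            "Santiago_de_Cuba")]
weights = tune(labeled, {"prior": [0.2, 0.8], "coherence": [0.2, 0.8]})
```

In practice the grid would cover many mentions and parameters, and classifiers (SVMs, logistic regression, etc.), as used by the systems cited below, generalise this idea by learning the weights directly.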
Table 3
Overview of Entity Extraction & Linking systems. KB denotes the main knowledge-base used; Matching and Indexing refer to methods used to match/index entity labels from the KB; Context refers to the sources of contextual information used; Recognition refers to the process for identifying entity mentions; Disambiguation refers to the types of high-level disambiguation features used (M: Mention, K: Keyword, G: Graph, C: Category, L: Linguistic); '—' denotes no information found, not used or not applicable.

System | Year | KB | Matching | Indexing | Context | Recognition | Disambiguation
ADEL [247] | 2016 | DBpedia | Keyword | Elastic, Couchbase | Wikipedia | Stanford POS, Stanford NER, Tokens | M,G
AGDISTIS [301] | 2014 | Any | Keyword | Lucene | Wikipedia | Tokens | M,G
AIDA [139] | 2011 | YAGO2 | Keyword | Postgres | Wikipedia | Stanford NER | M,K,G
AIDA-Light [230] | 2014 | YAGO2 | Keyword, LSH | Dictionary, LSH | Wikipedia | Tokens | M,K,G,C
Babelfy [215] | 2014 | WordNet, Wikipedia, BabelNet | Substring | — | Wikipedia | Stanford POS | M,G,L
CohEEL [121] | 2016 | YAGO2 | Keywords | — | Wikipedia | Stanford NER | K,G
DBpedia Spotlight [199] | 2011 | DBpedia | Substring | Aho–Corasick | Wikipedia | LingPipe POS | M,K
DoSeR [336] | 2016 | DBpedia, YAGO3 | Exact, Keywords | Custom | Wikipedia | — | M,G
ExPoSe [238] | 2014 | DBpedia | Substring | Aho–Corasick | Wikipedia | LingPipe POS | M,K
ExtraLink [34] | 2003 | Custom (Tourism) | — | — | — | SProUT XTDL | [Manual]
GianniniCDS [116] | 2015 | DBpedia | Substring | SPARQL | Wikipedia | — | C
JERL [183] | 2015 | Freebase, Satori | — | — | Wikipedia | Hybrid (CRF) | K,G,C,L
J-NERD [231] | 2016 | YAGO2 | Keyword | Dictionary | Wikipedia | Hybrid (CRF) | M,K,C,L
Kan-Dis [144] | 2015 | DBpedia, Freebase | Keyword | Lucene | Wikipedia | Stanford NER | K,G,L
KIM [250] | 2004 | KIMO | Keyword | Hashmap | — | GATE JAPE | G
KORE [137] | 2012 | YAGO2 | Keyword, LSH | Postgres | Wikipedia | Stanford NER | M,K,G
LINDEN [277] | 2012 | YAGO1 | — | — | Wikipedia | — | C
MAG [216] | 2017 | Any | Keyword, Substring | Lucene | Wikipedia | Stanford POS | M,G
NereL [280] | 2013 | Freebase | Keyword | Freebase API | Wikipedia | UIUC NER, Illinois Chunker, Tokens | M,K,G,C,L
NERFGUN [125] | 2016 | DBpedia | Substring | Dictionary | Wikipedia | — | M,K,G
NERSO [124] | 2012 | DBpedia | Exact | SPARQL | Wikipedia | Tokens | G
SDA [42] | 2011 | DBpedia | Keyword | — | Wikipedia | Tokens | K
SemTag [80] | 2003 | TAP | Keyword | — | Lab. Data | Tokens | K
THD [82] | 2012 | DBpedia | Keyword | Lucene | Wikipedia | GATE JAPE | K,G
Weasel [298] | 2015 | DBpedia | Substring | Dictionary | Wikipedia | Stanford NER | K,G
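To make the Matching strategies of Table 3 concrete, the following sketch contrasts exact keyword lookup with substring/prefix matching over a KB label dictionary. The label list here is a toy example for illustration; real systems index labels from KBs such as DBpedia or YAGO using engines such as Lucene or Elastic.

```python
from bisect import bisect_left

# Toy KB label dictionary, kept sorted so prefix lookups can use binary search.
LABELS = sorted(["Barack Obama", "Berlin", "Berlin Wall", "Obama"])

def prefix_matches(prefix, labels=LABELS):
    """Substring/prefix matching: return all KB labels starting with `prefix`."""
    i = bisect_left(labels, prefix)
    out = []
    while i < len(labels) and labels[i].startswith(prefix):
        out.append(labels[i])
        i += 1
    return out

def keyword_match(mention, labels=LABELS):
    """Exact (keyword-style) lookup of a mention against KB labels."""
    return [label for label in labels if label == mention]

print(prefix_matches("Berlin"))  # ['Berlin', 'Berlin Wall']
print(keyword_match("Obama"))    # ['Obama']
```

Exact lookup misses the mention "Berlin" as a prefix of "Berlin Wall", which is why several systems in Table 3 complement keyword search with substring methods or, for approximate matches, Locality Sensitive Hashing.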
Table 4
Overview of disambiguation features used by EEL systems (M: Metric-based, K: Keyword-based, G: Graph-based, C: Category-based, L: Linguistic-based; mo: mention-only, mm: mention–mention, mc: mention–candidate, co: candidate-only, cc: candidate–candidate; v: various).

Feature columns: String similarity; Type comparison; Keyphraseness; Overlapping mentions; Abbreviations; Mention–candidate contexts; Candidate–candidate contexts; Commonness / Prior; Relatedness (Cocitation); Relatedness (KB); Centrality; Categories; Linguistic. Each column carries a type/scope tag drawn from: [K | cc], [K | mc], [G | cc], [M | mm], [G | mc], [M | mc], [M | mc], [G | cc], [M | mo], [M | mm], [G | co], [C | v], [L | v].

Systems (rows): ADEL [247]; AGDISTIS [301]; AIDA [139]; AIDA-Light [230]; Babelfy [215]; CohEEL [121]; DBpedia Spotlight [199]; DoSeR [336]; ExPoSe [238]; GianniniCDS [116]; JERL [183]; J-NERD [231]; Kan-Dis [144]; KIM [250]; KORE [137]; LINDEN [277]; MAG [216]; NereL [280]; NERFGUN [125]; NERSO [124]; SDA [42]; SemTag [80]; THD [82]; Weasel [298].
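As an illustration of how disambiguation features of the kinds summarized in Table 4 can be combined, the following sketch scores candidates with a linear combination of a commonness/prior (Metric-based, mention–candidate) and a keyword-based context similarity (mention–candidate). The commonness values and candidate descriptions below are invented for illustration; real systems estimate such priors, e.g., from Wikipedia anchor texts.

```python
from collections import Counter
from math import sqrt

# Toy commonness statistics: P(entity | mention); illustrative values only.
COMMONNESS = {"Paris": {"Paris_(France)": 0.9, "Paris_Hilton": 0.1}}
# Toy textual descriptions standing in for KB/Wikipedia candidate contexts.
CONTEXT = {"Paris_(France)": "capital city France Seine Eiffel",
           "Paris_Hilton": "American media personality heiress"}

def cosine(a, b):
    """Cosine similarity between two bags of words (keyword-based feature)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(mention, text, alpha=0.5):
    """Rank candidates by a weighted sum of prior and context similarity."""
    scores = {c: alpha * p + (1 - alpha) * cosine(text, CONTEXT[c])
              for c, p in COMMONNESS[mention].items()}
    return max(scores, key=scores.get)

print(disambiguate("Paris", "the capital city on the Seine"))  # Paris_(France)
```

Systems differ mainly in which such features they use and how the scores are combined (thresholds, objective functions, classifiers, etc.), as detailed per system in Table 4.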
al. [123] extract emerging entities with NIL annotations in cases where the confidence of KB candidates is below a threshold. On the other hand, even if an application does not need recognition of emerging entities, when considering a given approach or tool, it may be important to consider the cost/feasibility of periodically updating the KB in dynamic scenarios (e.g., recognizing emerging trends in social media).
– Performance and overhead: In scenarios where EEL must be applied over large and/or highly dynamic inputs, the performance of the EEL system becomes a critical consideration, where tools can vary by orders of magnitude with respect to runtimes. Likewise, EEL systems may have prohibitive hardware requirements, such as having to store the entire dictionary in primary memory, and/or the need to collectively model all mentions and entities in a given text in memory, etc. The requirements of a particular system can then be an important practical factor in certain scenarios. For example, the AIDA-Light [230] system greatly improves on the runtime performance of AIDA [321], with a slight loss in precision.
– Output quality: Quality is often defined as "fit for purpose", where an EEL output fit for one application/purpose might be unfit for another. For example, a semi-supervised application where a human expert will later curate links might emphasize recall over the precision of the top-ranked candidate chosen, since rejecting erroneous candidates is faster than searching for new ones manually [75]. On the other hand, a completely automatic system may prefer a cautious output, prioritizing precision over recall. Likewise, some applications may only care if an entity is linked once in a text, while others may put a high priority on repeated (short) mentions also being linked. Different purposes provide different instantiations of the notion of quality, and thus may suggest the fitness of one tool over another. Such variability of quality is seen in, for example, GERBIL [302] benchmark results24, where the best system for one dataset may perform worse in another dataset with different characteristics.
– Various other considerations, such as availability of software, availability of appropriate training data, licensing of software, API restrictions, costs, etc., will also often apply.

24 http://gerbil.aksw.org/gerbil/overview

In summary, no one EEL system fits all and EEL remains an active area of research. In order to exploit the inherent strengths and weaknesses of different EEL systems, a variety of ensemble approaches have been proposed. Furthermore, a wide variety of benchmarks and datasets have been proposed for evaluating and comparing such systems. We discuss ensemble systems and EEL evaluation in the following sections.

2.4. Ensemble systems

As previously discussed, different EEL systems may be associated with different strengths and weaknesses. A natural idea is then to combine the results of multiple EEL systems in an ensemble approach (as seen elsewhere, for example, in Machine Learning algorithms [78]). The main goal of ensemble methods is thus to try to compare and exploit complementary aspects of the underlying systems such that the results obtained are better than possible using any single such system. Five such ensemble systems are:

NERD (2012) [263] (Named Entity Recognition and Disambiguation) uses an ontology to integrate the input and output of ten NER and EEL tools, namely AlchemyAPI, DBpedia Spotlight, Evri, Extractiv, Lupedia, OpenCalais, Saplo, Wikimeta, Yahoo! Content Extractor, and Zemanta. Later works proposed classifier-based methods (Naive Bayes, k-NN, SVM) for combining results [264].
BEL (2014) [334] (Bagging for Entity Linking) recognizes entity mentions through Stanford NER, later retrieving entity candidates from YAGO that are disambiguated by means of a majority-voting algorithm according to various ranking classifiers applied over the mentions' contexts.
Dexter (2014) [40] uses TagMe and WikiMiner combined with a collective linking approach to match entity mentions in a text with Wikipedia identifiers, where they propose to be able to switch approaches depending on the features of the input document(s), such as domain, length, etc.
NTUNLP (2014) [47] performs EEL with respect to Freebase using a combination of DBpedia Spotlight and TagMe results, extended with a custom EEL method using the Freebase search API. Thresholds are applied over all three methods and overlapping mentions are filtered.
WESTLAB (2016) [41] uses Stanford NER & ADEL to recognize entity mentions, subsequently merging the output of four linking systems, namely AIDA, Babelfy, DBpedia Spotlight and TagMe.
2.5. Evaluation

EEL involves two high-level tasks: recognition and disambiguation. Thus, evaluation may consider the recognition phase separately, or the disambiguation phase separately, or the entire EEL process as a whole. Given that the evaluation of recognition is well-covered by the traditional NER literature, here we focus on evaluations that consider whether or not the recognized mentions are deemed correct and whether or not the assigned KB identifier is deemed correct.

Given the wide range of EEL approaches proposed in the literature, we do not discuss details of the evaluations of individual tools conducted by the authors themselves. Rather we will discuss some of the most commonly used evaluation datasets and then discuss evaluations conducted by third parties to compare various selections of EEL systems.

Datasets: A variety of datasets have been used to evaluate the EEL process in different settings and under different assumptions. Here we enumerate some datasets that have been used to evaluate multiple tools:

AIDA–CoNLL [139]: The CoNLL-2003 dataset25 consists of 1,393 Reuters news articles whose entities were manually identified and typed for the purposes of training and evaluating traditional NER tools. For the purposes of training and evaluating AIDA [139], the authors manually linked the entities to YAGO. This dataset was later used by ADEL [247], AIDA-Light [230], Babelfy [215], HERD [283], JERL [183], J-NERD [231], KORE [137], amongst others.
AQUAINT [208]: The AQUAINT dataset contains 50 English documents collected from the Xinhua News Service, New York Times, and the Associated Press. Each document contains about 250–300 words, where the first mention of an entity is manually linked to Wikipedia. The dataset was first proposed and used by Milne and Witten [208], and later used by AGDISTIS [301].
ELMD [239]: The ELMD dataset contains 47,254 sentences with 92,930 annotated and classified entity mentions extracted from a collection of Last.fm artist biographies. This dataset was automatically annotated through the ELVIS system26, which homogenizes and combines the output of different Entity Linking tools. It was manually verified to have a precision of 0.94 and is available online27.
IITB [167]: The IITB dataset contains 103 English webpages taken from a handful of domains relating to sports, entertainment, science and technology; the text of the webpages is scraped and semi-automatically linked with Wikipedia. The dataset was first proposed and used by Kulkarni [167] and later used by AGDISTIS [301].
Meij [198]: This dataset contains 562 manually annotated tweets sampled from 20 "verified users" on Twitter and linked with Wikipedia. The dataset was first proposed by Meij et al. [198], and later used by Cornolti et al. [62] to form part of a more general purpose EEL benchmark.
KORE-50 [137]: The KORE-50 dataset contains 50 English sentences designed to offer a challenging set of examples for Entity Linking tools; the sentences relate to various domains, including celebrities, music, business, sports, and politics. The dataset emphasizes short sentences, entity mentions with a high number of occurrences, highly ambiguous mentions, and entities with low prior probability. The dataset was first proposed and used for KORE [137], and later reused by Babelfy [215] and Kan-Dis [144], amongst others.
MEANTIME [211]: The MEANTIME dataset consists of 120 English Wikinews articles on topics relating to finance, with translations to Spanish, Italian and Dutch. Entities are annotated with links to DBpedia resources. This dataset has been recently used by ADEL [247].
MSNBC [67]: The MSNBC dataset contains 20 English news articles from 10 different categories, which were semi-automatically annotated. The dataset was proposed and used by Cucerzan [67], and later reused to evaluate AGDISTIS [301], LINDEN [277] and by Kulkarni et al. [167].
VoxEL [267]: The VoxEL dataset contains 15 news articles (on politics) in 5 different languages sourced from the VoxEurop website28. It was manually annotated with the NIFify system29 using two different criteria for labelling: a strict version containing 204 annotated mentions (per language) of persons, organizations and locations; and a relaxed version containing 674 annotated mentions (per language) of Wikipedia entities.

25 https://www.clips.uantwerpen.be/conll2003/ner.tgz
26 https://github.com/sergiooramas/elvis
27 https://www.upf.edu/web/mtg/elmd
28 https://voxeurop.eu/
29 https://github.com/henryrosalesmendez/NIFify
WP [137]: The WP dataset samples English Wikipedia articles relating to heavy metal musical groups. Articles with related categories are retrieved and sentences with at least three named entities (found by anchor text in links) are kept; in total, 2,019 sentences are considered. The dataset was first proposed and used for the KORE [137] system and also later used by AIDA-Light [230].

Aside from being used for evaluation, we note that such datasets – particularly larger ones like AIDA–CoNLL – can be (and are) used for training purposes. Moreover, although varied gold standard datasets have been proposed for EEL, Jha et al. [151] noted some issues regarding such datasets, for example, data consensus (there is a lack of consensus on standard rules for annotating entities), updates (KB links change over time), and annotation quality (regarding the number and expertise of the evaluators and judges of the dataset). Thus, Jha et al. [151] propose the Eaglet system for detecting such issues over existing datasets.

Metrics: Traditional metrics such as Precision, Recall, and F-measure are applied to evaluate EEL systems. Moreover, micro and macro variants are also applied in systems such as AIDA [321] and DoSeR [336] and frameworks such as BAT [62] and GERBIL [302]; taking Precision as an example, macro-Precision considers the average Precision over individual documents or sentences, while micro-Precision considers the entire gold standard as one test without distinguishing the individual documents or sentences. Other systems and frameworks may use measures that distinguish the type of entity or the type of mention, where, for example, the GERBIL framework distinguishes results for KB entities from emerging entities.

Third-party comparisons: A number of third-party evaluations have been conducted in order to compare various EEL tools. Note that we focus on evaluations that include a disambiguation step, and thus exclude studies that focus only on NER (e.g., [134]).

As previously discussed, Rizzo and Troncy [118] proposed the NERD approach to integrate various Entity Linking tools with online APIs. They also provided some comparative results for these tools, namely Alchemy, DBpedia Spotlight, Evri, OpenCalais and Zemanta [263]. More specifically, they compared the number of entities detected by each tool from 1,000 New York Times articles, considering six entity types: person, organization, country, city, time and number. These results show that while the commercial black-box tools managed to detect thousands of entities, DBpedia Spotlight only detected 16 entities in total; to the best of our knowledge, the quality of the entities extracted was not evaluated. However, in follow-up work by Rizzo et al. [264], the authors use the AIDA–CoNLL dataset and a Twitter dataset to compare the linking precision, recall and F-measure of Alchemy, DataTXT, DBpedia Spotlight, Lupedia, TextRazor, THD, Yahoo! and Zemanta. In these experiments, Alchemy generally had the highest recall, DataTXT or TextRazor the highest precision, while TextRazor had the best F-measure for both datasets.

Gangemi [110] presented an evaluation of tools for Knowledge Extraction on the Semantic Web (or tools trivially adaptable to such a setting). Using a sample text obtained from an extract of an online article of The New York Times30 as input, he evaluated the precision, recall, F-measure and accuracy of several tools for diverse tasks, including Named Entity Recognition, Entity Linking (referred to as Named Entity Resolution), Topic Detection, Sense Tagging, Terminology Extraction, Terminology Resolution, Relation Extraction, and Event Detection. Focusing on the EEL task, he evaluated nine tools: AIDA, Stanbol, CiceroLite, DBpedia Spotlight, FOX, FRED+Semiosearch, NERD, Semiosearch and Wikimeta. In these results, AIDA, CiceroLite and NERD had perfect precision (1.00), while Wikimeta had the highest recall (0.91); in a combined F-measure, Wikimeta fared best (0.80), with AIDA (0.78), FOX (0.74) and CiceroLite (0.71) not far behind. On the other hand, the observed precision (0.75) and in particular recall (0.27) of DBpedia Spotlight was relatively low.

Cornolti et al. [62] presented an evaluation framework for Entity Linking systems, called the BAT-framework31. The authors used this framework to evaluate five systems – AIDA, DBpedia Spotlight, Illinois Wikifier, M&W Miner and TagMe (v2) – with respect to five publicly available datasets – AIDA–CoNLL, AQUAINT, IITB, Meij and MSNBC – that offer a mix of different types of inputs in terms of domains, lengths, densities of entity mentions, and so forth. In their experiments, quite consistently across the various datasets and configurations, AIDA tended to have the highest precision, TagMe and M&W Miner tended to have the highest recall, while TagMe tended to have the highest F-measure; one exception to this trend was the IITB dataset based on long webpages, where DBpedia Spotlight had the highest recall (0.50), while AIDA had very low recall (0.04); on the other hand, for this dataset, M&W Miner had the best F-measure (0.52). An interesting aspect of Cornolti et al.'s study is that it includes performance experiments, where the authors found that TagMe was an order of magnitude faster for the AIDA–CoNLL dataset than any other tool while still achieving the best F-measure on that dataset; on the other hand, AIDA and DBpedia Spotlight were amongst the slowest tools, being around 2–3 orders of magnitude slower than TagMe.

Trani et al. [40] and Usbeck et al. [302] later provided evaluation frameworks based on the BAT-framework. First, Trani et al. proposed DEXTER-EVAL, which allows one to quickly load and run evaluations following the BAT framework32. Later, Usbeck et al. [302] proposed GERBIL33, where the tasks defined for the BAT-framework are reused. GERBIL additionally packages six new tools (AGDISTIS, Babelfy, Dexter, NERD, KEA and WAT), six new datasets, and offers improved extensibility to facilitate the integration of new annotators, datasets, and measures. However, the focus of the paper is on the framework and although some results are presented as examples, they only involve particular systems or particular datasets.

Derczynski et al. [77] focused on a variety of tasks over tweets, including NER/EL, which has unique challenges in terms of having to process short texts with little context, heavy use of abbreviated mentions, lax capitalization and grammar, etc., but also has unique opportunities for incorporating novel features, such as user or location modeling, tags, followers, and so forth. While a variety of approaches are evaluated for NER, with respect to EEL, the authors evaluated four systems – DBpedia Spotlight, TextRazor, YODIE and Zemanta – over two Twitter datasets – a custom dataset (where entity mentions are given to the system for disambiguation) and the Meij dataset (where the raw tweet is given). In general, the systems struggled in both experiments. YODIE – a system with adaptations for Twitter – performed best in the first disambiguation task (note that TextRazor was not tested). In the second task, DBpedia Spotlight had the best recall (0.48), TextRazor had the highest precision (0.65), while Zemanta had the best F-measure (0.41) (note that YODIE was not run in this second test).

Challenge events: A variety of EEL-related challenge events have been co-located with conferences and workshops, providing a variety of standardized tasks and calling for participants to apply their techniques to the tasks in question and submit their results. These challenge events thus offer an interesting format for empirical comparison of different tools in this space. Amongst such events considering an EEL-related task, we can mention:

Entity Recognition and Disambiguation (ERD) is a challenge at the Special Interest Group on Information Retrieval Conference (SIGIR), where the ERD'14 challenge [37] featured two tasks for linking mentions to Freebase: a short-text track considering 500 training and 500 test keyword searches from a commercial engine, and a long-text track considering 100 training and 100 testing documents scraped from webpages.
Knowledge Base Population (KBP) is a track at the NIST Text Analysis Conference (TAC) with an Entity Linking Track, providing a variety of EEL-related tasks (including multi-lingual scenarios), as well as training corpora, validators and scorers for task performance34.
Making Sense of Microposts (#Microposts2016) is a workshop at the World Wide Web Conference (WWW) with a Named Entity rEcognition and Linking (NEEL) Challenge, providing a gold standard dataset for evaluating named entity recognition and linking tasks over microposts, such as found on Twitter35.
Open Knowledge Extraction (OKE) is a challenge hosted by the European Semantic Web Conference (ESWC), which typically contains two tasks, the first of which is an EEL task using the GERBIL framework [302]; ADEL [247] won in 2015 while WESTLAB [41] won in 201636.
Workshop on Noisy User-generated Text (W-NUT) is hosted by the Annual Meeting of the Association for Computational Linguistics (ACL), which provides training and development data based on the CoNLL data format37.

30 http://www.nytimes.com/2012/12/09/world/middleeast/syrian-rebels-tied-to-al-qaeda-play-key-role-in-war.html
31 https://github.com/marcocor/bat-framework
32 https://github.com/diegoceccarelli/dexter-eval
33 http://aksw.org/Projects/GERBIL.html
34 http://nlp.cs.rpi.edu/kbp/2014/
35 http://microposts2016.seas.upenn.edu/challenge.html
36 https://project-hobbit.eu/events/open-knowledge-extraction-oke-challenge-at-eswc-2017/
37 http://noisy-text.github.io/2017/
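The micro versus macro distinction discussed above under Metrics can be made concrete as follows, taking Precision as the example; the per-document counts used here are purely illustrative:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def micro_macro_precision(per_document):
    """Given per-document (true positive, false positive) counts, compute
    (micro, macro) Precision: macro averages per-document precisions, while
    micro pools all decisions into a single computation."""
    macro = sum(precision(tp, fp) for tp, fp in per_document) / len(per_document)
    tp = sum(t for t, _ in per_document)
    fp = sum(f for _, f in per_document)
    micro = precision(tp, fp)
    return micro, macro

# Two documents: one small and perfect, one large and noisy.
docs = [(2, 0), (10, 10)]
micro, macro = micro_macro_precision(docs)
print(round(micro, 3), round(macro, 3))  # 0.545 0.75
```

The divergence between the two values (0.545 versus 0.75) shows why both variants are reported: macro weights each document equally, whereas micro weights each linking decision equally, so large documents dominate the micro score.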
We highlight the diversity of conferences at which such events have been hosted – covering Linguistics, the Semantic Web, Natural Language Processing, Information Retrieval, and the Web – indicating the broad interest in topics relating to EEL.

2.6. Summary

Many EEL approaches have been proposed in the past 15 years or so – in a variety of communities – for matching entity mentions in a text with entity identifiers in a KB; we also notice that the popularity of such works increased immensely with the availability of Wikipedia and related KBs. Despite the diversity in proposed approaches, the EEL process comprises two conceptual steps: recognition and disambiguation. In the recognition phase, entity mentions in the text are identified. In EEL scenarios, the dictionary will often play a central role in this phase, indexing the labels of entities in the KB as well as contextual information. Subsequently, mentions in the text referring to the dictionary can be identified using string-, token- or NER-based approaches, generating candidate links to KB identifiers. In the disambiguation phase, candidates are scored and/or selected for each mention; here, a wide range of features can be considered, relying on information extracted about the mention, the keywords in the context of the mentions and the candidates, the graph induced by the similarity and/or relatedness of mentions and candidates, the categories of an external reference corpus, or the linguistic dependencies in the input text. These features can then be combined by various means – thresholds, objective functions, classifiers, etc. – to produce a final candidate for each mention or a support for each candidate.

2.7. Open Questions

While the EEL task has been widely studied in recent years, many important research questions remain open, where our survey suggests the following:

– Defining "Entity": A foundational question that remains open is to rigorously define what is an "entity" in the context of EEL [179,151,269]. The traditional definition from the NER community considers mentions of entities from fixed classes, such as Person, Place, or Organization. However, EEL is often conducted with respect to KBs that contain entities from potentially hundreds of classes. Hence some tools and datasets choose to adopt a more relaxed notion of "entity"; for example, the KORE dataset contains the element dbr:Rock_music that might be considered by some as a concept and not an entity (and hence the subject of word sense disambiguation [224,215] rather than EEL). There is also a lack of consistency regarding how emerging entities not in the KB, overlapping entities, coreferences, etc., should be handled [179,269]. Thus, a key open question relates to finding an agreement on what entities may, should and/or must be extracted and linked as part of the EEL process.
– Multilingual EEL: EEL approaches have traditionally focused on English texts. However, more and more approaches are considering EEL over non-English texts, including Babelfy [215], MAG [216], THD [82], and updated versions of legacy systems such as DBpedia Spotlight [72]. Such systems face a number of open challenges, including the development of language-specific or language-agnostic components (e.g., having POS taggers for different languages), the disparity of reference information available for different languages (e.g., Wikipedia is more complete for English than other languages), as well as being robust to language variations (e.g., differences in alphabet, capitalization, punctuation) [268].
– Specialized Settings: While the majority of EEL approaches consider relatively clean and long text documents as input – such as news articles – other applications may require EEL over noisy or short text. One example that has received attention recently is the application of EEL methods over Twitter [320,322,77,76], which presents unique challenges – such as the frequent use of slang and abbreviations, a lack of punctuation and capitalization, as well as having limited context – but also presents unique opportunities – such as leveraging user profiles and social context. Beyond Twitter, EEL could be applied in any number of specialized settings, each with its own challenges and opportunities, raising further open questions.
– Novel Techniques: Improving the precision and recall of EEL will likely remain an open question for years to come; however, we can identify two main trends that are likely to continue into the future. The first trend is the use of modern Machine Learning techniques for EEL; for example, Deep Learning [120] has been investigated in the context of improving both recognition [85] and disambiguation [105]. The second trend is towards
approaches that consider multiple related tasks in a joint approach, be it to combine recognition and disambiguation [183,231], or to combine word sense disambiguation and EEL [215,144], etc. Novel techniques are required to continue to improve the quality of EEL results.
– Evaluation: Though benchmarking frameworks such as BAT [62] and GERBIL [302] represent important milestones towards better evaluating and comparing EEL systems, potentially much more work can be done along these lines. With respect to datasets, creating gold standards often requires significant manual labor, where mistakes may sometimes be introduced in the annotation process [151], or datasets may be labeled with respect to incompatible notions of "entity" [179,151,269]. Moreover, the domains [87] and languages [268] covered by existing datasets are limited. Aside from the need for more labeled datasets, evaluations tend to consider EEL systems as complex black boxes, which obfuscates the reasons for a particular system's success or failure; more fine-grained evaluation of techniques – rather than systems – could potentially offer more fundamental insights into the EEL process, leading to further research questions.

3. Concept Extraction & Linking

A given corpus may refer to one or more domains, such as Medicine, Finance, War, Technology, and so forth. Such domains may be associated with various concepts indicating a more specific topic, such as "breast cancer", "solid state disks", etc. Concepts (unlike many entities) are often hierarchical, where, for example, "breast cancer", "melanoma", etc., may indicate concepts that specialize the more general concept of "cancer", which in turn specializes the concept of "disease", etc.

For the purposes of this section, we coin the generic phrase Concept Extraction & Linking to encapsulate the following three related but subtly distinct Information Extraction tasks – as discussed in Appendix A – that can be brought to bear in terms of gaining a greater understanding of the concepts spoken about in a corpus, which in turn can help, for example, to understand the important concepts in the domain that a collection of documents is about, or the topic of a document.

Terminology Extraction (TE): Given a corpus we know to be in a given domain, we may be interested to learn what terms/concepts are core to the terminology of that domain38. (See Listing 12, Appendix A, for an example.)
Keyphrase Extraction (KE): This task focuses on extracting important keyphrases for a given document39. In contrast with TE, which focuses on extracting important concepts relevant to a given domain, KE is focused on extracting important concepts relevant to a particular text. (See Listing 13, Appendix A, for an example.)
Topic Modeling (TM): The goal of Topic Modeling is to analyze cooccurrences of related keywords and cluster them into candidate groupings that potentially capture higher-level semantic "topics"40. (See Listing 14, Appendix A, for an example.)

There is a clear connection between TE and KE: though the goals are somewhat divergent – the former focuses on understanding the domain itself while the latter focuses on categorizing documents – both require extraction of domain terms/keyphrases from text and hence we summarize works in both areas together. Likewise, the methods employed and the results gained through TE and KE may also overlap with the previously studied task of Entity Extraction & Linking (EEL). Abstractly, one can consider EEL as focusing on the extraction of individuals, such as "Saturn"; on the other hand, TE and KE focus on the extraction of conceptual terms, such as "planets". However, this distinction is often fuzzy, since TE and KE approaches may identify "Saturn" as a term referring to a domain concept, while EEL approaches may identify "planets" as an entity mention. Indeed, some papers that claim to perform Keyphrase Extraction are indistinguishable from techniques for performing entity extraction/linking [202], and vice versa.

However, we can draw some clear general distinctions between EEL and the domain extraction tasks discussed in this section: the goal in EEL is to extract all entities mentioned, while the goal in TE and KE is to extract a succinct set of domain-relevant keywords that capture the terminology of a domain or the subject of a document. When compared with EEL, another distinguishing aspect of TE, KE and TM is that while the

38 Also known as Term Extraction [98], Term Recognition [12], Vocabulary Extraction [83], Glossary Extraction [60], etc.
39 Often simply referred to as Keyword Extraction [214,150].
40 Sometimes referred to as topic extraction [111] or topic classification [304].
former task will produce a flat list of candidate identifiers for entity mentions in a text, the latter tasks (often) go further and attempt to induce hierarchical relations or clusters from the extracted terminology.

In this section, we discuss works relating to TE, KE and TM that directly relate to the Semantic Web, be it to help in the process of building an ontology or KB, or using an ontology or KB to guide the extraction process, or linking the results of the extraction process to an ontology or KB. We highlight that this section covers a wide diversity of works from authors working in a wide variety of domains, with different perspectives, using different terminology; hence our goal is to cover the main themes and to abstract some common aspects of these works rather than to capture the full detail of all such heterogeneous approaches.

Example: A sample of TE and KE results is provided in Listing 2, based on the examples provided in Listings 12 and 13 (see Appendix A). One motivation for applying such techniques in the context of the Semantic Web is to link the extracted terms with disambiguated identifiers from a KB. The example output consists of (hypothetical) RDF triples linking extracted terms to categorical concepts described in the DBpedia KB. These linked categories in the KB are then associated with hierarchical relations that may be used to generalize or specialize the topic of the document.

Listing 2: Concept Extraction and Linking example

  Input: [Breaking Bad Wikipedia article text]

  Output sample (TE/KE):
    primetime emmy awards (TE)
    golden globe awards (TE)
    american crime drama television series (TE, KE)
    amc network (KE)

  Output (CEL):
    dbr:Breaking_Bad dct:subject
      dbc:Primetime_Emmy_Award_winners ,
      dbc:Golden_Globe_winners ,
      dbc:American_crime_drama_television_series ,
      dbc:AMC_(TV_channel)_network_shows .

Applications: In the context of the Semantic Web, a core application of CEL tasks – and a major focus of TE in particular – is to help with the creation, validation or extension of a domain ontology. Automatically extracting an expressive domain ontology from text is, of course, an inherently challenging task that falls within the area of ontology learning [185,225,51,31,316]. In the context of TE, the focus is on extracting a terminological ontology [168,98] (aka. lexicalized ontology [227], termino-ontology [227] or simple ontology [43]), which captures terms referring to important concepts in the domain, potentially including a taxonomic hierarchy between concepts or identifying terms that are aliases for the same concept. The resulting concepts (and hierarchy) may be used, for example, in a semi-automated ontology building process to seed or extend the concepts in the ontology.^41

^41 In the context of ontology building, some authors distinguish an onomasiological process from a semasiological process, where the former process relates to taking a known concept in an ontology and extracting the terms by which it may be referred to in a text, while the latter process involves taking terms and extracting their underlying conceptual meaning in the form of an ontology [36].

Other applications relate to categorizing documents in a corpus according to their key concepts, and thus by topic and/or domain; this is the focus of the KE and TM tasks in particular. When these high-level topics are related back to a particular KB, this can enable various forms of semantic search [122,172,295], for example to navigate the hierarchy of domains/topics represented by the KB while browsing or searching documents. Other applications include text enrichment or semantic annotation whereby terms in a text are tagged with structured information from a reference KB or ontology [308,161,223,307].

Process: The first step in all such tasks is the extraction of candidate domain terms/keywords in the text, which may be performed using variations on the methods for EEL; this process may also involve a reference ontology or KB used for dictionary or learning purposes, or to seed patterns. The second step is to perform a filtering of the terms, selecting only those that best reflect the concepts of the domain or the subject of a document. A third optional step is to induce a hierarchy or clustering of the extracted terms, which may lead to either a taxonomy or a topic model; in the case of a topic model, a further step may be to identify a singular term that identifies each cluster. A final optional step may be to link terms or topic identifiers to an existing KB, including disambiguation where necessary (if not already implicit in a previous step). In fact, the steps described in this process may not always be sequential; for example, where a reference KB or ontology is used, it may not be necessary to induce a hierarchy from the terms since such a hierarchy may already be given by the reference source.
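To make the generic process just described concrete, the following sketch (in Python) strings the steps together over a toy input, producing Listing-2-style triples. The n-gram extractor, the dictionary-style "KB" and all identifiers here are illustrative assumptions, not the method of any particular surveyed system.

```python
# Minimal sketch of the generic CEL process: extract candidates,
# filter them, and link the survivors to KB identifiers.
import re

KB = {  # toy reference KB: surface form -> disambiguated identifier
    "primetime emmy awards": "dbc:Primetime_Emmy_Award_winners",
    "golden globe awards": "dbc:Golden_Globe_winners",
}

def extract_candidates(text):
    """Step 1: extract raw candidate terms (here: lowercased n-grams)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {" ".join(tokens[i:i + n])
            for n in (1, 2, 3) for i in range(len(tokens) - n + 1)}

def filter_terms(candidates):
    """Step 2: keep domain-relevant terms (here: dictionary lookup)."""
    return {c for c in candidates if c in KB}

def link_terms(doc_id, terms):
    """Final step: link filtered terms to KB identifiers as RDF-like triples."""
    return {(doc_id, "dct:subject", KB[t]) for t in terms}

text = "The series won Primetime Emmy Awards and Golden Globe Awards."
triples = link_terms("dbr:Breaking_Bad", filter_terms(extract_candidates(text)))
```

Real systems replace each stand-in with the techniques discussed below (POS patterns or statistical filtering instead of a hand-made dictionary, disambiguation instead of exact lookup), but the extract–filter–link skeleton is the same.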
3.1. Terminology/Keyphrase Recognition

We consider a term to be a textual mention of a domain-specific concept, such as "cancer". Terms may also be composed of more than one word and indeed of relatively complex phrases, such as "inner planets of the solar system". Terminology then refers more generically to the collection of terms or specialized vocabulary pertinent to a particular domain. In the context of TE, terms/terminology are typically understood as describing a domain, while in the context of KE, keyphrases are typically understood as describing a particular document. However, the extraction processes of TE and KE are similar; hence we proceed by generically discussing the extraction of terms.

In fact, approaches to extract raw candidate terms follow a similar line to that for extracting raw entity mentions in the context of EEL. Generic preprocessing methods such as stop-word removal, stemming and/or lemmatization are often applied, along with tokenization. Some term extraction methods then rely on window-based methods, extracting n-grams up to a predefined length [98]. Other term extractors apply POS-tagging and then define shallow syntactic patterns to capture, for example, noun phrases ("solar system"), noun phrases prefixed by adjectives ("inner planets"), and so forth [219,83,60]. Other systems use an ensemble of methods to extract a broad selection of terms that are subsequently filtered [60].

There are, however, subtle differences when compared with extracting entities, particularly when considering traditional NER scenarios looking for names of people, organizations, places, etc.; when extracting terms, for example, capitalization becomes less useful as a signal, and syntactic patterns may need to be more complex to identify concepts such as "inner planets of [the] solar system". Furthermore, the features considered when filtering term candidates change significantly when compared with those for filtering entity mentions (as we will discuss presently). For these reasons, various systems have been proposed specifically for extracting terms, including KEA [315], TExSIS [184], TermeX [74] and YaTeA [8] (here taking a selection reused by the highlighted systems). Such systems differ from EEL particularly in terms of the filtering process, discussed presently.

3.2. Filtering

Once a set of candidate terms has been identified, a range of features can be used for either automatic or semi-automatic filtering. These features can be broadly categorized as being linguistic or statistical; however, other contextual features can be used, which will be described presently. Furthermore, filtering can be applied with respect to a domain-specific dictionary of terms as taken from a reference KB or ontology.

Linguistic features relate to lexical or syntactic aspects of the term itself, where the most basic such feature would be the number of words forming the term (more words indicating more specific terms and vice versa). Other linguistic features can likewise include generic aspects such as POS tags [219,214,122,172,223], shallow syntactic patterns [214,117,83,60], etc.; such features may be used in the initial extraction of terms or as a post-filtering step. Furthermore, terms may be filtered or selected based on appearing in a particular hierarchical branch of terminology, such as terms relating to forms of cancer; these techniques will be discussed in the next subsection.

As explained in Appendix A, producing and maintaining linguistic patterns/rules is a time-consuming task, which in turn results in incomplete rules. Statistical measures look more broadly at the usage of a particular term in a corpus. In terms of such measures, two key properties of terms are often analyzed in this context [60]: unithood and termhood.

Unithood refers to the cohesiveness of the term as referring to a single concept, which is often assessed through analysis of collocations: expressions where the meaning of each individual word may vary widely from their meaning in the expression such that the meaning of the word depends directly on the expression; an example collocation might be "mean squared error" where, in particular, the individual words "mean" and "squared" taken in isolation have meanings unrelated to the expression, where the phrase "mean squared error" thus has high unithood. There are then a variety of measures to detect collocations, most based on the idea of comparing the expected number of times the collocation would be found if occurrences of the individual words were independent versus the number of times the collocation actually appears [88]. As an example, Lossio-Ventura et al. [307] use Web search engines to determine collocations, where they estimate the unithood of a candidate term as the ratio of search results returned for the exact phrase ("mean squared error") versus the number of results returned for all three terms (mean AND squared AND error). Unithood thus addresses a particular challenge – similar to that of overlapping entities in EEL (recalling "New York City") – where extraction of partial mentions (such as "squared error") may lose meaning, particularly if linkable to the KB.

The second form of statistical measure, called termhood, refers to the relevance of the term to the domain in question.
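The collocation-based unithood idea just described can be sketched as follows: a PMI-style score compares how often a bigram is observed against what independent occurrences of its words would predict. The toy corpus and the exact formula are illustrative assumptions, not a specific surveyed measure.

```python
# PMI-style collocation score: log of the observed bigram probability
# over the probability expected if the two words were independent.
import math
from collections import Counter

corpus = ("mean squared error is a loss ; the mean squared error "
          "penalizes outliers ; the mean of a squared value").split()

unigrams = Counter(corpus)                    # word counts
bigrams = Counter(zip(corpus, corpus[1:]))    # adjacent-pair counts
N = len(corpus)

def pmi(w1, w2):
    """Positive when (w1, w2) co-occurs more than independence predicts."""
    p_joint = bigrams[(w1, w2)] / (N - 1)
    p_indep = (unigrams[w1] / N) * (unigrams[w2] / N)
    return math.log(p_joint / p_indep)
```

On this toy corpus, the collocation "mean squared" scores higher than an incidental pair such as "a squared", matching the intuition that high-unithood phrases cohere beyond chance co-occurrence.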
To measure termhood, variations on the theme of the TF–IDF measure are commonly used [122,172,83,98,60,307], where, for example, terms that appear often in a (domain-)specific text (high TF) but appear less often in a general corpus (high IDF) indicate higher termhood. Note that termhood relates closely to Topic Modeling measures, where the context of terms is used to find topically-related terms; such approaches will be discussed later.

Other features can rather be contextual, looking at the position of terms in the text [252]; such features are particularly important in the context of identifying keyphrases/terms that capture the domain or topic of a given document. The first such feature is known as the phrase depth, which measures how early in the document is the first appearance of the term: phrases that appear early on (e.g., in the title or first paragraph) are deemed to be most relevant to the document or the domain it describes. Likewise, terms that appear throughout the entire document are considered more relevant: hence the phrase lifespan – the ratio of the document lying between the first and last occurrence of the term – is also considered as an important feature [252].

A KB can also be used to filter terms through a linking process. The most simple such procedure is to filter terms that cannot be linked to the KB [161]. Other proposed methods rather apply a graph-based filtering, where terms are first linked to the KB and then the sub-graph of the KB induced by the terms is extracted; subsequently, terms in the graph that are disconnected [46] or exhibiting low centrality [143] can be filtered. This process will be described in more detail later.

3.3. Hierarchy Induction

Often the extracted terms will refer to concepts with some semantic relations that are themselves useful to model as part of the process. The semantic relations most often considered are synonymy (e.g., "heart attack" and "myocardial infarction" being synonyms) and hypernymy/hyponymy (e.g., "migraine" is a hyponym of "headache" with the former being a more specific form of the latter, while conversely "headache" is a hypernym of "migraine"). While synonymy induces groups of terms with (almost exactly) the same meaning, hypernymy/hyponymy induces a hierarchical structure over the terms. Such relations and structures can be extracted either by analysis of the text itself and/or through the information gained from some reference source. These relations can then be used to filter relevant terminology, or to induce an initial semantic structure that can be formalized as a taxonomy (e.g., expressed in the SKOS standard [206]), or as a formal ontology (e.g., expressed in the OWL standard [135]), and so forth. As such, this topic relates heavily to the area of ontology learning, where we refer to the textbook by Cimiano [49] and the more recent survey of Wong et al. [316] for details; here our goal is to capture the main ideas.

In terms of detecting hypernymy from the text itself, a key method relies on distinguishing the head term, which signifies the more general hypernym in a (potentially) multi-word term, from modifier terms, which then specialize the hypernym [305,32,152,227,6]. For example, the head term of "metastatic breast cancer" is "cancer", while "breast" and "metastatic" are modifiers that specialize the head term and successively create hyponyms. As a more complex example, the head term of "inner planets of the solar system" would be "planets", while "inner" and "of the solar system" are modifying phrases. Analysis of the head/modifier terms thus allows for automatic extraction of hypernymic relations, starting with the most general head term, such as "cancer", and then subsequently extending to hyponyms by successively adding modifiers appearing in a given multi-word term, such as "breast cancer", "metastatic breast cancer", and so forth.

Of course, analyzing head/modifier terms will miss hypernyms not involving modifiers, and synonyms; for example, the hyponym "carcinoma" of "cancer" is unlikely to be revealed by such analysis. An alternative approach is to rely on lexico-syntactic patterns to detect synonymy or hypernymy. A common set of patterns to detect hypernymy are Hearst patterns [129], which look for certain connectives between noun phrases. As an example, such patterns may detect from the phrase "cancers, such as carcinomas, ..." that "carcinoma" is a hyponym of "cancer". Hearst-like patterns are then used by a variety of systems (e.g., [55,117,152,161,191,6]). While such patterns can capture additional hypernyms with high precision [129], Buitelaar et al. [31] note that finding such patterns in practice is rare and that the approach tends to offer low recall. Hence, approaches have been proposed to use the vast textual information of the Web to find instances of such patterns using, for example, Web search engines such as Google [50], the abstracts of Wikipedia articles [297], amongst other Web sources.

Another approach that potentially offers higher recall is to use statistical analyses of large corpora of text. Many such approaches (e.g., [51,52,58,5]) are based, for example, on distributional semantics, which aggregates the context (surrounding terms) in which a given term appears in a large corpus. The distributional hypothesis then considers that terms with similar contexts are semantically related. Within this grouping of approaches, one can then find more specific strategies based on various forms of clustering [51], Formal Concept Analysis [52], LDA [58], embeddings [5], etc., to find and induce a hierarchy from terms based on their context. These can then be used as the basis to detect synonyms; or more often to induce a hierarchy of hypernyms, possibly adding hidden concepts – fresh hypernyms of cohyponyms – to create a connected tree of more/less specific domain terms.

Of course, reference resources that already contain semantic relations between terms can be used to aid in this process. One important such resource is WordNet [207], which, for a given term, provides a set of possible semantic senses in terms of what it might mean (homonymy/polysemy [303]), as well as sets of synonyms called synsets. Those synsets are then related by various semantic relations, including hypernymy, meronymy (part-of), etc. WordNet is thus a useful reference for understanding the semantic relations between concepts, used by a variety of systems (e.g., [225,55,325,152], etc.). Other systems rather rely on, for example, Wikipedia categorizations [196,6,5] in combination with reference KBs. A core challenge, however, when using such approaches is the problem of word sense disambiguation [224] (sometimes called the semantic interpretation problem [225]): given a term, determine the correct sense in which it is used. We refer to the survey by Navigli [225] for discussion.

An alternative to extracting semantic relations between terms in the text is to instead rely on the existing relations in a given KB [143,304,46,5]. That is to say, if the terms (or indeed simply entities) can be linked to a suitable existing KB, then semantic relations can be extracted from the KB itself rather than the text. This approach is often applied by tools described in the following section (e.g., [143,304,46]), whose goal is to understand the domain of a document rather than attempting to model a domain from text.

3.4. Topic Modeling

While the previous methods are mostly concerned with extracting a terminology from a corpus that describes a given domain (e.g., for the purposes of building a domain-specific ontology), other works are concerned with modeling and potentially identifying the domain to which the documents in a given corpus pertain (e.g., for the purposes of classifying documents). We refer to these latter approaches generically as Topic Modeling approaches [176,197]. Such approaches are based on analysis of terms (or sometimes entities) extracted from the text over which Topic Modeling approaches can be applied to cluster and analyze thematically related terms (e.g., "carcinoma", "malignant tumor", "chemotherapy"). Thereafter, topic labeling [143,4] can be (optionally) applied to assign such groupings of terms a suitable KB identifier referring to the topic in question (e.g., dbr:Cancer). Application areas for such techniques include Information Retrieval [241], Recommender Systems [154], Text Classification [142], Cognitive Science [244], and Social Network Analysis [259], to name but a few.

For applying Topic Modeling, one can of course first consider directly applying the traditional methods proposed in the literature: LSA, pLSA and/or LDA (see Appendix A for discussion). However, these approaches have a number of drawbacks. First, such approaches typically work on individual words rather than multi-word terms (though extensions have been proposed to consider multi-word terms). Second, topics are considered as latent variables associated with a probability of generating words, and thus are not directly "labeled", making them difficult to explain or externalize (though, again, labeled extensions have also been proposed, for example for LDA). Third, words are never semantically interpreted in such models, but are rather considered as symbolic references over which statistical/probabilistic inference can be applied. Hence a number of approaches have emerged that propose to use the structured information available in KBs and/or ontologies to enhance the modeling of topics in text. The starting point for all such approaches is to extract some terms from the text, using approaches previously outlined: some rely simply on token- or POS-based methods to extract terms, which can be filtered by frequency or TF–IDF variants to capture domain relevance [149,148,4], whereas others rather prefer entity recognition tools (which are subsequently mapped to higher-level topics through relations in the KB, as we describe later) [46,169,297].
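The TF–IDF-style domain-relevance filtering mentioned here (and the termhood measures of Section 3.2) can be sketched as follows; the two toy corpora and the particular weighting are illustrative assumptions rather than the formula of any surveyed system.

```python
# Contrastive, TF-IDF-style termhood: frequency in a domain corpus,
# discounted by how common the term is in a general corpus.
import math

domain_docs = ["metastatic breast cancer treatment",
               "breast cancer screening and cancer therapy"]
general_docs = ["the weather today", "a football match report",
                "the game ended in a draw"]

def termhood(term):
    """High TF in the domain corpus and high IDF over the general
    corpus both push the score up."""
    tf = sum(doc.split().count(term) for doc in domain_docs)
    df = sum(term in doc.split() for doc in general_docs)
    idf = math.log((1 + len(general_docs)) / (1 + df))
    return tf * idf
```

On these toy corpora, a domain term such as "cancer" scores well above function words like "the", which is exactly the filtering behavior the surveyed measures aim for.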
With extracted terms in hand, the next step for many approaches – departing from traditional Topic Modeling – is to link those terms to a given KB, where the semantic relations of the KB can be exploited to generate more meaningful topics. The most straightforward such approach is to assume an ontology that offers a concept/class hierarchy to which extracted terms from the document are mapped. Thus the ontology can be seen as guiding the Topic Modeling process, and in fact can be used to select a label for the topic. One such approach is to apply a statistical analysis over the term-to-concept mapping. For example, in such a setting, Jain and Pareek [148] propose the following: for each concept in the ontology, count the ratio of extracted terms mapped to it or its (transitive) sub-concepts, and take that ratio as an indication of the relevance of the concept in terms of representing a high-level topic of the document. Another approach is to consider the spanning tree(s) induced by the linked terms in the hierarchy, taking the lowest common ancestor(s) as a high-level topic [143]. However, as noted by Hulpuş et al. [143], such approaches relying on class hierarchies tend to elect very generic topic labels, where they give the example of "Barack Obama" being captured under the generic concept person, rather than a more interesting concept such as U.S. President. To tackle this problem, a number of approaches have proposed to link terms – including entity mentions – to Wikipedia's categories, from which more fine-grained topic labels can be selected for a given text [275,145,65].

Other approaches apply traditional Topic Modeling methods (typically pLSA or LDA) in conjunction with information extracted from the KB. Some approaches propose to apply Topic Modeling in an initial phase directly over the text; for example, Canopy [143] applies LDA over the input documents to group words into topics and then subsequently links those words with DBpedia for labeling each topic (described later). On the other hand, other approaches apply Topic Modeling after initially linking terms to the KB; for example, Todor et al. [297] first link terms to DBpedia in order to enrich the text with annotations of types, categories, hypernyms, etc., where the enriched text is then passed through an LDA process. Some recent approaches rather extend traditional topic models to consider information from the KB during the inference of topic-related distributions. Along these lines, for example, Allahyari and Kochut [4] propose an LDA variant, called "OntoLDA", which introduces a latent variable for concepts (taken from DBpedia and linked with the text), which sits between words and topics: a document is then considered to contain a distribution of (latent) topics, which contains a distribution of (latent) concepts, which contains a distribution of (observable) words. Another such hybrid model, but rather based on pLSA, is proposed by Chen et al. [46] where the probability of a concept mention (or a specific entity mention^42) being generated by a topic is computed based on the distribution of topics in which the concept or entity appears and the same probability for entities that are related in the KB (with a given weight).

^42 They use the term entity to refer to both concepts, such as person, and individuals, such as Barack Obama.

The result of these previous methods – applying Topic Modeling in conjunction with a KB-linking phase – is a set of topics associated with a set of terms that are in turn linked with concepts/entities in the KB. Interestingly, the links to the KB then facilitate labeling each topic by selecting one (or few) core term(s) that help capture or explain the topic. More specifically, a number of graph-based approaches have recently been proposed to choose topic labels [149,143,4], which typically begin by selecting, for each topic, the nodes in the KB that are linked by terms under that topic, and then extracting a sub-graph of the KB in the neighborhood of those nodes, where typically the largest connected component is considered to be the topical/thematic graph [149,4]. The goal, thereafter, is to select the "label node(s)" that best summarize(s) the topic, for which a number of approaches apply centrality measures on the topical graph: Janik and Kochut [149] investigate use of a closeness centrality measure, Allahyari and Kochut [4] propose to use the authority score of HITS (later mapping central nodes to DBpedia categories), while Hulpuş et al. [143] investigate various centrality measures, including closeness, betweenness, information and random-walk variants, as well as "focused" centrality measures that assign special weights to nodes in the topic (not just in the neighborhood). On the other hand, Varga et al. [304] propose to extract a KB sub-graph (from DBpedia or Freebase) describing entities linked from the text (containing information about classes, properties and categories), over which weighting schemes are applied to derive input features for a machine learning model (SVM) that classifies the topics of microposts.

3.5. Representation

Domain knowledge extracted through the previous processes may be represented using a variety of Semantic Web formats.
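For instance, a small hierarchy induced over this section's running medical examples could be serialized using the SKOS vocabulary [206]; the ex: identifiers below are hypothetical.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/> .

ex:cancer a skos:Concept ;
    skos:prefLabel "cancer"@en ;
    skos:altLabel  "malignant tumor"@en ;   # alias for the same concept
    skos:broader   ex:disease .             # "disease" is broader than "cancer"

ex:breast_cancer a skos:Concept ;
    skos:prefLabel "breast cancer"@en ;
    skos:broader   ex:cancer .              # hyponym of "cancer"
```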
In the ontology building process, induced concepts may be exported to RDFS/OWL [135] for further reasoning tasks or ontology refinement and development. However, RDFS/OWL makes a distinction between concepts and individuals that may be inappropriate for certain modeling requirements; for example, while a term such as "US Presidents" could be considered as a sub-topic of "US Politics" since any document about the former could be considered also as part of the latter, the former is neither a sub-concept nor an instance of the latter in the set-theoretic setting of OWL.^43 For such scenarios, the Simple Knowledge Organization System (SKOS) [206] was standardized for modeling more general forms of conceptual hierarchies, taxonomies, thesauri, etc., including semantic relations such as broader-than (e.g., hypernym-of), narrower-than (e.g., hyponym-of), exact-match (e.g., synonym-of), close-match (e.g., near-synonym-of), related (e.g., within-same-topic-as), etc.; the standard also offers properties to define primary labels and aliases for concepts.

^43 It is worth noting that OWL does provide means for meta-modeling (aka. punning), where concepts can be simultaneously considered as groups of individuals when reasoning at a terminological level, and as individuals when reasoning at an assertional level.

Aside from these Semantic Web standards, a number of other representational formats have been proposed in the literature. The Lexicon Model for Ontologies, aka. LEMON [53], was proposed as a format to associate ontological concepts with richer linguistic information, which, on a high level, can be seen as a model that bridges the world of formal ontologies and the world of natural language (written, spoken, etc.); the core LEMON concept is a lexical entry, which can be a word, affix or phrase (e.g., "cancer"); each lexical entry can have different forms (e.g., "cancers", "cancerous"), and can have multiple senses (e.g., "ex:cancer_sense1" for medicine, "ex:cancer_sense2" for astrology, etc.); both lexical entries and senses can then be linked to their corresponding ontological concepts (or individuals). Along related lines, Hellmann et al. [130] propose the NLP Interchange Format (NIF), whose goal is to enhance the interoperability of NLP tools by using an ontology to describe common terms and concepts; the format can provide Linked Data as output for further data reuse. Other proposals have also been made in terms of publishing linguistic resources as Linked Data. Cimiano et al. [54] propose such an approach for publishing and linking terminological resources following the Linked Data principles, combining the LEMON, SKOS, and PROV-O vocabularies in their core model; OnLit was proposed by Klimek et al. [165] as a Linked Data version of the LiDo Glossary of Linguistic Terms; etc. For further information, we refer the reader to the editorial by McCrae et al. [194], which offers an overview of terminological/linguistic resources published as Linked Data.

3.6. System summary and comparison

Based on the previously discussed techniques, in Table 5, we provide an overview of highlighted CEL systems that deal with the Semantic Web in a direct way, and that have a publication offering details; a legend is provided in the caption of the table. Note that in the Recognition and Filtering columns, some systems delegate these tasks to external recognition tools – such as DBpedia Spotlight [199], KEA [315], OpenCalais^44, TermeX [74], TermRaider^45, TExSIS [184], WikiMiner [209], YaTeA [8] – which are indicated in the respective columns as appropriate.

^44 http://www.opencalais.com/
^45 https://gate.ac.uk/projects/neon/termraider.html

The approaches reviewed in this section might be applied for diverse and heterogeneous cases. Thus, comparing CEL approaches is not a trivial task; however, we can mention some general aspects and considerations when choosing a particular CEL approach.

– Target task: As per the Goal column of Table 5, the surveyed approaches cover a number of different related tasks with different applications. These include Keyphrase Extraction, Ontology Building, Semantic Annotation, Terminology Extraction, Topic Modeling, etc. An important consideration when choosing an approach is thus to select one that best fits the given application.
– Language: Although the approaches described in this section provide strategies for processing text in English, different languages can also be covered in the TE/KE tasks using similar techniques (e.g., the approach proposed by Lemnitzer et al. [172]). Moreover, term translation is the focus of some approaches, such as OntoLearn [305,225]. Finally, KBs such as DBpedia or BabelNet offer multilingual resources that can support the extraction of terms in varied languages.
Table 5: Overview of Concept Extraction & Linking systems. Setting denotes the primary domain in which experiments are conducted; Goal indicates the stated aim of the system (KE: Keyphrase Extraction, OB: Ontology Building, SA: Semantic Annotation, TE: Terminology Extraction, TM: Topic Modeling); Recognition summarizes the term extraction method used; Filtering summarizes methods used to select suitable domain terms; Relations indicates the semantic relations extracted between terms in the text (Hyp.: Hyponyms, Syn.: Synonyms, Mer.: Meronyms); Linking indicates KBs/ontologies to which terms are linked; Reference indicates other sources used; Topic indicates the method(s) used to model (and label) topics. '—' denotes no information found, not used or not applicable.

| Name | Year | Setting | Goal | Recognition | Filtering | Relations | Linking | Reference | Topic |
|---|---|---|---|---|---|---|---|---|---|
| AllahyariK [4] | 2015 | Multi-domain | TM | Token-based | Statistical | — | DBpedia | Wikipedia | LDA / Graph |
| Canopy [143] | 2013 | Multi-domain | TM | — | — | — | DBpedia | Wikipedia | LDA / Graph |
| CardilloWRJVS [36] | 2013 | Medicine | TE | TExSIS | Manual | — | SNOMED, DBpedia | — | — |
| ChemuduguntaHSS [43] | 2008 | Science | SA | Token-based | Statistical | — | — | CIDE, ODP | LDA |
| ChenJYYZ [46] | 2016 | Comp. Sci., News | TM | DBpedia Spotlight | Graph/Stat. | — | DBpedia | — | pLSA / Graph |
| CimianoHS [52] | 2005 | Tourism, Finance | OB | POS-based | Statistical | Hyp. | — | — | — |
| CRCTOL [152] | 2010 | Terrorism, Sports | OB | POS-based | Statistical | Hyp. | — | WordNet | — |
| Distiller [223] | 2014 | — | KE | Stat./Lexical | Hybrid | — | DBpedia | — | — |
| DolbyFKSS [83] | 2009 | I.T., Energy | TE | POS-based | Statistical | — | DBpedia, Freebase | — | — |
| F-STEP [196] | 2013 | News | TE | WikiMiner | WikiMiner | Hyp. | DBpedia, Freebase | Wikipedia | — |
| FGKBTE [98] | 2014 | Crime, Terrorism | TE | Token-based | Statistical | — | FunGramKB | — | — |
| GillamTA [117] | 2005 | Nanotechnology | OB | Patterns | Hybrid | Hyp. | — | — | — |
| GullaBI [122] | 2006 | Petroleum | OB | POS-based | Statistical | — | — | — | — |
| JainP [148] | 2010 | Comp. Sci. | TM | POS-based | Stat. / Manual | — | Custom onto. | — | Hierarchy |
| JanikK [149] | 2008 | News | TM | — | Statistical | — | Wikipedia | — | Graph |
| LauscherNRP [169] | 2016 | Politics | TM | DBpedia Spotlight | Statistical | — | DBpedia | — | L-LDA |
| LemnitzerVKSECM [172] | 2007 | E-Learning | OB | POS-based | Statistical | Hyp. | OntoWordNet | Web search | — |
| LiTeWi [60] | 2016 | Software, Science | OB | Ensemble | WikiMiner | — | Wikipedia | — | — |
| LossioJRT [307] | 2016 | Biomedical | TE | POS-based | Statistical | — | UMLS, SNOMED | Web search | — |
| MoriMIF [214] | 2004 | Social Data | KE | TermeX | Statistical | — | — | Google | — |
| MuñozGCHN [219] | 2011 | Telecoms | KE | POS-based | Hybrid | — | DBpedia | Wikipedia | — |
| OntoLearn [305,225] | 2001 | Tourism | OB | POS-based | Statistical | Various | — | WordNet, Google | — |
| OntoLT [32] | 2004 | Neurology | OB | POS-based | Statistical | Hyp. | Any | — | — |
| OSEE [160,161] | 2012 | Bioinformatics | SA | POS-based | KB-based | — | Gene Ontology | Various (OBO) | — |
| OwlExporter [314] | 2010 | Software | OB | POS-based | KB-based | — | Any | — | — |
| PIRATES [252] | 2010 | Software | SA | KEA | Hybrid | — | SEOntology | — | — |
| SPRAT [191] | 2009 | Fishery | OB | Patterns | TermRaider | Syn. Hyp. | FAO | WordNet | — |
| TaxoEmbed [5] | 2016 | Multi-domain | OB | — | KB-based | Hyp. | Wikidata, YAGO | Wikipedia, BabelNet | — |
Text2Onto [185,55] 2005 — OB POS/JAPE Statistical Hyp. Mer. — WordNet —
TodorLAP [297] 2016 News TM DBpedia Spotlight — Hyp. DBpedia Wikipedia LDA
TyDI [227] 2010 Biotechnology OB — YaTeA Syn. Hyp. — — —
VargaCRCH [304] 2014 Microposts TM OpenCalais —— DBpedia, Freebase — Graph / ML
ZhangYT [325] 2009 News TE Manual Statistical — — WordNet —
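The recurring Recognition/Filtering combination in Table 5 – POS-based recognition followed by statistical filtering – can be sketched as a two-stage pipeline. The sketch below makes simplifying assumptions: POS tags are supplied inline rather than produced by a tagger, and a bare frequency threshold stands in for richer termhood measures (e.g., C-value or TF-IDF):

```python
# Minimal sketch of POS-based term recognition plus statistical filtering,
# as summarized in the Recognition/Filtering columns of Table 5.
# Inputs are pre-tagged (word, POS) sentences; a real system would tag text.
from collections import Counter

def candidates(tagged):
    """Recognize candidate terms matching the POS pattern (ADJ)* (NOUN)+."""
    terms, current, seen_noun = [], [], False
    for word, pos in tagged + [("", "END")]:   # sentinel forces a final flush
        if pos == "NOUN":
            current.append(word)
            seen_noun = True
        elif pos == "ADJ" and not seen_noun:
            current.append(word)
        else:
            if seen_noun:                      # only keep candidates with a noun head
                terms.append(" ".join(current))
            current, seen_noun = ([word], False) if pos == "ADJ" else ([], False)
    return terms

def filter_terms(tagged_corpus, min_freq=2):
    """Statistical filtering: keep candidates whose frequency passes a threshold."""
    counts = Counter(t for sent in tagged_corpus for t in candidates(sent))
    return {t: c for t, c in counts.items() if c >= min_freq}

demo = [
    [("semantic", "ADJ"), ("web", "NOUN"), ("is", "VERB"), ("big", "ADJ")],
    [("the", "DET"), ("semantic", "ADJ"), ("web", "NOUN")],
]
print(filter_terms(demo))
# -> {'semantic web': 2}
```

The systems in Table 5 differ mainly in what replaces these two stubs: richer POS/JAPE patterns or token-based heuristics for recognition, and statistical, KB-based, or manual criteria for filtering.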
– Output quality. The quality of CEL tasks depends on numerous factors, such as the ontologies and datasets used, the manual intervention involved in their processes, etc. For example, approaches such as OSEE [160,161], OntoLearn [305,225], or that proposed by Chemudugunta et al. [43], rely on ontologies for recognizing or filtering terms; while this strategy could provide increased precision, new terms may not be identified at all and thus a low recall may be produced. On the other hand, approaches by Cardillo et al. [36] and Zhang et al. [325] involve manual intervention; although this is a costly process, it can help ensure a higher quality result over smaller input corpora of text.
– Domain: Although a specific domain is commonly used for testing (e.g., Biomedical, Finance, News, Terrorism, etc.), CEL approaches rely on NLP tools that can be employed for varied domains in a general fashion. However, some CEL approaches may be built with a specific KB in mind. For example, Cardillo et al. [36] and Lossio et al. [307] use SNOMED-CT for the medical domain, while F-STEP [196], Distiller [223], and Dolby et al. [83] use DBpedia. On the other hand, KB-agnostic approaches such as OntoLT [32] and OwlExporter [314] generalize to any KB/domain.
– Text characteristics/Recognition. Different features can be used during the extraction and filtering of terms and topics. For example, some systems deal with the recognition of multi-word expressions (e.g., OntoLearn [305,225]), contextual features provided by FCA (as proposed by Cimiano et al. [52]), or the position of words in a text (e.g., PIRATES [252]). Such a selection of features may influence the results in different ways for particular applications; it may be difficult to anticipate a priori how such factors may influence an application, where it may be best to evaluate such approaches for a particular setting.
– Efficiency and scalability. When faced with a large input corpus, the efficiency of a CEL approach can become a major factor. Some CEL approaches rely on computationally-expensive NLP tasks (e.g., deep parsing), while other approaches rely on more lightweight statistical tasks to extract and filter terms. Further steps to extract a hierarchy or link terms with a KB may introduce a further computational cost. Unfortunately, however, efficiency (in terms of runtimes) is generally not reported in the CEL papers surveyed, which rather focus on metrics to assess output quality.

We can conclude that comparing CEL approaches is complicated not only by the diversity of methods proposed and the goals targeted, but also by a lack of standardized, comparative evaluation frameworks; we will discuss this issue in the following subsection.

3.7. Evaluation

Given the diversity of approaches gathered together in this section, we remark that the evaluation strategies employed are likewise equally diverse. In particular, evaluation varies depending on the particular task considered (be it TE, KE, TM or some combination thereof) and the particular application in mind (be it ontology building, text classification, etc.). Evaluation in such contexts is often further complicated by the potentially subjective nature of the goal of such approaches. When assessing the quality of the output, some questions may be straightforward to answer, such as: Is this phrase a cohesive term (unithood/precision)? On the other hand, evaluation must somehow deal with more subjective domain-related questions, such as: Is this a domain-relevant term (termhood/precision)? Have we captured all domain-relevant terms appearing in the text (recall)? Is this taxonomy of terms correct (precision)? Does this label represent the terms forming this topic (precision)? Does this document have these topics (precision)? Are all topics of the document captured (recall)? And so forth. Such questions are inherently subjective, may raise disagreement amongst human evaluators [172], and may require expertise in the given domain to answer adequately.

Datasets: For evaluating CEL approaches, notably there are many Web-based corpora that have been pre-classified with topics or keywords, often annotated by human experts – such as users, moderators or curators of a particular site – through, for example, tagging systems. These can be reused for evaluation of domain extraction tools, in particular to see if automated approaches can recreate the high-quality classifications or topic models inherent in such corpora. Some such corpora that have been used include: BBC News46 [143,297], British Academic Written Corpus47 [143,4], British National Corpus48 [52],

46 http://mlg.ucd.ie/datasets/bbc.html
47 http://www2.warwick.ac.uk/fac/soc/al/research/collections/bawe/
48 http://www.natcorp.ox.ac.uk/
CNN News [149], DBLP49 [46], eBay50 [308], Enron Emails51, Twenty Newsgroups52 [46,297], Reuters News53 [173,52,325], StackExchange54 [143], Web News [297], Wikipedia categories [65], Yahoo! categories [296], and so forth. Other datasets have been specifically created as benchmarks for such approaches. In the context of KE, for example, gold standards such as SemEval [162] and Crowd500 [188] have been created, with the latter being produced through a crowdsourcing process; such datasets were used by Gagnon et al. [109] and Jean-Louis et al. [150] for KE-related third-party evaluations. In the context of TE, existing domain ontologies can be used – or manually created and linked with text – to serve as a gold standard for evaluation [32,52,325,227,161,27].

Rather than employing a pre-annotated gold standard, an alternative strategy is to apply the approach under evaluation to a non-annotated corpus and thereafter seek human judgment on the output, typically in comparison with baseline approaches from the literature. Such an approach is employed in the context of TE by Nédellec et al. [227], Kim and Tuan [161], and Dolby et al. [83]; or in the context of KE by Muñoz-García et al. [219]; or in the context of TM by Hulpuş et al. [143] and Lauscher et al. [169]; and so forth. In such evaluations, TM approaches are often compared against traditional approaches such as LDA [4], PLSA [46], hierarchical clustering [126], etc.

Metrics: The most commonly used metrics for evaluating CEL approaches are precision, recall, and F1 measure. When comparing with baseline approaches over a priori non-annotated corpora, recall can be a manually-costly aspect to assess; hence comparative rather than absolute measures such as relative recall are sometimes used [161]. In cases where results are ranked, precision@k measures may be used to assess the quality of the top-k terms extracted [307].

Particular tasks may be associated with specialized measures. Evaluations in the area of Topic Modeling may also consider perplexity (the log-likelihood of a held-out test set) and/or coherence [265], where a topic is considered coherent if it covers a high ratio of the words/terms appearing in the textual context from which it was extracted (measured by metrics such as Pointwise Mutual Information (PMI) [60], Normalized Mutual Information (NMI), etc.). On the other hand, for Terminology Extraction approaches with hierarchy induction, measures such as Semantic Cotopy [52] and Taxonomic Overlap [52] may be used to compare the output hierarchy with that of a gold-standard ontology. In cases where a fixed number of users/judges are in charge of developing a gold-standard dataset or assessing the output of systems in an a posteriori manner, the agreement among them is expressed as Cohen's or Fleiss' κ-measure. In this sense, Randolph [254] provides a typical description (and usage examples) of the κ-measure, where the considered aspects are the number of cases or instances to evaluate, the number of human judges, the number of categories, and the number of judges who assigned the case to the same category.

Third-party comparisons: To the best of our knowledge, there has been little work on third-party evaluations for comparing CEL approaches, perhaps because of the diversity of goals and methods applied, the aforementioned challenges associated with evaluating CEL systems, etc. Among the available results, Gangemi [110] compared three commercial tools – Alchemy, OpenCalais and PoolParty – for Topic Extraction, where Alchemy had the highest F-measure. In a separate evaluation for Terminology Extraction – comparing Alchemy, CiceroLite, FOX [288], FRED [111] and Wikimeta – Gangemi [110] reported that FRED [111] had the highest F-measure.55

3.8. Summary

In this section, we gather together three approaches for extracting domain-related concepts from a text: Terminology Extraction (TE), Keyphrase Extraction (KE), and Topic Modeling (TM). While the first task is typically concerned with applications involving ontology building, or otherwise extracting a domain-specific terminology from an appropriate corpus, the latter two tasks are typically concerned with understanding the domain of a given text. As we have seen in

49 http://dblp.uni-trier.de/db/
50 http://www.ebay.com
51 https://www.cs.cmu.edu/~./enron/
52 https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
53 http://www.daviddlewis.com/resources/testcollections/reuters21578/ and http://www.daviddlewis.com/resources/testcollections/rcv1/
54 https://archive.org/details/stackexchange
55 We do not include Alchemy, CiceroLite, nor Wikimeta in the discussion of Table 5 since we have not found any publication providing details on their implementation. Though FOX [288] and FRED [111] do have publications, few details are provided on how CEL is implemented. We will, however, discuss FRED [111] in the context of Relation Extraction in Section 4.
this section, all tasks relate in important aspects, particularly in the identification of domain-relevant concepts; indeed, TE and TM further share the goal of extracting relationships between the extracted terms, be it to induce a hierarchy of hypernyms, to find synonyms, or to find thematic clusters of terms.

While all of the discussed approaches rely – to varying degrees – on techniques proposed in the traditional IE/NLP literature for such tasks, the use of reference ontologies and/or KBs has proven useful for a number of technical aspects inherent to these tasks:

– using the entity labels and aliases of a domain-specific KB as a dictionary to guide the extraction of conceptual domain terms in a text [161,307];
– linking terms to KB concepts and using the semantic relations in the KB to find (un)related terms [149,143,196,4,46];
– enriching text with additional information taken from the KB [297];
– classifying text with respect to an ontological concept hierarchy [148];
– building topic models that include semantic relations from the KB [4,46];
– determining topic labels/identifiers based on the centrality of nodes in the KB graph [143,4,46];
– representing and integrating terminological knowledge [206,53,130,36,54,194].

On the other hand, such processes are also often used to create or otherwise enrich Semantic Web resources, such as for ontology building applications, where TE can be used to extract a set of domain-relevant concepts – and possibly some semantic relations between them – to either seed or enhance the creation of a domain-specific ontology [305,225,51,55,49,316].

3.9. Open Questions

Although the tasks involved in CEL are used in varied applications and domains, some general open questions can be abstracted from the previous discussion, where we highlight the following:

– Interrelation of Tasks: Under the heading of CEL, in this survey we have grouped three main tasks: Terminology Extraction (TE), Keyphrase Extraction (KE) and Topic Modeling (TM). These tasks are often considered in isolation – sometimes by different communities and for different applications – but as per our discussion, they also share clear overlaps. An important open question is thus with respect to how these tasks can be generalized, how approaches to each task can potentially complement each other or, conversely, where the objectives of such tasks necessarily diverge and require distinct techniques to solve.
– Specialized Settings/Multilingual CEL: As was previously discussed for EEL, most approaches for CEL consider complete texts of high quality, such as technical documents, papers, etc. On the other hand, CEL has not been well explored in other settings, such as online discussion fora, Twitter, etc., which may present a different set of challenges. Likewise, most focus has been on English texts, where works such as LEMON [53] highlight the impact that approaches considering multiple languages could have.
– Contextual Information: As previously discussed, TE, KE, and TM can be used to support the extraction of topics from text. As a consequence, such topics can be used to enrich the input text and existing KBs (such as DBpedia). This would be useful in scenarios where input documents need to include further contextual information or to be organized into (potentially new) categories.
– Crowdsourcing: A challenging aspect of CEL is the inherent subjectivity with respect to evaluating the output of such methods. Hence some approaches have proposed to leverage crowdsourcing, amongst which we can mention the creation of the Crowd500 dataset [188] for evaluating KE approaches. An open question is to then further develop on this idea and consider leveraging crowdsourcing for solving specific sub-tasks of CEL or to support evaluation, including for other tasks relating to TE or TM (though of course such an approach may not be suitable for specialist domains requiring expert knowledge).
– Linking: While a variety of approaches have been proposed to extract terminology/keyphrases from text, only more recently has there been a trend towards linking such mentions with a KB [149,161,143,4,46]. While such linking could be seen as a form of EEL, and as being related to word sense disambiguation, it is not necessarily subsumed by either: many EEL approaches tend to focus on mentions of named entities, while word sense disambiguation does not typically consider multi-term phrases. Hence, an interesting subject is to explore custom techniques for linking the conceptual terminology/keyphrases produced by the TE and KE processes to a given KB, which
may include, for example, synonym expansion, domain-specific filtering, etc.
– Benchmark Framework: In Section 3.7, we discussed the diverse ways in which CEL approaches are evaluated. While a number of gold standard datasets have been proposed for specific tasks [162,188], we did not encounter much reuse of such datasets, where the only systematic third-party evaluation we could find for such approaches was that conducted by Gangemi [110]. We thus observe a strong need for a standardized benchmarking framework for CEL approaches, with appropriate metrics and datasets. Such a framework would need to address a number of key open questions relating to the diversity of approaches that CEL encompasses, as well as the subjectivity inherent in its goals (e.g., deciding what terminology is important to a domain, what keyphrases are important to a document, etc.).

4. Relation Extraction & Linking

At the heart of any Semantic Web KB are relations between entities [278]. Thus an important traditional IE task in the context of the Semantic Web is Relation Extraction (RE), which is the process of finding relationships between entities in the text. Unlike the tasks of Terminology Extraction or Topic Modeling that aim to extract fixed relationships between concepts (e.g., hypernymy, synonymy, relatedness, etc.), RE aims to extract instances of a broader range of relations between entities (e.g., born-in, married-to, interacts-with, etc.). Relations extracted may be binary relations or even higher arity n-ary relations. When the predicate of the relation – and the entities involved in it – are linked to a KB, or when appropriate identifiers are created for the predicate and entities, the results can be used to (further) populate the KB with new facts. However, first it is also necessary to represent the output of the Relation Extraction process as RDF: while binary relations can be represented directly as triples, n-ary relations require some form of reified model to encode. By Relation Extraction and Linking (REL), we then refer to the process of extracting and representing relations from unstructured text, subsequently linking their elements to the properties and entities of a KB. In this section, we thus discuss approaches for extracting relations from text and linking their constituent predicates and entities to a given KB; we also discuss representations used to encode REL results as RDF for subsequent inclusion in the KB.

Listing 3: Relation Extraction and Linking example

Input: Bryan Lee Cranston is an American actor. He is known for portraying "Walter White" in the drama series Breaking Bad.

Output:
dbr:Bryan_Cranston dbo:occupation dbr:Actor ;
    dbo:birthPlace dbr:United_States .
dbr:Walter_White_(Breaking_Bad) dbo:portrayer dbr:Bryan_Cranston ;
    dbo:series dbr:Breaking_Bad .
dbr:Breaking_Bad dbo:genre dbr:Drama .

Example: In Listing 3, we provide a hypothetical (and rather optimistic) example of REL with respect to DBpedia. Given a textual statement, the output provides an RDF representation of entities interacting through relationships associated with properties of an ontology. Note that there may be further information in the input not represented in the output, such as that the entity "Bryan Lee Cranston" is a person, or that he is particularly known for portraying "Walter White"; the number and nature of the facts extracted depend on many factors, such as the domain, the ontology/KB used, the techniques employed, etc.

Note that Listing 3 exemplifies direct binary relations. Many REL systems rather extract n-ary relations with generic role-based connectives. In Listing 4, we provide a real-world example given by the online FRED demo56; for brevity, we exclude some output triples not directly pertinent to the example. Here we see that relations are rather represented in an n-ary format, where for example, the relation "portrays" is represented as an RDF resource connected to the relevant entities by role-based predicates, where "Walter White" is given by the predicate dul:associatedWith, "Breaking Bad" is given by the predicate fred:in, and "Bryan Lee Cranston" is given by an indirect path of three predicates vn.role:Agent/fred:for−/vn.role:Theme; the type of relation is then given as fred:Portray.

Applications: One of the main applications for REL is KB Population57, where relations extracted from text are added to the KB. For example, REL has been applied to (further) populate KBs in the domains of Medicine [272,236], Terrorism [147], Sports [260], among others. Another important application for REL is to perform Structured Discourse Representation,

56 http://wit.istc.cnr.it/stlab-tools/fred/demo
57 Also known as Ontology Population [49,191,314,61,57]
Listing 4: FRED Relation Extraction example

Input: Bryan Lee Cranston is an American actor. He is known for portraying "Walter White" in the drama series Breaking Bad.

Output (sample):
fred:Bryan_lee_cranston a fred:AmericanActor ;
    dul:hasQuality fred:Male .
fred:AmericanActor rdfs:subClassOf fred:Actor ;
    dul:hasQuality fred:American .
fred:know_1 a fred:Know ;
    vn.role:Theme fred:Bryan_lee_cranston ;
    fred:for fred:thing_1 .
fred:portray_1 a fred:Portray ;
    vn.role:Agent fred:thing_1 ;
    dul:associatedWith fred:walter_white_1 ;
    fred:in fred:Breaking_Bad .
fred:Breaking_Bad a fred:DramaSeries .
fred:DramaSeries dul:associatedWith fred:Drama ;
    rdfs:subClassOf fred:Series .

where arguments implicit in a text are parsed and potentially linked with an ontology or KB to explicitly represent their structure [107,11,111]. REL is also used for Question Answering (Q&A) [299], whose purpose is to answer natural language questions over KBs, where approaches often begin by applying REL on the question text to gain an initial structure [333,317]. Other interesting applications have been to mine deductive inference rules from text [177], or for pattern recognition over text [119], or to verify or provide textual references for existing KB triples [104].

Process: The REL process can vary depending on the particular methodology adopted. Some systems rely on traditional RE processes (e.g., [90,91,222]), where extracted relations are linked to a KB after extraction; other REL systems – such as those based on distant supervision – use binary relations in the KB to identify and generalize patterns from text mentioning the entities involved, which are then used to subsequently extract and link further relations. Generalizing, we structure this section as follows. First, many (mostly distant supervision) REL approaches begin by identifying named entities in the text, either through NER (generating raw mentions) or through EEL (additionally providing KB identifiers). Second, REL requires a method for parsing relations from text, which in some cases may involve using a traditional RE approach. Third, distant supervision REL approaches use existing KB relations to find and learn from example relation mentions, from which general patterns and/or features are extracted and used to generate novel relations. Fourth, an REL approach may apply a clustering procedure to group relations based on hypernymy or equivalence. Fifth, REL approaches – particularly those focused on extracting n-ary relations – must define an appropriate RDF representation to serialize output relations. Finally, in order to link the resulting relations to a given KB/ontology, REL often considers an explicit mapping step to align identifiers.

4.1. Entity Extraction (and Linking)

The first step of REL often consists of identifying entity mentions in the text. Here we distinguish three strategies, where, in general, many works follow the EEL techniques previously discussed in Section 2, particularly those adopting the first two strategies.

– The first strategy is to employ an end-to-end EEL system – such as DBpedia Spotlight [199], Wikifier [95], etc. – to match entities in the raw text. The benefit of this strategy is that KB identifiers are directly identified for subsequent phases.
– The second strategy is to employ a traditional NER tool – often from Stanford CoreNLP [10,233,111,212,222] – and potentially link the resulting mentions to a KB. This strategy has the benefit of being able to identify mentions of emerging entities, allowing to extract relations about entities not already in the KB.
– The third strategy is to skip the NER/EL phase and rather directly apply an off-the-shelf RE/OpenIE tool or an existing dependency-based parser (discussed later) over the raw text to extract relational structures; such structures then embed parsed entity mentions over which EEL can be applied (potentially using an existing EEL system such as DBpedia Spotlight [107] or TagMe [99]). This has the benefit of using established RE techniques and potentially capturing emerging entities; however, such a strategy does not leverage knowledge of existing relations in the KB for extracting relation mentions (since relations are extracted prior to accessing the KB).

In summary, entities may be extracted and linked before or after relations are extracted. Processing entities before relations can help to filter sentences that do not involve relationships between known entities, to find examples of sentences expressing known relations in the KB for training purposes, etc. On the other hand, processing entities after relations allows direct use of traditional RE and OpenIE tools, and may help to extract more complex (e.g., n-ary) relations involving entities that are not supported by a particular EEL approach (e.g., emerging entities), etc. These issues will be discussed further in the sections that follow.

In the context of REL, when extracting and linking entities, Coreference Resolution (CR) plays a very important role. While other EEL applications may not require capturing every coreference in a text – e.g., it may be sufficient to capture that the entity is mentioned at least once in a document for semantic search or annotation tasks – in the context of REL, not capturing coreferences will potentially lose many relations. Consider again Listing 3, where the second sentence begins "He is known for ..."; CR is necessary to understand that "He" refers to "Bryan Lee Cranston", to extract that he portrays "Walter White", and so forth. In Listing 3, the portrays relation is connected (indirectly) to the node identifying "Bryan Lee Cranston"; this is possible because FRED uses Stanford CoreNLP's CR methods. A number of other REL systems [95,114] likewise apply CR to improve recall of relations extracted.

4.2. Parsing Relations

The next phase of REL systems often involves parsing structured descriptions from relation mentions in the text. The complexity of such structures can vary widely depending on the nature of the relation mention, the particular theory by which the mention is parsed, the use of pronouns, and so forth. In particular, while some tools rather extract simple binary relations of the form p(s,o) with a designated subject–predicate–object, others may apply a more abstract semantic representation of n-ary relations with various dependent terms playing various roles.

In terms of parsing more simple binary relations, as mentioned previously, a number of tools use existing OpenIE systems, which apply a recursive extraction of relations from webpages, where extracted relations are used to guide the process of extracting further relations. In this setting, for example, Dutta et al. [91] use NELL [213] and ReVerb [97], Liu et al. [181] use PATTY [222], while Soderland and Mandhani [284] use TextRunner [16] to extract relations; these relations will later be linked with an ontology or KB.

In terms of parsing potentially more complex n-ary relations, a variety of methods can be applied. A popular method is to begin with a dependency-based parse of the relation mention. For example, Grafia [107] uses a Stanford PCFG parser to extract dependencies in a relation mention, over which CR and EEL are subsequently applied. Likewise, other approaches using a dependency parser to extract an initial syntactic structure from relation mentions include DeepDive [233], PATTY [222], Refractive [96] and works by Mintz et al. [212], Nguyen and Moschitti [232], etc.

Other works rather apply higher-level theories of language understanding to the problem of modeling relations. One such theory is that of frame semantics [101], which considers that people understand sentences by recalling familiar structures evoked by a particular word; a common example is that of the term "revenge", which evokes a structure involving various constituents, including the avenger, the retribution, the target of revenge, the original victim being avenged, and the original offense. These structures are then formally encoded as frames, categorized by the word senses that evoke the frame, encapsulating the constituents as frame elements. Various collections of frames have then been defined – with FrameNet [14] being a prominent example – to help identify frames and annotate frame elements in text. Such frames can be used to parse n-ary relations, as used for example by Refractive [96], PIKES [61] or Fact Extractor [104].

A related theory used to parse complex n-ary relations is that of Discourse Representation Theory (DRT) [155], which offers a more logic-based perspective for reasoning about language. In particular, DRT is based on the idea of Discourse Representation Structures (DRS), which offer a first-order-logic (FOL) style representation of the claims made in language, incorporating n-ary relations, and even allowing negation, disjunction, equalities, and implication. The core idea is to build up a formal encoding of the claims made in a discourse spanning multiple sentences where the equality operator, in particular, is used to model coreference across sentences. These FOL style formulae are contextualized as boxes that indicate conjunction.

Tools such as Boxer [28] then allow for extracting such DRS "boxes" following a neo-Davidsonian representation, which at its core involves describing events. Consider the example sentence "Barack Obama met Raul Castro in Cuba"; we could consider representing this as meet(BO, RC, CU) with BO denoting "Barack Obama", etc.58 Now consider "Barack Obama met with Raul Castro in 2016"; if we represent this as meet(BO, RC, 2016), we see that the meaning of the third argument conflicts with the role of CU as a location earlier even though both are prefixed by

58 Here we use a rather distinct representation of arguments in the relation for space/visual reasons and to follow the notation used by Boxer (which is based on variables).
[181] use met Raul Castro in Cuba”; we could consider rep- PATTY [222], while Soderland and Mandhani [284] resenting this as meet(BO, RC, CU) with BO denoting use TextRunner [16] to extract relations; these rela- “Barack Obama”, etc.58 Now consider “Barack Obama tions will later be linked with an ontology or KB. met with Raul Castro in 2016”; if we represent In terms of parsing potentially more complex n-ary this as meet(BO, RC,), we see that the meaning of relations, a variety of methods can be applied. A pop- the third argument conflicts with the role of CU ular method is to begin with a dependency-based parse as a location earlier even though both are prefixed by of the relation mention. For example, Grafia [107] uses a Stanford PCFG parser to extract dependencies in a 58Here we use a rather distinct representation of arguments in the relation mention, over which CR and EEL are subse- relation for space/visual reasons and to follow the notation used by quently applied. Likewise, other approaches using a Boxer (which is based on variables). 42 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web the preposition “in”. Instead, we will create an exis- flurry of later refinements and extensions. The core tential operator to represent the meeting; considering hypothesis behind this method is that given two en- “Barack Obama met briefly with Raul Castro tities with a known relation in the KB, sentences in 2016 while in Cuba”, we could write (e.g.): in which both entities are mentioned in a text are likely to also mention the relation. Hence, given a KB ∃e :meet(e),Agent(e, BO),CoAgent(e, RC), predicate (e.g., dbo:genre), we can consider the set of known binary relations between pairs of entities briefly(e),Theme(e, CU),Time(e,) from the KB (e.g, (dbr:Breaking_Bad,dbr:Drama), (dbr:X_Files,dbr:Science_Fiction), etc.) 
with where e denotes the event being described, essentially that predicate and look for sentences that mention both decomposing the complex n-ary relation into a con- entities, hypothesizing that the sentence offers an ex- junction of unary and binary relations.59 Note that ex- ample of a mention of that relation (e.g., “in the pressions such as Agent(e, BO) are considered as se- drama series Breaking Bad”, or “one of the most mantic roles, contrasted with syntactic roles; if we con- popular Sci-Fi shows was X-Files”). From such sider “Barack Obama met with Raul Castro”, then examples, patterns and features can be generalized to BO has the syntactic role of subject and RC the role find fresh KB relations involving other entities in sim- of object, but if we swap – “Raul Castro met with ilar such mentions appearing in the text. Barack Obama” – while the syntactic roles swap, we The first step for distant supervision methods is to see little difference in semantic meaning: both BO and find sentences containing mentions of two entities that RC play the semantic role of (co-)agents in the event. have a known binary relation in the KB. This step The roles played by members in an event denoted by essentially relies on the EEL process described ear- a verb are then given by various syntactic databases, lier and can draw on techniques from Section2. Note such as VerbNet [164]60 and PropBank [163]. The that examples may be drawn from external documents, Boxer [28] tool then uses VerbNet to create DRS-style where, for example, Sar-graphs [166] proposes to use boxes encoding such neo-Davidsonian representations Bing’s Web search to find documents containing both of events denoted by verbs. In turn, REL tools such entities in an effort to build a large collection of ex- as LODifier [11] and FRED [111] (see Listing4) use ample mentions for known KB relations. In particular, Boxer to extract relations encoded in these DRS boxes. 
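The neo-Davidsonian decomposition above can also be sketched programmatically. The following is a minimal Python illustration of the general idea (our own sketch, not taken from Boxer or any surveyed tool; the function name and data layout are assumptions for illustration): an event with its modifiers and semantic roles is decomposed into a conjunction of unary and binary facts over an existential event variable.

```python
# Illustrative sketch: decompose an n-ary event description into a
# conjunction of unary and binary facts, in the neo-Davidsonian style
# shown above. Role names (Agent, CoAgent, ...) mirror the example in
# the text; the function and representation are our own.

def decompose_event(verb, modifiers, roles, event_var="e"):
    """Return unary/binary facts for an existentially quantified event."""
    facts = [(verb, (event_var,))]                    # e.g., meet(e)
    facts += [(m, (event_var,)) for m in modifiers]   # e.g., briefly(e)
    facts += [(role, (event_var, arg))                # e.g., Agent(e, BO)
              for role, arg in roles]
    return facts

facts = decompose_event(
    "meet", ["briefly"],
    [("Agent", "BO"), ("CoAgent", "RC"), ("Theme", "CU"), ("Time", "2016")])

for pred, args in facts:
    print(f"{pred}({', '.join(args)})")  # meet(e), briefly(e), Agent(e, BO), ...
```

The decomposition makes explicit why projecting a simple binary triple back out of such a representation (discussed later in this section) requires selecting which two role fillers act as subject and object.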
4.3. Distant Supervision

There are a number of significant practical shortcomings of using resources such as FrameNet, VerbNet, and PropBank to extract relations. First, being manually-crafted, they are not necessarily complete for all possible relations and syntactic patterns that one might consider and, indeed, are often only available in English. Second, the parsing method involved may be quite costly to run over all sentences in a very large corpus. Third, the relations extracted are complex and may not conform to the typically binary relations in the KB; creating a posteriori mappings may be non-trivial.

An alternative data-driven method for extracting relations – based on distant supervision^61 – has thus become increasingly popular in recent years, with a seminal work by Mintz et al. [212] leading to a flurry of later refinements and extensions. The core hypothesis behind this method is that given two entities with a known relation in the KB, sentences in which both entities are mentioned in a text are likely to also mention the relation. Hence, given a KB predicate (e.g., dbo:genre), we can consider the set of known binary relations between pairs of entities from the KB (e.g., (dbr:Breaking_Bad, dbr:Drama), (dbr:X_Files, dbr:Science_Fiction), etc.) with that predicate and look for sentences that mention both entities, hypothesizing that the sentence offers an example of a mention of that relation (e.g., “in the drama series Breaking Bad”, or “one of the most popular Sci-Fi shows was X-Files”). From such examples, patterns and features can be generalized to find fresh KB relations involving other entities in similar such mentions appearing in the text.

^61 Also known as weak supervision [140].

The first step for distant supervision methods is to find sentences containing mentions of two entities that have a known binary relation in the KB. This step essentially relies on the EEL process described earlier and can draw on techniques from Section 2. Note that examples may be drawn from external documents, where, for example, Sar-graphs [166] proposes to use Bing's Web search to find documents containing both entities in an effort to build a large collection of example mentions for known KB relations. In particular, being able to draw from more examples allows for increasing the precision and recall of the REL process by finding better quality examples for training [233].

Once a list of sentences containing pairs of entities is extracted, these sentences need to be analyzed to extract patterns and/or features that can be applied to other sentences. For example, as a set of lexical features, Mintz et al. [212] propose to use the sequence of words between the two entities, to the left of the first entity and to the right of the second entity; a flag to denote which entity came first; and a set of POS tags. Other features proposed in the literature include matching the label of the KB property (e.g., dbo:birthPlace – “Birth Place”) and the relation mention for the associated pair of entities (e.g., “was born in”) [181]; the number of words between the two entity mentions [115]; the frequency of n-grams appearing in the text window surrounding both entities, where more frequent n-grams (e.g., “was born in”) are indicative of general patterns rather than specific details for the relation of a particular pair of entities (e.g., “prematurely in a taxi”) [222]; etc.

Aside from shallow lexical features, systems often parse the example relations to extract syntactic dependencies between the entities. A common method, again proposed by Mintz et al. [212] in the context of distant supervision, is to consider dependency paths, which are (shortest) paths in the dependency parse tree between the two entities; they also propose to include window nodes – terms on either side of the dependency path – as a syntactic feature to capture more context. Both the lexical and syntactic features proposed by Mintz et al. were then reused in a variety of subsequent related works using distant supervision, including Knowledge Vault [84], DeepDive [233], and many more besides.

Once a set of features is extracted from the relation mentions for pairs of entities with a known KB relation, the next step is to generalize and apply those features for other sentences in the text. Mintz et al. [212] originally proposed to use a multi-class logistic regression classifier: for training, the approach extracts all features for a given entity pair (with a known KB relation) across all sentences in which that pair appears together, which are used to train the classifier for the original KB relation; for classification, all entities are identified by Stanford NER, and for each pair of entities appearing together in some sentence, the same features are extracted from all such sentences and passed to the classifier to predict a KB relation between them.

A variety of works followed up on and further refined this idea. For example, Riedel et al. [257] note that many sentences containing the entity pair will not express the KB relation and that a significant percentage of entity pairs will have multiple KB relations; hence combining features for all sentences containing the entity pair produces noise. To address this issue, they propose an inference model based on the assumption that, for a given KB relation between two entities, at least one sentence (rather than all) will constitute a true mention of the relation; this is realized by introducing a set of binary latent variables for each such sentence to predict whether or not that sentence expresses the relation. Subsequently, for the MultiR system, Hoffmann et al. [140] proposed a model further taking into consideration that relation mentions may overlap, meaning that a given mention may simultaneously refer to multiple KB relations; this idea was later refined by Surdeanu et al. [291], who proposed a similar multi-instance multi-label (MIML-RE) model capturing the idea that a pair of entities may have multiple relations (labels) in the KB and may be associated with multiple relation mentions (instances) in the text.

One complication for distant supervision is that KBs are necessarily incomplete and thus should be interpreted under an Open World Assumption (OWA): just because a relation is not in a KB, it does not mean that it is not true. Likewise, for a relation mention involving a pair of entities, if that pair does not have a given relation in the KB, it should not be considered as a negative example for training. Hence, to generate useful negative examples for training, the approach by Surdeanu et al. [291], Knowledge Vault [84], the approach by Min et al. [210], etc., propose a heuristic called (in [84]) a Local Closed World Assumption (LCWA), which assumes that if a relation p(s,o) exists in the KB, then any relation p(s,o′) not in the KB is a negative example; e.g., if born(BO, US) exists in the KB, then born(BO, X) should be considered a negative example assuming it is not in the KB. While obviously this is far from infallible – working well for functional-esque properties like capital but less well for often multi-valued properties like child – it has proven useful in practice [291,210,84]; even if it produces false negative examples, it will produce far fewer than considering any relation not in the KB as false, and the benefit of having true negative examples amortizes the cost of potentially producing false negatives.

A further complication in distant supervision is with respect to noise in automatically labeled relation mentions caused, for example, by incorrect EEL results where entity mentions are linked to an incorrect KB identifier. To tackle this issue, a number of DS-based approaches include a seed selection process to try to select high-quality examples and reduce noise in labels. Along these lines, for example, Augenstein et al. [10] propose to filter DS-labeled examples involving ambiguous entities; for example, the relation mention “New York is a state in the U.S.” may be discarded since “New York” could be mistakenly linked to the KB identifier for the city, which may lead to a noisy example for a KB property such as has-city.

Other approaches based on distant supervision rather propose to extract generalized patterns from relation mentions for known KB relations. Such systems include BOA [115] and PATTY [222], which extract sequences of tokens between entity pairs with a known KB relation, replacing the entity pairs with (typed) variables to create generalized patterns associated with that relation, extracting features used to filter low-quality patterns; an example pattern in the case of PATTY would be “…”.
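The core distant-supervision labeling step described above can be made concrete with a minimal sketch. The following Python fragment is our own illustration (not the BOA or PATTY implementation, and deliberately simplified: exact string matching, subject before object, no typing of variables beyond placeholder names): given entity pairs with a known KB relation, it finds sentences mentioning both entities and generalizes the intervening tokens into a pattern.

```python
# Minimal distant-supervision sketch (illustrative only): label sentences
# using known KB entity pairs, then generalize the tokens between the two
# entity mentions into a pattern with variables, in the style of BOA/PATTY.

def extract_patterns(kb_pairs, sentences, svar="?s", ovar="?o"):
    """kb_pairs: set of (subject, object) surface forms that hold a known
    KB relation; returns one pattern per matching sentence."""
    patterns = []
    for sent in sentences:
        for s, o in kb_pairs:
            if s in sent and o in sent:
                left, right = sent.index(s), sent.index(o)
                if left < right:  # only handle subject-before-object mentions
                    middle = sent[left + len(s):right].strip()
                    patterns.append(f"{svar} {middle} {ovar}")
    return patterns

kb_pairs = {("Barack Obama", "Hawaii")}  # e.g., known dbo:birthPlace pair
sentences = ["Barack Obama was born in Hawaii in 1961."]
print(extract_patterns(kb_pairs, sentences))  # ['?s was born in ?o']
```

A real system would additionally link mentions via EEL rather than string matching, and would filter low-quality patterns using frequency-based features, as discussed above.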
We also highlight a more recent trend towards alternative distant supervision methods based on embeddings (e.g., [312,324,178]). Such approaches have the benefit of not relying on NLP-based parsing tools, but rather relying on distributional representations of words, entities and/or relations in a fixed-dimensional vector space that, rather than producing a discrete parse-tree structure, provides a semantic representation of text in a (continuous) numeric space. Approaches such as that proposed by Lin et al. [178] go one step further: rather than computing embeddings only over the text, such approaches also compute embeddings for the structured KB, in particular, the KB entities and their associated properties; these KB embeddings can be combined with textual embeddings to compute, for example, similarity between relation mentions in the text and relations in the KB.

We remark that tens of other DS-based approaches have recently been published in the linguistic community using Semantic Web KBs, most often using Freebase as a reference KB and taking an evaluation corpus from the New York Times (originally compiled by Riedel et al. [257]). While strictly speaking such works would fall within the scope of this survey, upon inspection, many do not provide any novel use of the KB itself, but rather propose refinements to the machine learning methods used. Hence we consider further discussion of such approaches as veering away from the core scope of this survey, particularly given their number. Herein, rather than enumerating all works, we have instead captured the seminal works and themes in the area of distant supervision for REL; for further details on distant supervision for REL in a Semantic Web setting, we instead refer the interested reader to the Ph.D. dissertation of Augenstein [9].

4.4. Relation Clustering

Relation mentions extracted from the text may refer to the same KB relation using different terms, or may imply the existence of a KB relation through hypernymy/sub-property relations. For example, mentions of the form “X is married to Y”, “X is the spouse of Y”, etc., can be considered as referring to the same KB property (e.g., dbo:spouse), while a mention of the form “X is the husband of Y” can likewise be considered as referring to that KB property, though in an implied form through hypernymy. Some REL approaches thus apply an analysis of such semantic relations – typically synonymy or hypernymy – to cluster textual mentions, where external resources – such as WordNet, FrameNet, VerbNet, PropBank, etc. – are often used for such purposes. These clustering techniques can then be used to extend the set of mentions/patterns that map to a particular KB relation.

An early approach applying such clustering was Artequakt [3], which leverages WordNet knowledge – specifically synonyms and hypernyms – to detect which pairs of relations can be considered equivalent or more specific than one another. A more recent version of such an approach is proposed by Gerber et al. [114] in the context of their RdfLiveNews system, where they define a similarity measure between relation patterns composed of a string similarity measure and a WordNet-based similarity measure, as well as the domain(s) and range(s) of the target KB property associated with the pattern; thereafter, a graph-based clustering method is applied to group similar patterns, where within each group, a similarity-based voting mechanism is used to select a single pattern deemed to represent that group. A similar approach was employed by Liu et al. [181] for clustering mentions, combining a string similarity measure and a WordNet-based measure; however, they note that WordNet is not suitable for capturing similarity between terms with different grammatical roles (e.g., “spouse”, “married”), where they propose to combine WordNet with a distributional-style analysis of Wikipedia to improve the similarity measure. Such a technique is also used by Dutta et al. [91] for clustering relation mentions, using a Jaccard-based similarity measure for keywords and a WordNet-based similarity measure for synonyms; these measures are used to create a graph over which Markov clustering is run.

An alternative clustering approach for generalized relation patterns is to instead consider the sets of entity pairs that each such pattern captures. Soderland and Mandhani [284] propose a clustering of patterns based on such an idea: if one pattern captures a (near) subset of the entity pairs that another pattern captures, they consider the former pattern to be subsumed by the latter and consider the former pattern to infer relations pertaining to the latter. A similar approach is proposed by Nakashole et al. [222] in the context of their PATTY system, where subsumption of relation patterns is likewise computed based on the sets of entity pairs that they capture; to enable scalable computation, the authors propose an implementation based on the MapReduce framework. Another approach along these lines – proposed by Riedel et al. [258] – is to construct what the authors call a universal schema, which involves creating a matrix that maps pairs of entities to KB relations and relation patterns associated with them (be it from training or test data); over this matrix, various models are proposed to predict the probability that a given relation holds between a pair of entities given the other KB relations and patterns the pair has been (probabilistically) assigned in the matrix.
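The entity-pair-based subsumption idea just described can be illustrated with a small sketch (our own, assuming each pattern is given together with the set of entity pairs it captures; the threshold and function names are illustrative, not taken from the cited systems):

```python
# Illustrative sketch of pattern subsumption over supporting entity pairs:
# a pattern is subsumed by another if (nearly) all of the entity pairs it
# captures are also captured by the other pattern.

def subsumes(general_pairs, specific_pairs, threshold=0.9):
    """True if the 'general' pattern covers at least `threshold` of the
    'specific' pattern's entity pairs."""
    if not specific_pairs:
        return False
    overlap = len(specific_pairs & general_pairs)
    return overlap / len(specific_pairs) >= threshold

spouse = {("A", "B"), ("C", "D"), ("E", "F")}   # e.g., "?x's spouse ?y"
husband = {("A", "B"), ("C", "D")}              # e.g., "?x's husband ?y"

print(subsumes(spouse, husband))  # True: the husband pattern is subsumed
print(subsumes(husband, spouse))  # False
```

In practice such checks are run over very large pattern sets, which is why PATTY resorts to a MapReduce-based implementation for the pairwise comparisons.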
4.5. RDF Representation

In order to populate Semantic Web KBs, it is necessary for the REL process to represent output relations as RDF triples. In the case of those systems that produce binary relations, each such relation will typically be represented as an RDF triple unless additional annotations about the relation – such as its provenance – are also captured. In the case of systems that perform EEL and a DS-style approach, it is furthermore the case that new IRIs typically will not need to be minted, since the EEL process provides subject/object IRIs while the DS labeling process provides the predicate IRI from the KB; this has the benefit of also directly producing RDF triples under the native identifier scheme of the KB. However, for systems that produce n-ary relations – e.g., according to frames, DRT, etc. – in order to populate the KB, an RDF representation must be defined. Some systems go further and provide RDFS/OWL axioms that enrich the output with well-defined semantics for the terms used [111].

The first step towards generating an RDF representation is to mint new IRIs for the entities and relations extracted. The BOA [115] framework proposes to first apply Entity Linking using a DS-style approach (where predicate IRIs are already provided), where for emerging entities not found in the KB, IRIs are minted based on the mention text. The FRED system [111] likewise begins by minting IRIs to represent all of the elements, roles, etc., produced by the Boxer DRT-based parser, thus skolemizing the events: grounding the existential variables used to denote such events with a constant (more specifically, an IRI).

Next, an RDF representation must be applied to structure the relations into RDF graphs. In cases where binary relations are not simply represented as triples, existing mechanisms for RDF reification – namely RDF n-ary relations, RDF reification, singleton properties, named graphs, etc. (see [133,271] for examples and more detailed discussion) – can, in theory, be adopted. In general, however, most systems define bespoke representations (most similar to RDF n-ary relations). Among these, Freitas et al. [107] propose a bespoke RDF-based discourse representation format that they call Structured Discourse Graphs, capturing the subject, predicate and object of the relation, as well as (general) reification and temporal annotations; LODifier [11] maps Boxer output to RDF by mapping unary relations to rdf:type triples and binary relations to triples with a custom predicate, using RDF reification to represent the disjunction, negation, etc., present in the DRS output; FRED [111] applies an n-ary–relation-style representation of the DRS-based relations extracted by Boxer, likewise mapping unary relations to rdf:type triples and binary relations to triples with a custom predicate (see Listing 4); etc.

Rather than creating a bespoke RDF representation, other systems rather try to map or project extracted relations directly to the native identifier scheme and data model of the reference KB. Likewise, those systems that first create a bespoke RDF representation may apply an a posteriori mapping to the KB/ontology. Such methods for performing mappings are now discussed.

4.6. Relation mapping

While in a distant supervision approach the patterns and features extracted from textual relation mentions are directly associated with a particular (typically binary) KB property, REL systems based on other extraction methods – such as parsing according to legacy OpenIE systems, or frames/DRS theory – are still left to align the extracted relations with a given KB.

A common approach – similar to distant supervision – is to map pairs of entities in the parsed relation mentions to the KB to identify what known relations correspond to a given relation pattern.^62 This process is more straightforward when the extracted relations are already in a binary format, as produced, for example, by OpenIE systems. Dutta et al. [91] apply such an approach to map the relations extracted by OpenIE systems – namely the NELL and ReVerb tools – to DBpedia properties: the entities in triples extracted from such OpenIE systems are mapped to DBpedia by an EEL process, where existing KB relations are fed into an association-rule mining process to generate candidate mappings for a given OpenIE predicate and pair of entity-types. These rules are then applied over clusters of OpenIE relations to generate fresh DBpedia triples.

^62 More specifically, we distinguish between distant supervision approaches that use KB entities and relations to extract relation mentions (as discussed previously), and the approaches here, which extract such mentions without reference to the KB and rather map to the KB in a subsequent step, using matches between existing KB relations and parsed mentions to propose candidate KB properties.
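The general idea of proposing candidate KB properties from matches between existing KB relations and parsed mentions can be sketched as follows. This is our own minimal illustration of the principle (it simply counts co-occurrences, and is not the association-rule mining process of Dutta et al.; the IRIs shown are familiar DBpedia-style examples used for illustration):

```python
# Illustrative sketch: propose candidate KB properties for one OpenIE
# predicate by counting how often its (entity-linked) subject/object pairs
# already appear under each KB property.
from collections import Counter

def candidate_properties(openie_pairs, kb_triples):
    """openie_pairs: set of (s, o) entity IRIs extracted for a single
    OpenIE predicate; kb_triples: iterable of (s, p, o) KB triples."""
    counts = Counter(p for s, p, o in kb_triples if (s, o) in openie_pairs)
    return counts.most_common()  # ranked candidate properties

kb = [("dbr:Barack_Obama", "dbo:birthPlace", "dbr:Hawaii"),
      ("dbr:Ada_Lovelace", "dbo:birthPlace", "dbr:London"),
      ("dbr:Ada_Lovelace", "dbo:deathPlace", "dbr:London")]
# pairs extracted for, say, the OpenIE predicate "was born in":
pairs = {("dbr:Barack_Obama", "dbr:Hawaii"), ("dbr:Ada_Lovelace", "dbr:London")}

print(candidate_properties(pairs, kb))
# dbo:birthPlace is the strongest candidate (2 matches vs. 1)
```

As the dbo:deathPlace match shows, ambiguous entity pairs can suggest spurious candidates, which is why real systems combine such counts with entity-type constraints and rule-confidence thresholds.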
In the case of systems that natively extract n-ary relations – e.g., those systems based on frames or DRS – the process of mapping such relations to a binary KB relation – sometimes known as projection of n-ary relations [166] – is considerably more complex. Rather than trying to project a binary relation from an n-ary relation, some approaches thus rather focus on mapping elements of n-ary relations to classes in the KB. Such an approach is adopted by Gerber et al. [114] for mapping elements of binary relations extracted via learned patterns to DBpedia entities and classes. The FRED system [111] likewise provides mappings of its DRS-based relations to various ontologies and KBs, including the WordNet and DOLCE ontologies (using WSD) and the DBpedia KB (using EEL).

On the other hand, other systems do propose techniques for projecting binary relations from n-ary relations and linking them with KB properties; such a process must not only identify the pertinent KB property (or properties), but also the subject and object entities for the given n-ary relation; furthermore, for DRS-style relations, care must be taken since the statement may be negated or may be part of a disjunction. Along those lines, Exner and Nugues [95] initially proposed to generate triples from DRS relations by means of a combinatorial approach, filtering relations expressed with negation. In follow-up work on the Refractive system, Exner and Nugues [96] later propose a method to map n-ary relations extracted using PropBank to DBpedia properties: existing relations in the KB are matched to extracted PropBank roles such that more matches indicate a better property match; thereafter, the subject and object of the KB relation are generalized to their KB class (used to identify subject/object in extracted relations), and the relevant KB property is proposed as a candidate for other instances of the same role (without a KB relation) and pairs of entities matching the given types. Legalo [251] proposes a method for mapping FRED results to binary KB relations by concatenating the labels of nodes on paths in the FRED output between elements identified as (potential) subject/object pairs, where these concatenated path labels are then mapped to binary KB properties to project new RDF triples. Rouces et al. [271], on the other hand, propose a rule-based approach to project binary relations from FrameNet patterns, where dereification rules are constructed to map suitable frames to binary triples by mapping frame elements to subject and object positions, creating a new predicate from appropriate conjugate verbs, and further filtering passive verb forms with no clear binary relation.

4.7. System Summary and Comparison

Based on the previously discussed techniques, an overview of the highlighted REL systems is provided in Table 6, with a column legend provided in the caption. As before, we highlight approaches that are directly related with the Semantic Web and that offer a peer-reviewed publication with novel technical details regarding REL. With respect to the Entity Recognition column, note that many approaches delegate this task to external tools and systems – such as DBpedia Spotlight [199], GATE [69], Stanford CoreNLP [186], TagMe [99], Wikifier [255], etc. – which are mentioned in the respective column.

Choosing an RE strategy for a particular application scenario can be complex given that every approach has pros and cons regarding the application at hand. However, we can identify some key considerations that should be taken into account:

– Binary vs. n-ary: Does the application require binary relations or does it require n-ary relations? Oftentimes the results of systems that produce binary relations can be easier to integrate with existing KBs already composed of such, where DS-based approaches, in particular, will produce triples using the identifier scheme of the KB itself. On the other hand, n-ary relations may capture more nuances in the discourse implicit in the text, for example, capturing semantic roles, negation, disjunction, etc., in complex relations.

– Identifier creation: Does the application require finding and identifying new instances in the text not present in the KB? Does it require finding and identifying emerging relations? Most DS-based approaches do not consider minting new identifiers but rather focus on extracting new triples within the KB's universe (the sets of identifiers it provides). However, there are some exceptions, such as BOA [115]. On the other hand, most REL systems dealing with n-ary relations mint new IRIs as part of their output representation.

– Language: Does the application require extraction for a language other than English? Though not discussed previously, we note that almost all systems presented here are evaluated only for English corpora, the exceptions being BOA [115], which is tested for both English and German text, and the work by Fossati et al. [104], which is tested for Italian text. Thus in scenarios involving other languages, it is important to consider to what extent an approach relies on a language-specific technique, such as POS-tagging, dependency parsing, etc.
Table 6: Overview of Relation Extraction and Linking systems. Entity Recognition denotes the NER or EEL strategy used; Parsing denotes the method used to parse relation mentions (Cons.: Constituency Parsing, Dep.: Dependency Parsing, DRS: Discourse Representation Structures, Emb.: Embeddings, SRL: Semantic Role Labeling); PS refers to the Property Selection method (PG: Property Generation, RM: Relation Mapping, DS: Distant Supervision); Rep. refers to the reification model used for representation (SR: Standard Reification, BR: Binary Relation); KB refers to the main knowledge-base used; “—” denotes no information found, not used or not applicable.

System               | Year | Entity Recognition | Parsing               | PS      | Rep.  | KB                  | Domain
---------------------|------|--------------------|-----------------------|---------|-------|---------------------|----------------
Artequakt [3]        | 2003 | GATE               | Patterns              | RM      | BR    | Artists ontology    | Artists
AugensteinMC [10]    | 2016 | Stanford           | Features              | DS      | BR    | Freebase            | Open
BOA [115]            | 2012 | DBpedia Spotlight  | Patterns, Features    | DS      | BR    | DBpedia             | News, Wikipedia
DeepDive [233]       | 2012 | Stanford           | Dep., Features        | DS      | BR    | Freebase            | Open
DuttaMS [90,91]      | 2015 | Keyword            | OpenIE                | DS      | BR    | DBpedia             | Open
ExnerN [95]          | 2012 | Wikifier           | Frames                | DS      | BR    | DBpedia             | Wikipedia
Fact Extractor [104] | 2017 | Wiki Machine       | Frames                | DS      | n-ary | DBpedia             | Football
FRED [111]           | 2016 | Stanford, TagMe    | DRS                   | PG / RM | n-ary | DBpedia / BabelNet  | Open
Graphia [107]        | 2012 | DBpedia Spotlight  | Dep.                  | PG      | SR    | DBpedia             | Wikipedia
Knowledge Vault [84] | 2014 | —                  | Features              | DS      | BR    | Freebase            | Open
LinSLLS [178]        | 2016 | Stanford           | Emb.                  | DS      | BR    | Freebase            | News
LiuHLZLZ [181]       | 2013 | Stanford           | Dep., Features        | DS      | BR    | YAGO                | News
LODifier [11]        | 2012 | Wikifier           | DRS                   | RM      | SR    | WordNet             | Open
MIML-RE [291]        | 2012 | Stanford           | Dep., Features        | DS      | BR    | Freebase            | News, Wikipedia
MintzBSJ [212]       | 2009 | Stanford           | Dep., Features        | DS      | BR    | Freebase            | Wikipedia
MultiR [140]         | 2011 | Stanford           | Dep., Features        | DS      | BR    | Freebase            | Wikipedia
Nebhi [226]          | 2013 | GATE               | Patterns, Dep.        | DS      | BR    | DBpedia             | News
NguyenM [232]        | 2011 | —                  | Dep., Cons.           | DS      | BR    | YAGO                | Wikipedia
PATTY [222]          | 2013 | Stanford           | Dep., Patterns        | RM      | —     | YAGO                | Wikipedia
PIKES [61]           | 2016 | DBpedia Spotlight  | SRL                   | RM      | n-ary | DBpedia             | Open
PROSPERA [221]       | 2011 | Keyword            | Patterns              | RM      | BR    | YAGO                | Open
RdfLiveNews [114]    | 2013 | DBpedia Spotlight  | Patterns              | PG / DS | BR    | DBpedia             | News
Refractive [96]      | 2014 | Stanford           | Frames                | DS      | SR    | —                   | Wikipedia
RiedelYM [257]       | 2010 | Stanford           | Dep., Features        | DS      | BR    | Freebase            | News
Sar-graphs [166]     | 2016 | Dictionary         | Dep.                  | DS      | —     | Freebase / BabelNet | Open
TakamatsuSN [292]    | 2012 | Hyperlinks         | Dep.                  | DS      | BR    | Freebase            | Wikipedia
Wsabie [312]         | 2013 | Stanford           | Dep., Features, Emb.  | DS      | BR    | Freebase            | News
Unfortunately, given the complexity of REL, most works are heavily reliant on such language-specific components. Possible solutions include trying to replace the particular component with its equivalent in another language (which has no guarantees to work as well as those tested in evaluation), or, as proposed for the FRED [111] tool, using an existing API (e.g., Bing, Google, etc.) to translate the text to the supported language (typically English), with the obvious caveat of the potential for translation errors (though such services are continuously improving in parallel with, e.g., Deep Learning).

– Scale & Efficiency: In applications dealing with large corpora, scalability and efficiency become crucial considerations. With some exceptions, most of the approaches do not explicitly tackle the question of scale and efficiency. On the other hand, REL should be highly parallelizable, given that processing of different sentences, paragraphs and/or documents can be performed independently assuming some globally-accessible knowledge from the KB. Parallelization has been used, e.g., by Nakashole et al. [222], who cluster relational patterns using a distributed MapReduce framework. Indeed, initiatives such as Knowledge Vault – using standard DS-based REL techniques to extract 1.6 billion triples from a large-scale Web corpus – provide a practical demonstration that, with careful engineering and selection of techniques, REL can be applied to corpora at a very large (potentially Web) scale.

– Various other considerations, such as availability or licensing of software, provision of an API, etc., may also need to be taken into account.

Of course, a key consideration when choosing an REL approach is the quality of output produced by that approach, which can be assessed using the evaluation protocols discussed in the following section.

4.8. Evaluation

REL is a challenging task, where evaluation is likewise complicated by a number of fundamental factors. In general, human judgment is often required to assess the quality of the output of systems performing such a task, but such assessments can often be subjective. Creating a gold-standard dataset can likewise be complicated, particularly for those systems producing n-ary relations, requiring an expert informed on the particular theory by which such relations are extracted; likewise, in DS-related scenarios, the expert must label the data according to the available KB relations, which may be a tedious task requiring in-depth knowledge of the KB. Rather than creating a gold-standard dataset, another approach is to apply a posteriori assessment of the output by human judges, i.e., run the process over unlabeled text, generate relations, and have the output validated by human judges; while this would appear more reasonable for systems based on frames or DRS – where creating a gold standard for such complex relations would be arduous at best – there are still problems in assessing, for example, recall.^63 Rather than relying on costly manual annotations, some systems rather propose automated methods of evaluation based on the KB, where, for example, parts of the KB are withheld and then experiments are conducted to see if the tool can reinstate the withheld facts or not; however, such approaches offer rather approximate evaluation, since the KB is incomplete, the text may not even mention the withheld triples, and so forth.

^63 Likewise, we informally argue that a human judge presented with results of a system is more likely to confirm that output and give it the benefit of subjectivity, especially when compared with the creation of a gold-standard dataset, where there is more freedom in choice of relations and more ample opportunity for subjectivity.

In summary, then, approaches for evaluating REL are quite diverse and in many cases there are no standard criteria for assessing the adequacy of a particular evaluation method. Here we discuss some of the main themes for evaluation, broken down by the datasets used, how evaluators are employed to judge the output, how automated evaluation can be conducted, and what are the typical metrics considered.

Datasets: Most approaches consider REL applied to general-domain corpora, such as Wikipedia articles, newspaper articles, or even webpages. However, to simplify evaluation, many approaches may restrict REL to consider a domain-specific subset of such corpora, a fixed subset of KB properties or classes, and so forth. For example, Fossati et al. [104] focus their REL efforts on the Wikipedia articles about Italian soccer players using a selection of relevant frames; Augenstein et al. [10] apply evaluation for relations pertaining to entities in seven Freebase classes for which relatively complete information is available, using the Google Search API to find relevant documents for each such entity; and so forth.

A number of standard evaluation datasets have, however, emerged, particularly for approaches based on distant supervision. A widely reused gold-standard dataset, for example, was that initially proposed by Riedel et al. [257] for evaluating their system, where they select Freebase relations pertaining to people, businesses and locations (corresponding also to NER types) and then link them with New York Times articles, first using Stanford NER to find entities, then linking those entities to Freebase, and finally selecting the appropriate relation (if any) with which to label pairs of entities in the same sentence; this dataset was later reused by a number of works [140,291,258]. Other such evaluation resources have since emerged. Google Research^64 provides five REL corpora, with relation mentions from Wikipedia linked by manual annotation to five Freebase properties indicating institutions, date of birth, place of birth, place of death, and education degree. Likewise, the Text Analysis Conference often hosts a Knowledge Base Population (TAC–KBP) track, where evaluation resources relating to the REL task can be found^65; such resources have been used and further enhanced, for example, by Surdeanu et al. [291] for their evaluation (whose dataset was in turn used by other works, e.g., by Min et al. [210], DeepDive [233], etc.). Another such initiative is hosted at the European Semantic Web Conference, where the Open Knowledge Extraction challenge (ESWC–OKE) has hosted materials relating to REL using RDFa annotations on webpages as labeled data.^66

Note that all of the prior evaluation datasets relate to binary relations of the form subject–predicate–object. Creating gold-standard datasets for n-ary relations is complicated by the heterogeneity of representations that can be employed in terms of frames, DRS or other theories used. To address this issue, Gangemi et al. [112] proposed the construction of RDF graphs by means of logical patterns known as motifs.

Judges: The output of REL systems is often assessed by humans. Many papers assign experts to evaluate the results, typically (we assume) authors of the papers, though often little detail on the exact evaluation process is given, aside from a rater agreement expressed as Cohen's or Fleiss' κ-measure for a fixed number of evaluators (as discussed in Section 3.7).

Aside from expert evaluation, some works leverage crowdsourcing platforms for labeling training and test datasets, where a broad range of users contribute judgments for a relatively low price. Amongst such works, we can mention Mintz et al. [212] using Amazon's Mechanical Turk^67 for evaluating relations, while Legalo [251] and Fossati et al. [104] use the Crowdflower^68 platform (now called Figure Eight).

Automated evaluation: Some works have proposed methods for performing automated evaluation of REL processes, in particular for testing DS-based methods. A common approach is to perform held-out experiments, where KB relations are (typically randomly) omitted from the training/DS phase and then metrics are defined to see how many KB relations are returned by the process, giving an indicator of recall; the intuition of such approaches is that REL is often used for completing an incomplete KB, and thus by holding back KB triples, one can test the process to see how many such triples the process can reinstate. Such an approach avoids expensive manual labeling but is not very suitable for precision, since the KB is incomplete, and likewise assumes that held-out KB relations are both correct and mentioned in the text. On the other hand, such experiments can help gain insights at larger scales for a more diverse range of properties, and can be used to assess a relative notion of precision (e.g., to tune parameters); they have thus been used by Mintz et al. [212], Takamatsu et al. [292], Knowledge Vault [84], Lin et al. [178], etc.
On the other tracted by the FRED tool and thereafter manually cor- hand, as mentioned previously, some works – includ- rected and curated by evaluators to follow best Seman- ing Knowledge Vault [84] – adopt a partial Closed tic Web practices; the result is a corpus annotated by World Assumption as a heuristic to generate negative instances of such motifs that can be reused for evalua- examples taking into account the incompleteness of tion of REL tools producing similar such relations. the KB; more specifically, extracted triples of the form Evaluators: In scenarios for which a gold standard (s, p,o0) are labeled incorrect if (and only if) a triple dataset is not available – or not feasible to create – the (s, p,o) is present in the KB but (s, p,o0) is not. results of the REL process are often directly evaluated Metrics: Standard evaluation measures are typically applied, including precision, recall, F-measure, accu- 64https://code.google.com/archive/p/relation- racy, Area-Under-Curve (AUC–ROC), and so forth. extraction-corpus/downloads However, given that relations may be extracted for 65For example, see https://tac.nist.gov/2017/KBP/ ColdStart/index.html. 66For example, see Task 3: https://github.com/ 67https://www.mturk.com/mturk/welcome anuzzolese/oke-challenge-2016. 68https://www.figure-eight.com/ 50 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web multiple properties, sometimes macro-measures such to perform a held-out evaluation, comparing their ap- as Mean Average Precision (MAP) are applied to proach with that of Mintz et al. [212], Hoffmann et summarize precision across all such properties rather al. [140], and Surdeanu et al. [291], for which source than taking a micro precision measure [212,292,90]. code is available. These results show that fixing 50% Given the subjectivity inherent in evaluating REL, Fos- precision, Mintz et al. achieved 5% recall, Hoffmann sati et al. [104] use a strict and lenient version of et al. 
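The two automated-evaluation heuristics discussed in Section 4.8, held-out recall and partial Closed World Assumption labeling, can be sketched as follows. This is an illustrative sketch of ours over a KB of (s, p, o) triples; the function names and the toy setup are not taken from any surveyed system.

```python
# Illustrative sketch (ours, not from any surveyed system): two automated
# evaluation heuristics for DS-based REL over a KB of (s, p, o) triples.
import random

def held_out_recall(kb_triples, extract, holdout_ratio=0.2, seed=42):
    """Withhold a random share of KB triples from training and measure
    how many of them the extractor reinstates (an indicator of recall)."""
    triples = sorted(kb_triples)
    random.Random(seed).shuffle(triples)
    cut = int(len(triples) * holdout_ratio)
    held_out, training = set(triples[:cut]), set(triples[cut:])
    reinstated = held_out & extract(training)
    return len(reinstated) / len(held_out) if held_out else 0.0

def label_extracted(extracted, kb):
    """Partial Closed World Assumption: an extracted triple (s, p, o2) is a
    negative example iff the KB has some (s, p, o) but not (s, p, o2)."""
    sp_known = {(s, p) for (s, p, _) in kb}
    labels = {}
    for (s, p, o) in extracted:
        if (s, p, o) in kb:
            labels[(s, p, o)] = "positive"
        elif (s, p) in sp_known:
            labels[(s, p, o)] = "negative"  # KB knows (s, p), not this object
        else:
            labels[(s, p, o)] = "unknown"   # KB is silent: leave unlabeled
    return labels
```

For instance, labeling the extracted triple (Rome, capitalOf, France) against a KB containing (Rome, capitalOf, Italy) yields a negative example, while (Rome, locatedIn, Europe) remains unlabeled if the KB holds no locatedIn triple for Rome.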
and Surdeanu et al. achieved 10% recall, while precision/recall/F-measure, where the former requires the best approach by Lin et al. achieved 33% recall. As the relation to be exact and complete, while the latter a general conclusion, these results suggest that there is also considers relations that are partially correct; re- still considerable room for improvement in the area of lating to the same theme, the Legalo system includes REL based on distant supervision. confidence as a measure indicating the level of agree- ment and trust in crowdsourced evaluators for a given 4.9. Summary experiment. Some systems produce confidence or sup- ports for relations, where P@k measures are some- This section presented the task of Relation Ex- times used to measure the precision for the top-k re- traction and Linking in the context of the Semantic sults [212,257,181,178]. Finally, given that REL is Web. The applications for such a task include KB inherently composed of several phases, some works Population, Structured Discourse Representation, Ma- present metrics for various parts of the task; as an chine Reading, Question Answering, Fact Verification, example, for extracted triples, Dutta [91] considers a amongst a variety of others. We discussed relevant pa- property precision (is the mapped property correct?), pers following a high-level process consisting of: en- instance precision (are the mapped subjects/objects tity extraction (and coreference resolution), relation correct?), triple precision (is the extracted triple cor- parsing, distant supervision, relation clustering, RDF rect?), amongst other measures to, for example, indi- representation, relation mapping, and evaluation. It is cate the ratio of extracted triples new to the KB. worth noting, however, that not all systems follow these steps in the presented order and not all systems Third-party comparisons: While some REL papers apply (or even require) all such steps. 
For example, en- include prior state-of-the-art approaches in their evalu- tity extraction may be conducted during relation pars- ations for comparison purposes, we are not aware of a ing (where particular arguments can be considered as third-party study providing evaluation results of REL extracted entities), distant supervision does not require systems. Although Gangemi [110] provides a compar- a formal representation nor relation-mapping phase, ative evaluation of Alchemy, CiceroLite, FRED and and so forth. Hence the presented flow of techniques ReVerb – all with public APIs available – for extract- should be considered illustrative, not prescriptive. ing relations from a paragraph of text on the Syrian In general, we can distinguish two types of REL sys- war, he does not publish results for a linking phase; tems: those that produce binary relations, and those FRED is the only REL tool tested that outputs RDF. that produce n-ary relations (although binary rela- Despite a lack of third-party evaluation results, some tions can subsequently be projected from the latter comparative metrics can be gleaned from the use of tools [271,251]). With respect to binary relations, dis- standard datasets over several papers relating to dis- tant supervision has become a dominating theme in tant supervision; we stress, however, that these are of- recent approaches, where KB relations are used, in ten published in the context of evaluating a particular combination with EEL and often CR, to find exam- system (and hence are not strictly third-party compar- ple mentions of binary KB relations, generalizing pat- 69 isons ). With respect to DS-based approaches, and as terns and features that can be used to extract further previously mentioned, a prominent dataset used is the mentions and, ultimately, novel KB triples; such ap- one proposed by Riedel et al. 
[257], with articles from proaches are enabled by the existence of modern Se- the New York Times corpus annotated with 53 differ- mantic Web KBs with rich factual information about ent types of relations from Freebase; the training set a broad range of entities of general interest. Other contains 18,252 relations, while the test set contains approaches for extracting binary relations rather rely 1,950 relations. Lin et al. [178] then used this dataset on mapping the results of existing OpenIE systems to KBs/ontologies. With respect to extracting n-ary rela- 69We remark that the results of Gangemi [110] are strictly not tions, such approaches rely on more traditional linguis- third-party either due to the inclusion of results from FRED [111]. tic techniques and resources to extract structures ac- J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web 51 cording to frame semantics or Discourse Representa- uation of approaches under more diverse condi- tion Theory; the challenge thereafter is to represent the tions. The problem of creating a reference gold results as RDF and, in particular, to map the results to standard would then depend on the first point, re- an existing KB, ontology, or collection thereof. lating to what types of relations should be tar- geted for extraction from text in domain-specific 4.10. Open Questions and/or open-domain settings, and how the output should be represented to allow comparison with REL is a very important task for populating the Se- the labeled relations for the dataset. mantic Web. Several techniques have been proposed – Evaluation: Existing REL approaches extract dif- for this task in order to cover the extraction of binary ferent outputs relating to particular entity types, and n-ary relations from text. However, some aspects domains, structures, and so on. Thus, evaluating/- could still be improved or developed further: comparing different approaches is not a straight- forward task. 
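As a concrete reference for the ranking measures mentioned in the evaluation discussion (P@k and Mean Average Precision), a minimal sketch follows. This is our own illustration, not code from any surveyed system; the macro MAP here simply averages per-property average precision.

```python
# Illustrative sketch (ours): P@k over a confidence-ranked list of
# extracted triples, and a macro Mean Average Precision (MAP) that
# averages per-property average precision.
def precision_at_k(ranked, gold, k):
    """Fraction of the top-k ranked extractions that appear in the gold set."""
    return sum(1 for t in ranked[:k] if t in gold) / k

def average_precision(ranked, gold):
    """Average of the precision values at the rank of each correct extraction."""
    hits, total = 0, 0.0
    for i, t in enumerate(ranked, start=1):
        if t in gold:
            hits += 1
            total += hits / i
    return total / len(gold) if gold else 0.0

def mean_average_precision(per_property):
    """per_property maps each KB property to a (ranked list, gold set) pair."""
    aps = [average_precision(r, g) for r, g in per_property.values()]
    return sum(aps) / len(aps) if aps else 0.0
```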
Another challenge is to allow for a – Relation types: Unlike EEL where particular more fine-grained evaluation of REL approaches, types of entities are commonly extracted, in REL which are typically complex pipelines involving it is not easy to define the types of relations to various algorithms, resources, and often external be extracted and linked to the Semantic Web. tools, where noisy elements extracted in some Previous studies, such as the one presented by early stage of the process can have a major nega- Storey [289], provide an organization of relations tive effect on the final output, making it difficult – identified from disciplines such as linguistics, to interpret the cause of poor evaluation results or logic, and cognitive psychology – that can be in- the key points that should be improved. corporated into traditional database management systems to capture the semantics of real world information. However, to the best of our knowl- 5. Semi-structured Information Extraction edge, a thorough categorization of semantic rela- tionships on the Semantic Web has not been pre- The primary focus of the survey – and the sections sented, which in turn, could be useful for defin- thus far – is on Information Extraction over unstruc- ing requirements of information representation, tured text. However, the Web is full of semi-structured standards, rules, etc., and their representation in content, where HTML, in particular, allows for demar- existing standards (RDF, RDFS, OWL). cating titles, links, lists, tables, etc., imposing a limited – Specialized Settings/Multilingual REL: In brief, structure on documents. While it is possible to simply we can again raise the open question of adapt- extract the text from such sources and apply previous ing REL to settings with noisy text (such as Twit- methods, the structure available in the source, though ter) and generalizing REL approaches to work limited, can offer useful hints for the IE process. with multiple languages. 
In this context, DS ap- Hence a number of works have emerged propos- proaches may prove to be more successful given ing Information Extraction methods using Semantic that they rely more on statistical/learning frame- Web languages/resources targeted at semi-structured works (i.e., they do not require curated databases sources. Some works are aimed at building or other- of relations, roles, etc., which are typically spe- wise enhancing Semantic Web KBs (where, in fact, cific to a language), and given that KBs such as many of the KBs discussed originated from such a pro- Wikidata, DBpedia and Babelnet can provide ex- cess [170,138]). Other works rather focus on enhanc- amples of relations in multiple languages. ing or annotating the structure of the input corpus us- – Datasets: The preferred evaluation method for ing a Semantic Web KB as reference. Some works the analyzed approaches is through an a posteri- make significant reuse of previously discussed tech- ori manual assessment of represented data. How- niques for plain text – particularly Entity Linking and ever, this is an expensive task that requires human sometimes Relation Extraction – adapted for a par- judges with adequate knowledge of the domain, ticular type of input document structure. Other works language, and representation structures. Although rather focus on custom techniques for extracting infor- there are a couple of labeled datasets already pub- mation from the structure of a particular data source. lished (particularly for DS approaches), the defi- Our goal in this section is thus to provide an nition of further datasets would benefit the eval- overview of some of the most popular techniques and 52 J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web tools that have emerged in recent years for Information user is browsing. 
A use-case is discussed for Extraction over semi-structured sources of data us- such annotation/recommendation in the biomedi- ing Semantic Web languages/resources. Given that the cal domain, where a SKOS taxonomy can be used techniques vary widely in terms of the type of struc- to recommend links to further material on more/- ture considered, we organize this section differently less specific concepts appearing in the text, with from those that came before. In particular, we proceed different types of users (e.g., doctors, the public) by discussing two prominent types of semi-structured receiving different forms of recommended links. sources – markup documents and tables – and discuss DBpedia (2007) [170] is a prominent initiative to ex- works that have been proposed for extracting informa- tract a rich RDF KB from Wikipedia. The main tion from such sources using Semantic Web KBs. source of extracted information comes from the We do not include languages or approaches for semi-structured info-boxes embedded in the top mapping from one explicit structure to another (e.g., right of Wikipedia articles; however, further in- R2RML [73]), nor that rely on manual scraping (e.g., formation is also extracted from abstracts, hyper- Piggy Bank [146]), nor tools that simply apply exist- links, categories, and so forth. While much of ing IE frameworks (e.g., Magpie [92], RDFaCE [158], the extracted information is based on manually- SCMS [229]). Rather we focus on systems that ex- specified mappings for common attributes, com- tract and/or disambiguate entities, concepts, and/or re- ponents are provided for higher-recall but lower- lations from the input sources and that have methods precision automatic extraction of info-box infor- adapted to exploit the partial structure of those sources mation, including recognition of datatypes, etc. (i.e., they do not simply extract and apply IE processes DeVirgilio (2011) [308] uses Keyphrase Extraction to over plain text). 
Again, we only include proposals that semantically annotate webpages, linking key- in some way directly involve a Semantic Web stan- words to DBpedia. The approach breaks web- dard (RDF(S)/OWL/SPARQL, etc.), or a resource de- pages down into “semantic blocks” describing scribed in those standards, be it to populate a Semantic specific elements based on HTML elements; Web KB, or to link results with such a KB. Keyphrase Extraction is the applied over indi- vidual blocks. Evaluation is conducted in the E- 5.1. Markup Documents Commerce domain, adding RDFa annotations us- ing the Goodrelations vocabulary [131]. The content of the Web has traditionally been struc- Epiphany (2011) [2] aims to semantically annotate tured according to the HyperText Markup Language webpages with RDFa, incorporating embedded (HTML), which lays out a document structure for web- links to existing Linked Data KBs. The process is pages to follow. While this structure is primarily per- based on an input KB, where labels of instances, ceived as a way to format, display and offer naviga- classes and properties are extracted. A custom IE tional links between webpages, it can also be – and has pipeline is then defined to chunk text and match been – leveraged in the context of Information Extrac- it with the reference labels, with disambiguation tion. Such structure includes, for example, the pres- performed based on existing relations in the KB ence of hyperlinks, title tags, paths in the HTML parse for resolved entities. Facts from the KB are then tree, etc. Other Web content – such as Wikis – may be matched to the resolved instances and used to em- formatted in markup other than HTML, where we in- clude frameworks for such formats here. We provide bed RDFa annotations in the webpage. an overview of these works in Table7. 
Given that all Knowledge Vault (2014) [84] was discussed before such approaches implement diverse methods that de- in the context of Relation Extraction & Linking pend on the markup structure leveraged, we will not over text. However, the system also includes a discuss techniques in detail. However, we will provide component for extracting features from the struc- more detailed discussion for IE techniques that have ture of HTML pages. More specifically, the sys- been proposed for HTML tables in a following section. tem extracts the Document Object Model (DOM) from a webpage, which is essentially a hierarchi- COHSE (2008) [18] (Conceptual Open Hypermedia cal tree of HTML tags. For relations identified on Service) uses a reference taxonomy to provide the webpage using a DS approach, the path in the personalized semantic annotation and hyperlink DOM tree between both entities (for which an ex- recommendation for the current webpage that a isting KB relation exists) is extracted as a feature. J.L. Martinez-Rodriguez et al. / Information Extraction meets the Semantic Web 53
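The DOM-path feature just described can be illustrated with a small sketch of ours (not Knowledge Vault's actual code): given two entity mentions on a page, it returns the sequence of tag names on the path between them through their lowest common ancestor, assuming well-formed XHTML input.

```python
# Illustrative sketch (ours): extract the path of tag names between two
# entity mentions in a DOM tree, usable as a feature for DS-based relation
# extraction over webpages. Assumes well-formed XHTML and that each
# mention occurs in the text of some element.
import xml.etree.ElementTree as ET

def dom_path(html, mention1, mention2):
    root = ET.fromstring(html)
    # Map each element to its parent, since ElementTree stores no back-links.
    parents = {child: parent for parent in root.iter() for child in parent}
    def find(mention):
        for el in root.iter():
            if el.text and mention in el.text:
                return el
        return None
    def ancestors(el):
        chain = [el]
        while el in parents:
            el = parents[el]
            chain.append(el)
        return chain
    up = ancestors(find(mention1))
    seen = set(up)
    down = []
    for el in ancestors(find(mention2)):
        if el in seen:        # first shared element is the lowest common ancestor
            lca = el
            break
        down.append(el)
    path = up[:up.index(lca) + 1] + list(reversed(down))
    return [el.tag for el in path]
```

For example, for two mentions in adjacent cells of a table row, the extracted feature is the tag sequence td, tr, td.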
Table 7
Overview of Information Extraction systems for Markup Documents. Task denotes the IE task(s) considered (EEL: Entity Extraction & Linking; CEL: Concept Extraction & Linking; REL: Relation Extraction & Linking); Structure denotes the type of document structure leveraged for the IE task; '—' denotes no information found, not used or not applicable.

System                 Year  Task          Source     Domain    Structure     KB
COHSE [18]             2008  CEL           Webpages   Medical   Hyperlinks    Any (SKOS)
DBpedia [170]          2007  EEL/CEL/REL   Wikipedia  Open      Wiki          —
DeVirgilio [308]       2011  EEL/CEL       Webpages   Commerce  HTML (DOM)    DBpedia
Epiphany [2]           2011  EEL/CEL/REL   Webpages   Open      HTML (DOM)    Any (SPARQL)
Knowledge Vault [84]   2014  EEL/CEL/REL   Webpages   Open      HTML (DOM)    Freebase
Legalo [251]           2014  REL           Webpages   Open      Hyperlinks    —
LIEGE [276]            2012  EEL           Webpages   Open      Lists         YAGO
LODIE [113]            2014  EEL/REL       Webpages   Open      HTML (DOM)    Any (SPARQL)
RathoreR [281]         2014  CEL           Wikipedia  Physics   Titles        Custom ontology
YAGO [138]             2007  EEL/CEL/REL   Wikipedia  Open      Wiki          —
Legalo (2014) [251] applies Relation Extraction based on the hyperlinks of webpages that describe entities, with the intuition that the anchor text (or more generalized context) of the hyperlink will contain textual hints about the relation between both entities. More specifically, a frame-based representation of the textual context of the hyperlink is extracted and linked with a KB; next, to create a label for a direct binary relation (an RDF triple), rules are applied on the frame-based representation to concatenate labels on the shortest path, adding event and role tags. The label is then linked to properties in existing vocabularies.

LIEGE (2012) [276] (Link the entIties in wEb lists with the knowledGe basE) performs EEL with respect to YAGO and Wikipedia over the text elements of HTML lists embedded in webpages. The authors propose specific features for disambiguation in the context of such lists where, in particular, the main assumption is that the entities appearing in an HTML list will often correspond to the same concept; this intuition is captured with a similarity-based measure that, for a given list, computes the distance of the types of candidate entities in the class hierarchy of YAGO. Other typical disambiguation features for EEL, such as prior probability, keyword-based similarities between entities, etc., are also applied.

LODIE (2014) [113] propose a method for using Linked Data to perform enhanced wrapper induction: leveraging the often regular structure of webpages on the same website to extract a mapping that serves to extract information in bulk from all its pages. LODIE then proposes to map webpages to an existing KB to identify the paths in the HTML parse tree that lead to known entities for concepts (e.g., movies), their attributes/relations (e.g., runtime, director), and associated values. These learned paths can then be applied to unannotated webpages on the site to extract further (analogous) information.

RathoreR (2014) [281] focus on Topic Modeling for webpages guided by a reference ontology. The overall process involves applying Keyphrase Extraction over the textual content of the webpage, mapping the keywords to an ontology, and then using the ontology to decide the topic. However, the authors propose to leverage the structure of HTML, where keywords extracted from the title, the meta-tags or the section headers are analyzed first; if no topic is found, the process resorts to using keywords from the body of the document.

YAGO (2007) [138] is another major initiative for extracting information from Wikipedia in order to create a Semantic Web KB. Most information is extracted from info-boxes, but also from categories, titles, etc. The system also combines information from GeoNames, which provides geographic context; and WordNet, which allows for extracting cleaner taxonomies from Wikipedia categories. A distinguishing aspect of YAGO2 is the ability to capture temporal information as a

– Although column headers can be identified as such using (