KBPearl: A Knowledge Base Population System Supported by Joint Entity and Relation Linking

Xueling Lin, Haoyang Li, Hao Xin, Zijian Li, Lei Chen
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology, Hong Kong, China
{xlinai, hlicg, hxinaa, zlicb, leichen}@cse.ust.hk

ABSTRACT

Nowadays, most openly available knowledge bases (KBs) are incomplete, since they are not synchronized with the emerging facts happening in the real world. Therefore, knowledge base population (KBP) from external data sources, which extracts knowledge from unstructured text to populate KBs, becomes a vital task. Recent research proposes two types of solutions that partially address this problem, but the performance of these solutions is limited. The first solution, dynamic KB construction from unstructured text, requires a specification of which predicates are of interest to the KB, which needs preliminary setups and is not suitable for an in-time population scenario. The second solution, Open Information Extraction (Open IE) from unstructured text, has limitations in producing facts that can be directly linked to the target KB without redundancy and ambiguity. In this paper, we present an end-to-end system, KBPearl, for KBP, which takes an incomplete KB and a large corpus of text as input, to (1) organize the noisy extraction from Open IE into canonicalized facts; and (2) populate the KB by joint entity and relation linking, utilizing the context knowledge of the facts and the side information inferred from the source text. We demonstrate the effectiveness and efficiency of KBPearl against the state-of-the-art techniques, through extensive experiments on real-world datasets.

PVLDB Reference Format:
Xueling Lin, Haoyang Li, Hao Xin, Zijian Li and Lei Chen. KBPearl: A Knowledge Base Population System Supported by Joint Entity and Relation Linking. PVLDB, 13(7): 1035-1049, 2020.
DOI: https://doi.org/10.14778/3384345.3384352

1. INTRODUCTION

With the recent development of information extraction techniques, numerous large-scale knowledge bases (KBs), such as Wikidata [63] and DBpedia [3], have been constructed. In these KBs, ontological knowledge is stored in the form of (subject, predicate, object) triples, representing millions of real-world semantic concepts, including entities, relations, specified types, and categories. These KBs provide back-end support for various knowledge-centric services of real-world applications, such as question answering [68] and online recommendations [7].

However, most KBs are not complete [44, 46], since they have limited coverage of what happens in the real world and are not synchronized with the emerging entities and their relations. Rerunning the KB construction process to keep these KBs up-to-date with real-time knowledge is too expensive, since most KBs are built in a batch-oriented manner. Therefore, knowledge base population (KBP) from external data sources has been proposed [25, 29]. Specifically, this is the task that takes an incomplete KB and a large corpus of text as input, then extracts knowledge from the text corpus to complete the incomplete elements in the KB.

Currently, there are two major types of research works that address the KBP issue. The first type of works, such as DeepDive [12, 45, 57] and SystemT [10], aims for a dynamic and broader construction of KBs. However, these prior works require a specification of which predicates are of interest to the KBP task. In other words, unless predicates such as birthplace have been specified preliminarily by the systems, the knowledge regarding these predicates will never be discovered automatically. This limits the ability of these systems to extract knowledge from unstructured text for KBP without any preliminary setups.

The second type of works, Open Information Extraction (Open IE) [13, 47, 53], partially addresses the task of KBP. The approaches based on Open IE allow the subjects and objects in the triples to be arbitrary noun phrases extracted from external sources, while predicates will be arbitrary verb phrases. Such Open IE approaches not only improve the efficiency of KBP in an unsupervised manner, but also recognize entities and predicates that are not recorded in the current KBs. Nevertheless, there are two major problems that these Open IE approaches suffer from:

• The canonicalization problem: the triples extracted by Open IE tools (referred to as Open triples) that represent the same semantic meaning are not grouped. For instance, the Open triples (Michael Jordan, was born in, California), (Professor Jordan, has birthplace, CA), and (M. Jordan, was born in, CA) will all be presented even if they represent an equivalent statement, which leads to redundancy and ambiguity. Recently, some techniques [21, 32, 61, 64] have been proposed to canonicalize Open triples, by clustering those with the same semantic meaning in one group. However, such approaches have high overheads and are not geared up for in-time KBP.

• The linking problem: the knowledge in the Open triples is not mapped to the ontological structure of the current KBs. In other words, Open IE techniques only extract (Michael Jordan, was born in, California), without indicating which entity the noun phrase Michael Jordan refers to. Recent techniques [16, 52] have been proposed to address both entity linking and relation linking tasks in short text, by linking the noun phrases (resp., relation phrases) detected in the text to the entities (resp., predicates) in the KB. However, there are still three drawbacks. First, if a noun phrase cannot be linked to any existing entity in the KB, then the knowledge regarding this noun phrase cannot be populated to the KB. Second, most of the existing works disambiguate each phrase in a document separately. Nevertheless, it is beneficial to consider entity and predicate candidates for the input document in combination (which we call "global coherence"), to maximize the usable evidence for the candidate selection process. For example, if a noun phrase artificial intelligence refers to AI (branch of computer science) rather than A.I. (movie), then we would expect the noun phrase Michael Jordan in the same document to be M. Jordan (computer scientist and professor) rather than M. Jordan (basketball player). Third, none of the existing works utilizes such global coherence for joint entity and relation linking in long-text documents.

In this paper, we propose KBPearl (KB Population supported by joint entity and relation linking), a novel end-to-end system which takes an incomplete KB and a set of documents as input, to populate the KB by utilizing the knowledge extracted and inferred from the source documents. Initially, for each document, KBPearl conducts knowledge extraction on the document to retrieve canonicalized facts and side information. Specifically, the side information includes (1) all the named entity mentions in the document; (2) the types of these named entity mentions; and (3) the timestamps in the document. After that, to capture the global coherence of concepts in the documents, KBPearl conducts KBP in the following steps. First, KBPearl constructs a semantic graph based on the knowledge extracted from the canonicalized facts and the side information. Second, KBPearl disambiguates the noun phrases and relation phrases in the facts, by determining a dense subgraph from the semantic graph in an efficient and effective manner. Finally, KBPearl links the canonicalized facts to the KB and creates new entities for non-linkable noun phrases.

Specifically, compared to the mainstream dynamic KB construction, we acquire facts for non-predefined predicates. Compared to the Open IE methods, we canonicalize the arguments of facts and link them to the target KB. Compared to the current techniques aiming for joint entity and relation linking, we conduct the joint linking task on long-text documents efficiently. To the best of our knowledge, this is the first work that conducts joint entity and relation linking and reports new entities for KBP leveraging the knowledge inferred from long-text documents. We summarize the novel contributions of this paper as follows.

• We present an end-to-end system KBPearl to populate the KB based on natural-language source documents with the support of Open IE techniques.
• We employ a semantic graph-based approach to represent the facts and side information in the source documents. In this way, we capture the concepts mentioned in the documents with global coherence. Specifically, we propose a novel similarity design schema to formulate the similarity between entity (resp., predicate) pairs based on their shared keyphrases stored in the target KB.
• We perform a joint entity and relation linking task on the semantic graph, aiming to disambiguate the entities and predicates jointly for KBP. Specifically, we formulate this task as a dense subgraph detection problem. We prove the NP-hardness of finding such a dense subgraph and propose a greedy graph densification algorithm as a solution. We also present two variants to employ the graph densification algorithm in the KBPearl system in an effective and efficient manner. Moreover, we locate the noun phrases representing new semantic concepts which do not exist in the target KB, and create new entities for them.
• We conduct extensive experiments to demonstrate the viability of our system on several real-world datasets. Experimental results prove the effectiveness and efficiency of KBPearl, compared with the state-of-the-art techniques.

The rest of the paper is organized as follows. We introduce the important definitions in Section 2, and present the system overview in Section 3. We illustrate the construction of the semantic graph in Section 4. We formulate the dense subgraph problem, and present the graph densification algorithm to perform joint entity and relation linking in Section 5. Section 6 presents our experimental results. We discuss the related works in Section 7 and conclude our paper in Section 8.

2. PROBLEM DEFINITION

We first introduce the important definitions. We then define the problem that our system addresses in Problem 1.

Definition 1. Knowledge Base (KB). A KB Ψ can be considered as a set of facts, where each fact is stored in the form of a triple (s, p, o), representing the subject, predicate, and object of a fact. In the KB Ψ, we denote the set of entities as EΨ, the set of predicates as PΨ, and the collection of literals as LΨ. For a triple (s, p, o), we have s ∈ EΨ, p ∈ PΨ and o ∈ EΨ ∪ LΨ.

Definition 2. Open Triples. The Open triples are extracted by the Open IE techniques from a given document. An Open triple is denoted as t = (n^sub, r, n^obj), where the noun phrase n^sub is the subject, r stands for the relation, and the noun phrase n^obj represents the object.

Definition 3. Side Information. The side information of a given document d, denoted by S(d), includes (1) all the named entity mentions detected in document d; (2) the types of these named entity mentions; and (3) the timestamp information in the document d.

Problem 1. KB Population From External Unstructured Text. Given an incomplete KB Ψ and a document d, our task is to (1) extract the side information S(d) and a set of canonicalized Open triples T = {t1, t2, ...} from the document d; and (2) link each triple ti = (n^sub, r, n^obj) in T to the KB Ψ in order to populate the knowledge in Ψ based on the side information S(d). Specifically, in each ti ∈ T, we decide whether n^sub is linked to an entity e ∈ EΨ, r is linked to a predicate p ∈ PΨ, and n^obj is linked to an entity e′ ∈ EΨ or referred to as a literal l ∈ LΨ.
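To make the inputs and outputs of Problem 1 concrete, the following is a minimal sketch of the objects defined above. The class and field names are illustrative assumptions, not taken from KBPearl's released code.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triple:          # Definition 1: a fact (s, p, o) in the KB Psi
    s: str             # subject entity id, s in E_Psi
    p: str             # predicate id, p in P_Psi
    o: str             # object: an entity id in E_Psi or a literal in L_Psi

@dataclass(frozen=True)
class OpenTriple:      # Definition 2: t = (n_sub, r, n_obj) from an Open IE tool
    n_sub: str
    r: str
    n_obj: str

@dataclass
class SideInfo:        # Definition 3: S(d)
    mentions: set[str] = field(default_factory=set)      # named entity mentions in d
    types: dict[str, str] = field(default_factory=dict)  # mention -> type, e.g., PERSON
    timestamps: list[str] = field(default_factory=list)  # time expressions found in d
```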

Note that for a triple ti = (n^sub, r, n^obj), there are two possible situations:

• The first situation is that n^sub is linked to an entity e ∈ EΨ, r is linked to a predicate p ∈ PΨ, and n^obj is linked to an entity e′ ∈ EΨ or referred to as a literal l ∈ LΨ. If (e, p, e′) or (e, p, l) does not exist in the KB Ψ, we add it as a new fact to KB Ψ;
• The second situation is that at least one of n^sub and n^obj is determined to be non-linkable to the KB Ψ. If n^sub is non-linkable, we report a new entity for it. If n^obj is non-linkable and it is detected as a named entity mention in S(d), we also report a new entity for it. Finally, we add the new fact to the KB Ψ as well.

3. SYSTEM FRAMEWORK OVERVIEW

Figure 1 depicts the framework overview of our system. Given a KB Ψ and an input document d, our framework works in two major stages.

[Figure 1 (schematic): Stage 1, Knowledge Extraction — Step 1.1 Side Information Extraction (sentence tokenizer, POS tagger, named entity recognition, time and source constraints), Step 1.2 Triple Extraction (Open IE tool), Step 1.3 Canonicalization; Stage 2, Knowledge Linking — Step 2.1 Semantic Graph Construction (candidate entity and predicate generation against the target KB Ψ), Step 2.2 Graph Densification and Linking.]

Figure 1: Framework overview.

3.1 Stage 1: Knowledge Extraction

In this stage, we conduct pre-processing and knowledge extraction from the input document d.

Step 1.1: Side Information Extraction. The document d is pre-processed by a pipeline of linguistic tools, including (1) sentence tokenization, (2) part-of-speech (POS) tagging, (3) noun-phrase chunking and named entity recognition (NER) with entity typing, and (4) Time Tagger tools which extract time information from d. The output of this step is the set of side information S(d) of document d (defined in Definition 3).

Step 1.2: Triple Extraction Supported by Open IE. In this step, we employ an existing Open IE tool to extract the triples in the form of t = (n^sub, r, n^obj). Specifically, we choose several popular Open IE tools in the experiments to evaluate the performance of our system (the details are shown in Section 6). The output of this step is a set of Open triples T(d) = {t1, t2, ...}.

Step 1.3: Triple Canonicalization and Noise Reduction. Step 1.2 produces a large set of candidate Open triples, which may still contain noise and redundancy. In this step, to reconcile the semantic redundancy and group the triples with the same semantic meaning, we perform triple canonicalization in three major tasks. First, we pre-process the triples, by filtering those that only contain pronouns, stopwords, and interrogative words as the subjects and objects. Second, we conduct basic canonicalization on the triples based on their noun phrases, by utilizing the side information S(d) obtained in Step 1.1. Specifically, we make two assumptions: (1) one particular surname or first name in one article refers to the same PERSON-type entity; for example, Jordan in an article refers to Michael Jordan, which appears in the former part of the same article; (2) the pronouns in one sentence refer to the first subject of the triple extracted from the former sentences, for co-reference [5]. Third, following QKBfly [44], we employ a pattern dictionary [43] to conduct basic canonicalization on the triples based on their relation phrases. The output of this step is a set of canonicalized Open triples T*(d).

3.2 Stage 2: Knowledge Linking

In this stage, we conduct knowledge linking from the output of Stage 1 to the target KB.

Step 2.1: Semantic Graph Construction. For each document d, we build a semantic graph based on the side information S(d) (produced in Step 1.1) and the extracted canonicalized Open triples T*(d) (produced in Step 1.3). The semantic graph captures all the noun phrases (resp., relation phrases) mentioned in document d, as well as all the candidate entities (resp., predicates) for each noun phrase (resp., relation phrase) (see Section 4).

Step 2.2: Graph Densification and Linking. In this step, given the graph constructed in Step 2.1, our task is to wisely select the optimal entity for each noun phrase, and the optimal predicate for each relation phrase. The optimal linking result is expected to represent global coherence among these entities and predicates. We formulate the task of joint entity and relation linking as an efficient graph densification task (see Section 5).

4. SEMANTIC GRAPH CONSTRUCTION

Given a set of canonicalized Open triples T*(d) extracted from a natural-language document d (produced in Step 1.3), and a set of side information S(d) extracted from d (produced in Step 1.1), the KBPearl system builds a weighted undirected semantic graph to represent all this knowledge in an effective manner. An example of a semantic graph is demonstrated in Figure 2.

We refer to the semantic graph built in this section as G = (V, E). Specifically, V denotes the set of vertices (i.e., nodes), as introduced in Section 4.1, while E denotes the set of edges among the nodes in G, as presented in Section 4.2.

4.1 Graph Nodes

A node in the semantic graph is a container for the following semantic knowledge extracted from the document. In the semantic graph G = (V, E), there are four different types of nodes in V, i.e., V = N ∪ E ∪ R ∪ P.
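As a rough illustration of how the four node sets could be materialized from T*(d) and S(d), consider the sketch below. Here candidate_entities and candidate_predicates stand in for the alias-based candidate generation described in Section 4.1; they are assumed helpers, not the system's actual API.

```python
# Illustrative construction of the four node sets V = N, E, R, P (Section 4.1).
def build_nodes(triples, side_info, candidate_entities, candidate_predicates):
    # Noun phrases: subjects/objects of Open triples plus mentions from S(d).
    N = ({t.n_sub for t in triples} | {t.n_obj for t in triples}
         | set(side_info.mentions))
    # Relation phrases: the relations of the Open triples.
    R = {t.r for t in triples}
    # Candidate entities and predicates looked up in the target KB.
    E = {e for n in N for e in candidate_entities(n, side_info.types.get(n))}
    P = {p for r in R for p in candidate_predicates(r)}
    return N, E, R, P
```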

[Figure 2 (example):
input document d = "Michael Jordan and Kurt Miller studied artificial intelligence. Jordan supervised Miller in 2010."
side information S(d) = {Michael Jordan: PERSON; Kurt Miller: PERSON; artificial intelligence: NULL; 2010: TIME-STAMP}
canonicalized Open triples T*(d) = {(Michael Jordan, study, artificial intelligence); (Kurt Miller, study, artificial intelligence); (Michael Jordan, supervise, Kurt Miller)}
Noun Phrase Nodes (N): Michael Jordan: PERSON; Kurt Miller: PERSON; artificial intelligence: NULL; 2010: TIME-STAMP
Entity Nodes (E): M. Jordan (basketball player); M. Jordan (professor); Kurt Miller (baseball player); Kurt Miller (new entity); A.I. (movie); AI (CS branch)
Predicate Nodes (P): supervise (doctoral student(s) of a professor); be educated at (educational institution attended by subject); field of study (specialization of a person)
Relation Phrase Nodes (R): supervise; study]

Figure 2: Example of a semantic graph. The solid edges are the potential edges that comprise the dense subgraph to capture the best coherence among the noun phrase nodes, entity nodes, relation phrase nodes, and predicate nodes. "Kurt Miller" is recognized as a new entity.

Noun Phrase Nodes (N). A noun phrase node n ∈ N represents a subject or object noun phrase in an Open triple t ∈ T*(d), or a noun phrase in the side information S(d). Specifically, each noun phrase is assigned a candidate type from S(d). For example, in Figure 2, the noun phrase node Michael Jordan is typed as PERSON. Moreover, a noun phrase without type information in S(d) is assigned the type NULL.

Entity Nodes (E). An entity node e ∈ E denotes an existing entity in the target KB Ψ. Note that E ⊂ EΨ. Specifically, for each noun phrase node n, we generate a set of candidate entity nodes that (1) have n as one of their aliases, and (2) match the type of the noun phrase provided by the side information. For example, for the noun phrase node Michael Jordan typed as PERSON in Figure 2, we involve all the entities that have Michael Jordan as one of their aliases and are typed as PERSON in Ψ as the candidate entity nodes, including M. Jordan (basketball player) and M. Jordan (professor). For a noun phrase node typed as NULL, we involve all the entities with this noun phrase as their alias name as the candidate entity nodes, without any type constraint.

Relation Phrase Nodes (R). A relation phrase node r ∈ R represents a relation phrase in an Open triple t ∈ T*(d). An example is the relation phrase node study in Figure 2.

Predicate Nodes (P). A predicate node p ∈ P denotes an existing predicate in the target KB Ψ. Note that P ⊂ PΨ. Specifically, for each relation phrase node r, we generate a set of predicate nodes that have r as one of their aliases. For example, for the relation phrase node study in Figure 2, we involve all the predicates that have study as one of their alias names in Ψ, including field of study and be educated at.
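The alias-plus-type filter described above can be expressed compactly. In this hedged sketch, alias_index and entity_types are hypothetical lookup tables built offline from the target KB Ψ.

```python
# Sketch of candidate entity generation (Section 4.1): keep entities that share
# an alias with the noun phrase and match its type; with type NULL, drop the
# type constraint entirely.
def candidate_entities(noun_phrase, np_type, alias_index, entity_types):
    candidates = alias_index.get(noun_phrase, set())   # entities having this alias
    if np_type is None or np_type == "NULL":
        return candidates
    return {e for e in candidates if entity_types.get(e) == np_type}
```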

4.2 Graph Edges

An edge between two nodes in the semantic graph represents the similarity measure between the two nodes. In the semantic graph G = (V, E), there are five different types of edges in E, where the details are listed as follows.

Noun Phrase-Entity Edges. Let φe(ni, ej) denote the weight of the noun phrase-entity edge between a noun phrase node ni ∈ N and an entity node ej ∈ E. Specifically, φe(ni, ej) reflects the likelihood that entity ej is the correct linking for the noun phrase ni. Following the recent entity disambiguation approaches [28, 41, 44], we compute φe(ni, ej) = pop(ni, ej), which is the relative frequency under which a query of ni points to ej in the target KB Ψ. For example, Michael Jordan refers to M. Jordan (basketball player) in 75% of all its occurrences, while only 15% to M. Jordan (professor). Specifically, pop(ni, ej) can be easily obtained by using the number of sitelinks of ej regarding ni in KB Ψ as the popularity of ej. Note that the textual similarity between ni and ej is not considered, since we only choose the entities that share the same alias name with ni as its candidate entities (see Section 4.1).

Relation Phrase-Predicate Edges. Let φp(ri, pj) denote the weight of the relation phrase-predicate edge between a relation phrase node ri ∈ R and a predicate node pj ∈ P. Specifically, φp(ri, pj) reflects the likelihood that predicate pj is the true linking for the relation phrase ri. We collect the synonymous phrases of the relation phrase ri from pattern repositories such as PATTY [43], denoted as P(ri). Moreover, we harness the target KB Ψ for labels and aliases of the predicate pj, denoted as PΨ(pj). We formulate φp(ri, pj) as the overlap coefficient [62] between P(ri) and PΨ(pj), as shown in Equation 1:

    φp(ri, pj) = |P(ri) ∩ PΨ(pj)| / min(|P(ri)|, |PΨ(pj)|)    (1)
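Equation 1 is straightforward to implement. The sketch below assumes the two phrase sets have already been collected from PATTY and from the KB's labels and aliases.

```python
# Overlap coefficient of Equation (1) for a relation phrase-predicate edge.
def phi_p(P_r: set, P_psi_p: set) -> float:
    if not P_r or not P_psi_p:
        return 0.0
    return len(P_r & P_psi_p) / min(len(P_r), len(P_psi_p))
```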

Entity-Entity Edges. Let ρ(ei, ej) denote the weight of the entity-entity edge between a pair of entity nodes ei, ej ∈ E. Specifically, ρ(ei, ej) represents the pairwise relatedness between ei and ej. An entity-entity edge only exists between a pair of entity nodes connected to different noun phrase nodes. The reason is that under the global coherence assumption, to select the optimal entity for a noun phrase, we only consider whether it is closely related to the candidate entities of other noun phrases in the same document. For example, in Figure 2, there is no edge between M. Jordan (basketball player) and M. Jordan (professor), since they are connected to the same noun phrase node Michael Jordan.

We propose a novel solution to formulate the pairwise relatedness between entities ei and ej in a target KB Ψ, by utilizing their keyphrases extracted from Ψ. We first formally define the keyphrases for an entity e in a KB Ψ.

Definition 4. Keyphrases of Entities. The keyphrases of an entity e in the KB Ψ, denoted as K(e), is the set of predicate-object pairs of e stored in Ψ. Particularly, suppose that there are m facts stored in Ψ where all of them have entity e as the subject, i.e., {(e, p1, o1), (e, p2, o2), ..., (e, pm, om)}; then we have K(e) = {(p1, o1), (p2, o2), ..., (pm, om)}. Furthermore, we denote Kp(e) = {p1, p2, ..., pm} as the predicate-keyphrases of entity e, and Ko(e) = {o1, o2, ..., om} as the object-keyphrases of entity e.

For instance, in Figure 3, K(e3) = {(instance of, academic discipline), (has part, machine learning), (field, computer science)}, while Kp(e3) = {instance of, has part, field}.

[Figure 3 (keyphrase examples):
e1 = M. Jordan (professor): instance of: human; nationality: USA; language: English; work place: California; occupation: professor; field of study: computer science; topic of interest: machine learning
e2 = A.I. (movie): instance of: movie; country of origin: USA; language: English; filming location: California
e3 = AI (CS branch): instance of: academic discipline; has part: machine learning; field: computer science
p1 = field of study (specialization of a person): instance of: Wikidata property; related property: field of work]

Figure 3: Keyphrases in entities and predicates.

To measure the relatedness of entity pairs, we first assign weights to the predicate-keyphrases and object-keyphrases of the entities based on the Inverse Document Frequency (IDF), which measures how important each keyphrase is in a global view. The keyphrases with higher IDF weights are considered more uncommon and important. For example, in Figure 3, although e1 shares more object-keyphrases with e2 (USA, English, and California) than with e3 (machine learning and computer science), e1 is more related to e3, since the object-keyphrases they share are more uncommon.

Specifically, for a predicate-keyphrase pm, we compute its weight as ω(pm) = log2(|EΨ| / |E_pm|), where E_pm denotes the set of entities that contain predicate pm in their keyphrases, while |EΨ| is the total number of entities in the target KB Ψ. Similarly, for an object-keyphrase om, we have ω(om) = log2(|EΨ| / |E_om|). Letting (pm, om) denote a keyphrase tuple, we define the pairwise relatedness between entities ei and ej as follows:

    ρ(ei, ej) = ( Σ_{om ∈ Ko(ei) ∩ Ko(ej)} ω(pm) · ω(om) ) / ( Σ_{om ∈ Ko(ei) ∩ Ko(ej)} ω(pm) )    (2)

Predicate-Predicate Edges. We use ρ(pi, pj) to denote the weight of the predicate-predicate edge between a pair of predicate nodes pi, pj ∈ P. Specifically, ρ(pi, pj) represents the pairwise relatedness between pi and pj. Note that a predicate-predicate edge only exists between a pair of predicate nodes connected to different relation phrase nodes. Similar to the entities, the keyphrases of a predicate p in the target KB are denoted as K(p), which is a set of predicate-object pairs that describe p in the KB. Take Figure 3 as an example, where we have K(p1) = {(instance of, Wikidata property), (related property, field of work)}. Letting (pm, om) denote a keyphrase tuple, we define ρ(pi, pj) as the pairwise relatedness between pi and pj as follows:

    ρ(pi, pj) = ( Σ_{om ∈ Ko(pi) ∩ Ko(pj)} ω(pm) · ω(om) ) / ( Σ_{om ∈ Ko(pi) ∩ Ko(pj)} ω(pm) )    (3)

Entity-Predicate Edges. Let ρ(ei, pj) denote the weight of the entity-predicate edge between an entity node ei ∈ E and a predicate node pj ∈ P. Specifically, ρ(ei, pj) represents the pairwise relatedness between ei and pj. We determine that ρ(ei, pj) = ω(pj) if both of the following conditions are satisfied: (1) the noun phrase node ni linked to ei and the relation phrase node ri linked to pj are in the same triple t ∈ T*(d); and (2) a fact (ei, pj, x) exists in the KB Ψ with x ≠ NULL (x can be either an entity or a literal). Otherwise, ρ(ei, pj) = 0. Specifically, we use ω(pj) as the value of ρ(ei, pj), because we need to avoid the cases where popular predicates are linked to too many entities and hence dominate the other predicates.
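A possible reading of Equations 2 and 3 in code form is given below. Since the equations leave implicit which side's predicate pm accompanies a shared object om, this sketch pairs each shared object with the predicate it appears with in the first argument's keyphrases; treat it as one plausible interpretation, not the authors' implementation.

```python
import math

# IDF-style keyphrase weight of Section 4.2: omega(m) = log2(|E_psi| / |E_m|).
def idf(num_entities_total: int, num_entities_with_keyphrase: int) -> float:
    return math.log2(num_entities_total / num_entities_with_keyphrase)

# Keyphrase-based relatedness in the spirit of Equations (2) and (3).
# K_i, K_j are sets of (predicate, object) pairs; w_pred / w_obj return IDF weights.
def relatedness(K_i: set, K_j: set, w_pred, w_obj) -> float:
    objects_j = {o for (_, o) in K_j}
    shared = {(p, o) for (p, o) in K_i if o in objects_j}  # shared object-keyphrases
    denom = sum(w_pred(p) for (p, _) in shared)
    if denom == 0:
        return 0.0
    return sum(w_pred(p) * w_obj(o) for (p, o) in shared) / denom
```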

5. GRAPH DENSIFICATION & LINKING

In this section, our task is to wisely conduct the joint entity and relation disambiguation and linking. Formally, given a semantic graph G = (V, E), where V = N ∪ E ∪ R ∪ P, we select one optimal entity node e ∈ E for each noun phrase node n ∈ N, and one optimal predicate node p ∈ P for each relation phrase node r ∈ R (as shown in Figure 2).

Specifically, the optimal linking result is expected to represent global coherence among these entities and predicates, where each entity and predicate is related to at least one of the other entities and predicates in the result. Following the community-search problem [59], we seek to transform this candidate selection problem into a dense subgraph problem. In this way, the subgraph that contains the optimal candidate entities and predicates is densely connected.

We discuss the definition of subgraph density and the definition of our dense subgraph problem in Section 5.1. We propose a greedy algorithm to address the dense subgraph problem in Section 5.2. In Section 5.3, we deploy the greedy algorithm in the KBPearl system effectively and efficiently.

5.1 The Dense Subgraph Problem

5.1.1 Definition of Subgraph Density

In this step, we discuss the best definition of subgraph density, so that the global coherence of the candidate entities and predicates is captured.

First, we determine the formulation of the node degrees in the semantic graph. To capture global coherence among the entity nodes and predicate nodes in the desired subgraph, we need to examine whether a given node in the graph has a strong connection with the other nodes. Hence, we define the weighted degree of a node v in the graph to be the total weight of its incident edges. Formally, in the semantic graph G = (V, E), let ε(v) denote the set of incident edges of the node v, and let w(µ) represent the weight of an edge µ ∈ E; the weighted degree of v is defined as w(v) = Σ_{µ ∈ ε(v)} w(µ).

Second, we determine how to measure the density of the desired subgraph H. There are two methods for density measurement in terms of the weighted degrees of its nodes. The first method regards the average degree of the nodes in H = (VH, EH) as its density, i.e., d(H) = ( Σ_{v ∈ VH} w(v) ) / |VH|. This measurement has been extensively studied in the dense subgraph area [1, 59]. However, one drawback of this method is that an unrelated but densely connected community can be easily added to the graph to increase the average density. In other words, in our semantic graph, an entity node (or predicate node) that shares heavily weighted edges with half of the other nodes can easily dominate an entity node (or predicate node) that connects with every other node. Take the entity node M. Jordan (basketball player) in Figure 2 as an example. The total weight of this node will be high, just because it has rather high popularity and has a heavy edge linked to the noun phrase Michael Jordan, but not because of its dense connection to other entity nodes. Therefore, popular nodes such as M. Jordan (basketball player) may easily dominate other nodes. Hence, this measurement may cause failure to select a group of nodes where global coherence is expected to be achieved.

The second method, which is used in this paper, is to regard the minimum degree of the nodes in H as the density of H, i.e., d(H) = min_{v ∈ VH} w(v). This measurement does not suffer from the failure cases in which some popular nodes dominate the others and form a non-related dense community. Maximizing the minimum degree of the nodes in H can guarantee the global coherence goal, such that the coherence among all the entities and predicates is taken into account.

5.1.2 Definition of the Dense Subgraph Problem

Our goal is to compute a subgraph with maximum density, while observing constraints on the subgraph structure.

Problem 2. The Dense Subgraph Problem. Given an undirected connected weighted semantic graph G = (V, E), where V = N ∪ E ∪ R ∪ P, our target is to find an induced subgraph H = (VH, EH) of G, where VH = N ∪ EH ∪ R ∪ PH. Specifically, H satisfies all the following constraints:

• H is connected;
• H contains every noun phrase node and every relation phrase node, i.e., N ⊆ VH and R ⊆ VH;
• Every noun phrase node in H is connected to one entity node, and every relation phrase node in H is connected to one predicate node. A noun phrase node and a relation phrase node cannot be linked to each other directly. Moreover, an entity node can be connected to more than one noun phrase node, and a predicate node can be connected to more than one relation phrase node. Therefore, H has at most 2 · (|N| + |R|) nodes, i.e., |VH| ≤ 2 · (|N| + |R|);
• The minimum degree of H, i.e., min_{v ∈ VH} w(v), is maximized among all feasible choices for H.

Theorem 1. Problem 2 is NP-hard.

Proof. To prove that Problem 2 is NP-hard, we give a unit-weight instance of Problem 2, as well as a simple reduction of this instance to the Steiner tree problem with unit weights. It is known that the Steiner tree problem is NP-hard [31] [6] (Theorem 20.2). In the decision version of the Steiner tree problem with unit weights, given a graph G = (V, E), a set of nodes T ⊆ V (usually referred to as terminals), and an integer k, we need to determine whether there is a subtree of G which contains all nodes in T with at most k edges. Specifically, we set k = 2 · |T| − 1 in our problem.

Given an instance of the Steiner tree problem, with G as the input graph, T as the set of terminal nodes, and 2 · |T| − 1 as an upper bound on the number of edges, we define a unit-weight instance of the decision version of our problem as follows. Given G as the input graph and T = N ∪ R as the union set of the noun phrase nodes and relation phrase nodes, the upper bound on the number of nodes is 2 · |T|, and we need to find a graph with a minimum degree of at least 1.

We show that there is a solution for the Steiner tree problem if and only if there is a solution for the unit-weight instance of our problem. First, any Steiner tree using at most 2 · |T| − 1 edges is also a solution for our problem using at most 2 · |T| nodes. Second, given a solution H for our problem containing at most 2 · |T| nodes, we can compute a Steiner tree with at most 2 · |T| − 1 edges by simply taking any spanning tree of H.

5.2 Greedy Algorithm

As discussed above, finding a dense subgraph in the semantic graph G with bounded size (Problem 2) is an NP-hard problem. To address this problem, we extend the greedy algorithm for finding strongly interconnected and size-limited groups in social networks [28, 59], while also locating the new entities. The pseudo-code of the greedy algorithm is presented in Algorithm 1.

Algorithm 1 GraphDensification
Input: The semantic graph G = (V, E), where V = N ∪ E ∪ R ∪ P. Specifically, ηG(v) denotes the neighbour nodes of v ∈ V in the graph G, and εG(v) denotes the incident edges of v ∈ V in the graph G.
Output: The dense subgraph H = (VH, EH) which satisfies the constraints in Problem 2, where VH = N ∪ EH ∪ R ∪ PH.
 1: H ← G
 2: V′ ← sorted(V)    ▷ Sort all v ∈ V based on their weights, from least to greatest.
 3: for each v ∈ list(V′) do
 4:     if v ∈ N or v ∈ R then continue;
 5:     else
 6:         for each n ∈ N do
 7:             if v ∈ ηG(n) and |ηG(n)| > 1 then
 8:                 H ← (VH − v, EH − εG(v));
 9:         for each r ∈ R do
10:             if v ∈ ηG(r) and |ηG(r)| > 1 then
11:                 H ← (VH − v, EH − εG(v));
12: for each e′ ∈ EH do
13:     if Σ_{x ∈ ηH(e′) ∧ x ∉ N} ρ(x, e′) < θ then
14:         Report a new entity for the n ∈ N linked to e′;
15: return H;

We start with the full semantic graph G = (V, E). We first sort all the nodes v ∈ V in G based on their weights, from least to greatest (line 2). We then iteratively remove the entity nodes and predicate nodes with the smallest weighted degree (lines 3-11). Specifically, to guarantee that we capture the coherence of the linkings for all the entity nodes and predicate nodes, we enforce each noun phrase node to remain connected with one entity node (lines 6-8), and each relation phrase node to remain connected with one predicate node (lines 9-11). The algorithm terminates when (1) at least one of the noun phrase nodes or relation phrase nodes has the minimum degree in H, or (2) a noun phrase node n ∈ N or a relation phrase node r ∈ R would be disconnected in the next loop.

To recognize the new entities, we further conduct a post-processing task. For each entity node e′ ∈ EH in the dense subgraph H, we compute the sum of the weights of the incident edges of e′ in H (except for the noun phrase-entity edges). If it is smaller than a given threshold θ, we report a new entity for the noun phrase linked to e′ (lines 12-14). We discuss the learning of the threshold θ in Section 6.1.
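The following is a condensed, assumption-laden sketch of the greedy strategy behind Algorithm 1 (it omits the new-entity post-processing of lines 12-14). The graph is an adjacency map from node to {neighbour: weight}, terminals is N ∪ R, and the stopping rule mirrors termination condition (1) above; it mirrors the paper's greedy strategy but is not the authors' code.

```python
def graph_densification(G: dict, terminals: set) -> dict:
    # Work on a copy so the input graph is untouched.
    H = {v: dict(nbrs) for v, nbrs in G.items()}

    def degree(v):  # weighted degree w(v) within the current H
        return sum(H[v].values())

    while True:
        # Candidate nodes whose removal leaves every terminal with a candidate.
        removable = [v for v in H if v not in terminals
                     and all(len(H[t]) > 1 or v not in H[t] for t in terminals)]
        if not removable:
            break  # any further removal would strand a noun/relation phrase
        v = min(removable, key=degree)
        if degree(v) >= min(degree(t) for t in terminals):
            break  # a terminal already attains the minimum degree; stop
        for u in H.pop(v):          # drop v and its incident edges
            H[u].pop(v, None)
    return H
```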
The time complexity of Algorithm 1 is analyzed as follows. We record the weight of each node during the graph construction process and use a list to record the neighbour nodes of each node. The loop in lines 3 to 11 takes O(|V′|) = O(|V|) time to traverse all the nodes v ∈ V′. Lines 6 to 8 search every noun phrase node n ∈ N, which takes O(|N|) time. Similarly, lines 9 to 11 take O(|R|) time. To locate new entities, lines 12 to 14 need O(|EH|) time. To summarize, let β = |N| + |R|; the time complexity of Algorithm 1 is O(|V| · (|N| + |R|) + |EH|) ≤ O(|V| · (|N| + |R|) + |N| + |R|) = O(β · |V| + β) ≈ O(β · |V|). Specifically, since we reduce the size of the semantic graph via the triple canonicalization in Step 1.3 of Stage 1, β will be a relatively small number for one document.

5.3 Efficient and Effective Linking

However, the assumption that all entities mentioned in a document are densely connected with each other in a KB is not always correct. This assumption hurts the performance in both effectiveness and efficiency.

In the aspect of effectiveness, in a long-text document, not every pair of entities or predicates shares strong relatedness. Take the case illustrated in Figure 4 as an example (we omit the predicates for simplicity). Among the five entities, only two pairs of them are closely related. This indicates that sparse coherence between the entities in documents is common, especially when there are new entities that cannot be connected to any other nodes. Recent research on finding the top-k densest subgraphs in a large graph [4, 23] and on overlapping community detection [65] partially addresses this issue. However, all these methods require a large graph that contains the weight information of every edge as input. This requires the pre-computation of each entity-entity edge, predicate-predicate edge and entity-predicate edge.

[Figure 4 (example): the sample document "Michael Jordan is a professor at the University of California, Berkeley. He visited the BAAI Conference held in Beijing, China in November, 2019." with entities M. Jordan (professor), University of California, Berkeley (university), BAAI Conference (new entity), Beijing (city), and China (country); only the pairs M. Jordan-University of California, Berkeley and Beijing-China are connected.]

Figure 4: The coherence of entities in the sample document, where entity mentions are underlined. The edges represent real semantic relatedness between entities recorded in the KB.

Moreover, in the aspect of efficiency, suppose that the total number of entity nodes (resp., predicate nodes) in the semantic graph is |E| (resp., |P|); computing the weights of all entity-entity edges then requires |E|·(|E|−1)/2 (resp., |P|·(|P|−1)/2 for the predicate-predicate edges) computations. Even if the task is parallelizable, overcoming such complexity is necessary to achieve satisfactory scalability, especially for on-the-fly tasks such as real-time KBP.

In order to address the issues mentioned above, we propose two variants of the KBPearl system as solutions.

KBPearl in Pipeline Mode (KBPearl-PP). We show the pseudo-code of the pipeline mode in Algorithm 2. Given a document as input, we first separate it into sentence groups G = {g1, g2, ..., g|G|} (the sentence tokenization can be obtained in Step 1.1), where each group in {g1, g2, ..., g|G|−1} contains k sentences and the last group g|G| contains (|G| mod k) sentences. For each document portion composed of gi ∈ G, we build a semantic graph Gi (line 5), and perform Algorithm 1 to obtain a dense subgraph Hi (line 7). Specifically, this task is processed in sequential order, and the linking results in Hi will be utilized to derive Hi+1 (lines 6-10).

Algorithm 2 KBPearl-Pipeline
Input: The sentence groups of the input document G = {g1, g2, ..., g|G|}, where each of {g1, g2, ..., g|G|−1} contains k sentences, and g|G| contains (|G| mod k) sentences.
Output: F(d), the final linking results of the input document. Note that F(d) contains all the noun phrase-entity edges and relation phrase-predicate edges in each Hi for gi ∈ G.
 1: ψe ← ∅;    ▷ Record set of the noun phrase-entity edges.
 2: ψp ← ∅;    ▷ Record set of the relation phrase-predicate edges.
 3: F(d) ← ∅;    ▷ The final linking results.
 4: for each gi ∈ G do
 5:     Build semantic graph Gi for the document based on gi;
 6:     Gi ← ψe, ψp;
 7:     Hi ← GraphDensification(Gi);    ▷ Employ Algorithm 1.
 8:     Obtain ψe(Hi) and ψp(Hi) from Hi as the linking results;
 9:     ψe^temp ← ψe(Hi) − ψe, ψp^temp ← ψp(Hi) − ψp;
10:     ψe ← ψe^temp, ψp ← ψp^temp;
11:     F(d) ← ψe(Hi), ψp(Hi);
12: return F(d);

We analyze the time complexity of the pipeline mode in Algorithm 2 as follows. For a document with γ sentences, there will be |G| = γ/k + 1 iterations. Since the complexity of generating Hi based on Gi in each iteration is O(βi · |Vi| + βi) (see Section 5.2), where |Vi| is the total number of nodes in Gi, the total time complexity is O((γ/k + 1) · (βi · |Vi| + βi)) = O((γ/k + 1) · βi · (|Vi| + 1)) ≈ O((γ/k) · βi · |Vi|). Note that βi is the total number of noun phrases and relation phrases in the sentence group that gi is built on, which is relatively small because of the canonicalization in Step 1.3.

KBPearl in Near-Neighbour Mode (KBPearl-NN). We show the pseudo-code of the near-neighbour mode in Algorithm 3. Specifically, to construct the edges of the semantic graph (lines 2-14), the entity-entity edges are computed only if the pair of related noun phrases are within distance k in the document d (lines 2-5). For example, in Figure 4, the distance between Michael Jordan and Beijing is 3, while the distance between BAAI Conference and Beijing is 1. The predicate-predicate edges are computed only if the pair of related relation phrases are in the same sentence (lines 6-10). The entity-predicate edges are computed only if the related noun phrase and relation phrase are in the same sentence (lines 11-14).

Algorithm 3 KBPearl-NearNeighbour
Input: The input document d = {a1, a2, ..., a|d|}, where each ai is a sentence in d; side information S(d); canonicalized triples T*(d); maximum noun phrase near-neighbour distance k.
Output: F(d), the final linking results of the input document, which contains all the noun phrase-entity edges and relation phrase-predicate edges in the dense subgraph H of G.
 1: Generate V = N ∪ E ∪ R ∪ P from S(d) and T*(d) (Section 4.1); E includes all the noun phrase-entity edges for each n ∈ N, and all the relation phrase-predicate edges for each r ∈ R; we use ηk(ni) to record the k noun phrases that are near neighbours of the noun phrase ni ∈ N, E(ni) to denote the candidate entities of ni ∈ N, P(ri) to denote the candidate predicates of ri ∈ R, and sen(ai) to record the noun phrases and relation phrases in the sentence ai of d.
 2: for each ni ∈ N do
 3:     for each nj ∈ ηk(ni) do
 4:         for each ei ∈ E(ni) and each ej ∈ E(nj) do
 5:             if ei ≠ ej then E ← ρ(ei, ej);
 6: for each ai ∈ d do
 7:     for each ri ∈ sen(ai) do
 8:         for each rj ∈ sen(ai) do
 9:             for each pi ∈ P(ri) and each pj ∈ P(rj) do
10:                 if pi ≠ pj then E ← ρ(pi, pj);
11:     for each ni ∈ sen(ai) do
12:         for each rj ∈ sen(ai) do
13:             for each ei ∈ E(ni) and each pj ∈ P(rj) do
14:                 E ← ρ(ei, pj);
15: G ← (V, E);
16: H ← GraphDensification(G);    ▷ Employ Algorithm 1.
17: F(d) ← ψe(H), ψp(H);
18: return F(d);

Suppose that α is the average number of candidates for a noun phrase or relation phrase. Originally, constructing the edges among all entity nodes and predicate nodes in the semantic graph G takes O(|E|² + |P|² + |E||P|) = O(α²|N|² + α²|R|² + α²|N||R|) = O(α²(|N|² + |R|² + |N||R|)) time. Suppose that the average number of noun phrases and relation phrases in a sentence is w, and let |d| denote the average number of sentences in one document. Algorithm 3 reduces this time complexity to O(α²k|N| + |d|(w²α² + w²α²)) = O(α²(k|N| + 2|d|w²)). To achieve promising efficiency, k is set as a small number compared with |N| and |R|.

6. EXPERIMENTS

In our experiments, we first compare the performance of different variants of KBPearl on the joint entity and relation linking task against the state-of-the-art techniques. We then demonstrate the ability of KBPearl to determine new entities. Third, we validate the effectiveness and efficiency of KBPearl based on the length of the source document.

6.1 Experiment Settings

Benchmarks. We validate the performance of KBPearl and the baselines on several real-world datasets.

• ReVerb38 is a dataset proposed for Open KB canonicalization [61] with 45K triples extracted from the English Clueweb09 [8] corpus. We manually annotate facts in 38 documents based on Wikidata and DBpedia as ground truths. The average document length is 18.34 words.
• NYT2018 contains news articles collected from the New York Times (nytimes.com) in 2018 from various domains [32]. We select 30 documents and manually annotate the facts based on Wikidata and DBpedia as ground truths. The average document length is 206.20 words.
• LC-QuAD2.0 [15] comprises 6046 complex questions for Wikidata (80% of the questions have more than one entity and relation). We select 1942 questions for testing, where the average question length is 14.13 words.
• QALD-7-Wiki [60] is one of the most popular benchmark datasets for question answering. This dataset is mapped over Wikidata, comprising 100 questions. The average question length is 6.87 words, and over 50% of the questions include a single entity and relation.
• T-REx [18] is a benchmark dataset which evaluates KBP, relation extraction, and question answering. We select 1815 documents with aligned Wikidata triples as ground truths. The average document length is 116.14 words.
• Knowledge Net [39] is a benchmark dataset for automatic KBP with facts extracted from natural language text on the web. It contains 3977 documents, and 6920 triples linked to Wikidata. The average document length is 157.22 words.
• CC-DBP [26] is a benchmark dataset for KBP based on Common Crawl and DBpedia. We select 969 documents with non-empty triples as ground truths. The average document length is 18.39 words.

Specifically, the original CC-DBP dataset only provides ground truths linked to DBpedia, while T-REx, Knowledge Net, LC-QuAD2.0, and QALD-7-Wiki provide ground truths linked to Wikidata. In order to utilize all the datasets for evaluation, we employ SPARQL queries¹. We produce the entity mapping between DBpedia and Wikidata based on the relation owl:sameAs in DBpedia, and the relation mapping based on the property owl:equivalentProperty in DBpedia as well as the predicate P1628 ("equivalent property") in Wikidata. To ensure fairness, we only evaluate the documents where both the DBpedia and Wikidata ground truths are not null.

¹https://www.w3.org/TR/rdf-sparql-query/
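One way such a mapping query could look is sketched below with the SPARQLWrapper library against the public DBpedia endpoint. The endpoint choice and result handling are our assumptions; the paper only states that SPARQL queries over owl:sameAs, owl:equivalentProperty, and P1628 were used.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def wikidata_ids_for(dbpedia_uri: str) -> list[str]:
    # Assumed public endpoint; any SPARQL endpoint hosting DBpedia would do.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?wd WHERE {{
            <{dbpedia_uri}> owl:sameAs ?wd .
            FILTER(STRSTARTS(STR(?wd), "http://www.wikidata.org/entity/"))
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [r["wd"]["value"] for r in rows]
```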

Baselines. We compare the performance of our system with several state-of-the-art techniques listed as follows.

• DBpedia Spotlight [11, 38] is a popular baseline for Entity Linking in TAC KBP [35, 37, 51] based on DBpedia. We adopt the best configurations suggested in [38], where the annotation confidence is set to 0.6 and the support to 20.
• TagMe [20] is a popular baseline for Entity Linking in TAC KBP datasets [30, 49] and Semantic Evaluation [50] based on DBpedia. We follow all the suggested configurations, and filter the linking results with a confidence lower than 0.3.
• ReMatch [42] is one of the top-performing tools for the Relation Linking task, especially in the question-answering community [16, 52, 58]. We follow all the suggested configurations. Specifically, we set the minimum (resp., maximum) length of combinatorials of the input text to 2 (resp., 3), and the winner threshold to 0.6.
• Falcon [52] performs joint entity and relation linking of short text, leveraging several fundamental principles of morphology and an extended KB which merges entities and relations from DBpedia and Wikidata. We adopt all the default configurations. The maximum length of combinatorials of the input text is set to 3.
• EARL [16] also performs entity linking and relation linking as a joint task, by formulating and solving a Generalized Traveling Salesman Problem among the candidate nodes in the knowledge graph. We adopt all the default configurations suggested by the API of EARL.
• QKBfly [44] is proposed to dynamically conduct information extraction tasks based on ClausIE [13] and produce canonicalized triples linked to a KB in an on-the-fly manner. We obtain the code of QKBfly from the authors, and adopt all the default configurations suggested in the paper. Note that QKBfly does not perform the relation linking task, but only canonicalizes the relation phrases based on PATTY [43], a relational pattern dictionary. Hence, we do not compare against it in our relation linking task.
• KBPearl-PP and KBPearl-NN are our system executing in the pipeline mode and the near-neighbour mode, respectively.

Specifically, some of the tools (DBpedia Spotlight, TagMe, Falcon, and EARL) are designed to process short text. To make them feasible on long-text documents, we follow the setting in TagMe [20] and conduct sentence tokenization on the documents with more than 30 words. One document is divided into a list of sentences to pass to these tools for further processing.

Implementation Details of KBPearl. We follow the Opentapioca project [14] to index 64,126,653 entities and predicates from the Wikidata JSON dump of 2018-07-22 with Solr (Lucene)². We list the implementation details of our system as follows.

• In Side Information Extraction (Step 1.1 in Stage 1), we employ the Washington KnowItAll project for sentence tokenization³, the NLTK Toolkit [34] for part-of-speech (POS) tagging, the Stanford CoreNLP toolkit [36] for noun-phrase chunking and named entity recognition, and SUTIME [9] for the Time Tagger.
• In Triple Extraction (Step 1.2 in Stage 1), we employ four different Open IE tools for performance evaluation, including ReVerb [19], MinIE (in safe mode) [24], ClausIE [13] and the Stanford Open IE tool [2]. Specifically, we choose one of the four tools to execute our system each time. The evaluation details are presented in Section 6.2.
• In Semantic Graph Construction (Step 2.1 in Stage 2), we obtain the aliases of each entity and predicate item from Wikidata by directly querying the values of its aliases property. Moreover, to compute the weight of keyphrases, we select the top 5% of entities with the largest number of sitelinks, and the top 5% of predicates with the largest number of statements, as our samples for calculation. The rest of the keyphrases are assigned the maximum weight in the sample set of keyphrases during the computation.
• For the hyper-parameter training of θ in the greedy algorithm to derive the dense subgraph (Step 2.2 in Stage 2, presented in Section 5.2), we utilize the ground truths labeled for the NYT2018 dataset. For the noun phrases detected as named entity mentions and reported to be non-linkable, we manually label whether they are emerging entities (i.e., entities not contained in Wikidata). We label 127 noun phrases as emerging entities. We train θ based on LBFGS optimization. Specifically, we exclude from the evaluation all the validation data whose ground truths are used to train the hyper-parameters.

²https://lucene.apache.org/solr/
³https://github.com/knowitall

Evaluation Metrics. We use precision (P), recall (R) and F1-score (F) for performance evaluation. Precision is defined as the number of correct linkings provided by the system, divided by the total number of linkings provided by the system. Recall is defined as the number of correct linkings provided by the system, divided by the total number of ground truths. F1-score computes the harmonic mean of precision and recall, i.e., F1 = 2 · P · R / (P + R). Please refer to [48] for a more detailed report and examples of calculation.

Experiment Setting. All the experiments presented are conducted on a server with 128GB RAM and a 2.4GHz CPU, running Ubuntu 18.04.1 LTS.
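For completeness, the three metrics reduce to a few lines of code; this tiny helper merely restates the definitions given under Evaluation Metrics above.

```python
# Precision, recall, and F1 over counts of correct, predicted, and gold linkings.
def prf1(correct: int, predicted: int, gold: int) -> tuple[float, float, float]:
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```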

Table 1: Performance of the entity linking task. Both KBPearl-PP and KBPearl-NN employ MinIE for Open triple extraction. The best and the second-best performance values are in bold.

Dataset (# of words per document)   System       Precision  Recall  F1
ReVerb38 (18.34)                    Falcon       0.498      0.464   0.480
                                    EARL         0.496      0.535   0.515
                                    Spotlight    0.642      0.605   0.623
                                    TagMe        0.512      0.661   0.577
                                    QKBfly       0.724      0.528   0.611
                                    KBPearl-PP   0.643      0.664   0.653
                                    KBPearl-NN   0.643      0.664   0.653
NYT2018 (206.20)                    Falcon       0.275      0.470   0.347
                                    EARL         0.282      0.526   0.367
                                    Spotlight    0.393      0.524   0.449
                                    TagMe        0.114      0.528   0.188
                                    QKBfly       0.410      0.477   0.441
                                    KBPearl-PP   0.422      0.744   0.539
                                    KBPearl-NN   0.491      0.694   0.575
LC-QuAD2.0 (14.13)                  Falcon       0.533      0.598   0.564
                                    EARL         0.403      0.498   0.445
                                    Spotlight    0.585      0.657   0.619
                                    TagMe        0.352      0.864   0.500
                                    QKBfly       0.518      0.479   0.498
                                    KBPearl-PP   0.561      0.647   0.601
                                    KBPearl-NN   0.561      0.647   0.601
QALD-7-Wiki (6.87)                  Falcon       0.708      0.651   0.678
                                    EARL         0.516      0.460   0.486
                                    Spotlight    0.619      0.634   0.626
                                    TagMe        0.349      0.661   0.457
                                    QKBfly       0.592      0.510   0.548
                                    KBPearl-PP   0.647      0.715   0.679
                                    KBPearl-NN   0.647      0.715   0.679
T-REx (116.14)                      Falcon       0.141      0.203   0.167
                                    EARL         0.305      0.505   0.380
                                    Spotlight    0.328      0.472   0.387
                                    TagMe        0.057      0.064   0.060
                                    QKBfly       0.248      0.271   0.259
                                    KBPearl-PP   0.329      0.513   0.401
                                    KBPearl-NN   0.340      0.554   0.421
Knowledge Net (157.22)              Falcon       0.253      0.380   0.304
                                    EARL         0.234      0.388   0.292
                                    Spotlight    0.255      0.382   0.306
                                    TagMe        0.254      0.362   0.299
                                    QKBfly       0.322      0.472   0.382
                                    KBPearl-PP   0.297      0.446   0.356
                                    KBPearl-NN   0.327      0.466   0.384
CC-DBP (18.39)                      Falcon       0.244      0.303   0.270
                                    EARL         0.277      0.605   0.380
                                    Spotlight    0.382      0.604   0.468
                                    TagMe        0.231      0.615   0.336
                                    QKBfly       0.339      0.433   0.380
                                    KBPearl-PP   0.415      0.620   0.497
                                    KBPearl-NN   0.418      0.620   0.499

Table 2: Performance of the relation linking task. Both KBPearl-PP and KBPearl-NN employ MinIE for Open triple extraction. The best and the second-best performance values are in bold.

Dataset (# of words per document)   System       Precision  Recall  F1
ReVerb38 (18.34)                    Falcon       0.137      0.220   0.169
                                    EARL         0.280      0.431   0.339
                                    ReMatch      0.292      0.223   0.253
                                    KBPearl-PP   0.403      0.452   0.426
                                    KBPearl-NN   0.403      0.452   0.426
NYT2018 (206.20)                    Falcon       0.151      0.203   0.173
                                    EARL         0.201      0.245   0.221
                                    ReMatch      0.218      0.297   0.251
                                    KBPearl-PP   0.302      0.405   0.346
                                    KBPearl-NN   0.313      0.494   0.383
LC-QuAD2.0 (14.13)                  Falcon       0.302      0.325   0.313
                                    EARL         0.259      0.251   0.255
                                    ReMatch      0.201      0.214   0.207
                                    KBPearl-PP   0.566      0.479   0.410
                                    KBPearl-NN   0.566      0.479   0.410
QALD-7-Wiki (6.87)                  Falcon       0.337      0.329   0.333
                                    EARL         0.214      0.222   0.218
                                    ReMatch      0.208      0.203   0.205
                                    KBPearl-PP   0.482      0.399   0.437
                                    KBPearl-NN   0.482      0.399   0.437
T-REx (116.14)                      Falcon       0.026      0.072   0.038
                                    EARL         0.126      0.157   0.140
                                    ReMatch      0.125      0.151   0.137
                                    KBPearl-PP   0.136      0.188   0.157
                                    KBPearl-NN   0.139      0.187   0.159

As presented in Table 1 and Table 2, the performance of all the systems on the entity linking task is better than on the relation linking task. The reason is that some of the relations expressed in natural language text have various representations. For example, in the sentence "Barack Obama was the US President", the correct extraction should be (Q76, P39, Q11696), i.e., (Barack Obama, political office held, President of the United States). However, all the systems recognize the more general predicate P31 (instance of) instead of P39 (political office held) from the original sentence.

Performance of KBPearl based on different Open IE tools. We also conduct experiments to evaluate the performance of KBPearl when employing different Open IE tools to extract knowledge from natural language text in Step 1.2 of Stage 1. Specifically, MinIE (in safe mode) [24], ReVerb [19], the Stanford Open IE tool [2] (referred to as Stanford for simplicity), and ClausIE [13] are studied. Note that we only involve Falcon and EARL as baselines for comparison, because they are the only tools that perform both the entity linking task and the relation linking task.

As listed in Table 3, all the variants of KBPearl based on different Open IE tools outperform the baselines on the 4 datasets. KBP-MinIE achieves either the best or the second-best F1 score in all cases, because MinIE provides more compact extractions and therefore yields most of the correct linkings, which leads to high recall. Overall, the experimental results indicate that there is no substantial difference in the performance of KBPearl when employing different Open IE tools to extract knowledge. The reason is that we have conducted noise filtering, redundancy filtering, and canonicalization on the Open triples in Step 1.3, which reduces the redundant triples.
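As a reference point for how the numbers in Tables 1 to 3 can be read, here is a minimal sketch of micro-averaged precision, recall, and F1 over (mention, KB item) link pairs. The function name `prf1`, the exact-match scoring, and the illustrative QIDs are our assumptions; the paper's own evaluation protocol may differ in detail.

```python
def prf1(gold, pred):
    """Micro precision/recall/F1 over sets of (mention, kb_id) links,
    assuming exact-match scoring."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # true positives
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example mirroring the "wife of Obama" discussion: the spurious entity
# link for "wife" lowers precision without hurting recall.
gold = {("Obama", "Q76"), ("Michelle", "Q13133"), ("wife", "P26")}
pred = {("Obama", "Q76"), ("wife", "P26"), ("wife", "Q188830")}
print(prf1(gold, pred))  # approximately (0.667, 0.667, 0.667)
```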

Table 3: Performance of joint entity and relation linking of KBPearl-NN based on different Open IE tools employed in Step 1.2, Stage 1. "#Extr." denotes the total number of triples after the noise filtering and canonicalization in Step 1.3, Stage 1. The precision (P) and recall (R) are the average values of those for the entity linking task and the relation linking task. The best and the second-best performance values are in bold.

              QALD-7-Wiki                    T-REx                          ReVerb38                      NYT2018
Systems       #Extr.  P      R      F1      #Extr.   P      R      F1      #Extr.  P      R      F1      #Extr.  P      R      F1
KBP-MinIE     200     0.565  0.557  0.561   373,553  0.239  0.370  0.291   183     0.523  0.558  0.540   402     0.402  0.594  0.479
KBP-ReVerb    59      0.631  0.481  0.546   100,063  0.353  0.243  0.288   67      0.606  0.448  0.515   158     0.516  0.330  0.403
KBP-Stanford  189     0.560  0.539  0.549   306,659  0.251  0.330  0.285   146     0.517  0.545  0.531   350     0.481  0.459  0.470
KBP-ClausIE   175     0.554  0.549  0.551   299,582  0.287  0.395  0.332   132     0.503  0.563  0.531   325     0.502  0.443  0.471
Falcon        N.A.    0.523  0.490  0.506   N.A.     0.084  0.138  0.104   N.A.    0.318  0.342  0.329   N.A.    0.163  0.337  0.220
EARL          N.A.    0.365  0.341  0.353   N.A.     0.216  0.331  0.261   N.A.    0.388  0.483  0.430   N.A.    0.192  0.385  0.256
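Since the stability across Open IE tools is attributed to the Step 1.3 filtering, here is a minimal sketch of the kind of noise and redundancy filtering involved. The normalization rules shown (lowercasing, stripping punctuation and articles) are our illustrative assumptions, not the paper's exact canonicalization procedure.

```python
import string

def canonical_key(triple):
    """Crude canonical form: lowercase, strip punctuation and articles."""
    def norm(phrase):
        words = phrase.lower().translate(
            str.maketrans("", "", string.punctuation)).split()
        return " ".join(w for w in words if w not in {"a", "an", "the"})
    return tuple(norm(part) for part in triple)

def dedup_triples(triples, min_len=1):
    """Drop noisy (empty-argument) triples and redundant variants that
    share the same canonical form."""
    seen, kept = set(), []
    for t in triples:
        key = canonical_key(t)
        if any(len(part) < min_len for part in key):  # noise filter
            continue
        if key in seen:                               # redundancy filter
            continue
        seen.add(key)
        kept.append(t)
    return kept

triples = [("Barack Obama", "was", "the US President"),
           ("barack obama", "was", "US President"),  # redundant variant
           ("He", "was", "")]                        # noisy triple
print(dedup_triples(triples))  # [('Barack Obama', 'was', 'the US President')]
```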

[Figure 5: Performance Evaluation. (a) F1-measure of the KBPearl variants (KBPearl-Sim, KBPearl-PP, KBPearl-NN) on the T-REx, ReVerb38, NYT2018, LC-QuAD2.0, and QALD-7-Wiki datasets; (b) F1-measure of the variants with different # of words; (c) F1-measure of the variants with different # of NPs and RPs; (d) precision, recall, and F-measure of KBPearl-PP, KBPearl-NN, and QKBfly on the discovery of new entities and predicates.]

[Figure 6: Efficiency evaluation regarding the length of the documents and different values of parameter k of the KBPearl variants. (a) Computation time (s) vs. # of words for KBPearl-Sim, KBPearl-PP, and KBPearl-NN; (b) computation time (s) vs. k for KBPearl-PP and KBPearl-NN.]

[Figure 7: Parameter sensitivity in KBPearl-Pipeline (KBPearl-PP) and KBPearl-Near-Neighbour (KBPearl-NN). Precision, recall, and F1 vs. k for (a) KBPearl-PP and (b) KBPearl-NN.]

Performance of Different Variants of KBPearl. We also evaluate the performance of KBPearl in pipeline mode (KBPearl-PP), in near-neighbour mode (KBPearl-NN), and the variant that directly constructs the whole semantic graph and conducts graph densification without filtering sparse coherence (KBPearl-Sim). Figure 5(a) illustrates the F-measure of the three variants on the joint entity and relation linking task, on T-REx, ReVerb38 (ReV), NYT2018 (NYT), LC-QuAD2.0 (LQ), and QALD-7-Wiki (QA). We also present the detailed performance of KBPearl-PP and KBPearl-NN in Table 1 and Table 2. For short-text datasets, such as LC-QuAD2.0, QALD-7-Wiki, and ReVerb38, there is very little difference among the variants of KBPearl. The reason is that the numbers of candidate entities and predicates are always limited in short text, and the number of sentences is also limited; therefore, bounding the number of neighbours or the number of sentence groups makes little difference. On the other hand, for long-text datasets such as T-REx and NYT2018, KBPearl-PP and KBPearl-NN perform better than KBPearl-Sim. This proves the capability of our system to handle the sparse coherence (illustrated in Figure 4) which is common in real-world scenarios for most long-text documents.

We further investigate the performance of the different variants of KBPearl in terms of the length of the document. Specifically, the length of a document is measured by the number of words (Figure 5(b)) and by the number of noun phrases and relation phrases in the canonicalized Open triples (Figure 5(c)). The performance of KBPearl here is based on MinIE. KBPearl-PP and KBPearl-NN achieve satisfactory performance on both short-text and long-text documents, while the F1-measure of KBPearl-Sim decreases on long documents. This also demonstrates the effectiveness of the pipeline mode and the near-neighbour mode of KBPearl. Moreover, Figure 6(a) presents the efficiency evaluation of KBPearl regarding the length of the documents. We can observe that beyond 75 words, the time cost of KBPearl-Sim increases in an exponential manner, while the time cost of both KBPearl-PP and KBPearl-NN remains linear in the number of words.
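Under our reading of the two modes, the linear scaling comes from bounding the context each linking decision sees. The following sketch is an illustrative assumption about how such bounding might look (the function names and parameters are hypothetical, not KBPearl's implementation): pipeline mode processes fixed-size sentence groups, and near-neighbour mode restricts disambiguation context to the k closest phrases.

```python
def pipeline_windows(sentences, k=4):
    """Pipeline mode (sketch): process consecutive groups of k sentences,
    so each semantic graph stays small and total cost grows linearly."""
    return [sentences[i:i + k] for i in range(0, len(sentences), k)]

def near_neighbours(phrases, idx, k=7):
    """Near-neighbour mode (sketch): for the phrase at position idx, keep
    only its k nearest phrases as disambiguation context."""
    ordered = sorted(range(len(phrases)), key=lambda j: abs(j - idx))
    return [phrases[j] for j in ordered[1:k + 1]]

doc = [f"s{i}" for i in range(1, 10)]
print(pipeline_windows(doc, k=4))        # [['s1'..'s4'], ['s5'..'s8'], ['s9']]
print(near_neighbours(doc, idx=4, k=3))  # ['s4', 's6', 's3']
```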

Parameter Sensitivity in Different Variants of KBPearl. We first study the time cost of KBPearl-PP and KBPearl-NN based on different k. As shown in Figure 6(b), the time cost of KBPearl-NN is linear in k when k is smaller than 8, while the time cost of KBPearl-PP is linear in k when k is smaller than 5. We also investigate the performance of both variants based on different k. As demonstrated in Figure 7(a), the precision of KBPearl-PP is affected once k grows larger than 4. Moreover, Figure 7(b) indicates that the precision of KBPearl-NN drops once k grows larger than 8. This suggests that a wise selection of k is around 3 to 4 for KBPearl-PP, and 6 to 7 for KBPearl-NN. The result indicates that in most cases, the information in 3 to 4 sentences of a document is valuable enough for knowledge extraction and linking with global coherence. On the other hand, when conducting entity disambiguation, the knowledge of the 6 to 7 nearest noun phrases can give valuable support.
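The parameter k governs the greedy dense-subgraph step (Step 2.2). For intuition, here is a minimal sketch of classic greedy peeling for the densest-subgraph problem on an unweighted graph: repeatedly remove the minimum-degree node and keep the intermediate subgraph with the highest average degree. KBPearl's actual objective is weighted and keyphrase-aware, so this is only a simplified stand-in, not the paper's algorithm.

```python
import heapq
from collections import defaultdict

def densest_subgraph(edges):
    """Greedy peeling: return (node set, density) maximizing |E|/|V|
    among the intermediate subgraphs. Assumes undirected edges with
    no duplicates or self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes, m = set(adj), len(edges)
    heap = [(len(adj[u]), u) for u in nodes]
    heapq.heapify(heap)
    best, best_density = set(nodes), m / max(len(nodes), 1)
    removed = set()
    while nodes:
        d, u = heapq.heappop(heap)
        if u in removed or d != len(adj[u]):
            continue                      # stale heap entry
        m -= len(adj[u])                  # peeling u removes its edges
        for v in adj[u]:
            adj[v].discard(u)
            heapq.heappush(heap, (len(adj[v]), v))
        removed.add(u)
        nodes.discard(u)
        if nodes and m / len(nodes) > best_density:
            best, best_density = set(nodes), m / len(nodes)
    return best, best_density

# A 4-clique with a pendant node: peeling the pendant raises the density.
edges = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(densest_subgraph(edges))  # ({'a', 'b', 'c', 'd'}, 1.5)
```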
Results of Detecting New Entities. As demonstrated in Figure 5(d), we evaluate the performance of KBPearl-PP, KBPearl-NN, and QKBfly on new entity discovery on NYT2018. We do not involve other baselines for comparison, since they do not create new entities for non-linkable facts. Both variants of KBPearl achieve better performance than QKBfly. The reason that all systems have relatively low precision and high recall is that it is not trivial for them to recognize entities containing long phrases. For example, in a sentence containing "Girl from the North Country" (a song), both QKBfly and our system will extract Girl and the North Country separately and report them as new entities, which introduces false positives.

Summary. In conclusion, KBPearl performs better on the joint entity and relation linking task than the state-of-the-art techniques, especially on long-text datasets. KBPearl relies on Open IE tools for knowledge extraction, but with the side-information preparation, the performance of KBPearl is not strongly affected by the output quality of the Open IE tools. Moreover, the two variants of KBPearl enhance the effectiveness and efficiency of our method on long-text documents. KBPearl is also capable of discovering new entities.

7. RELATED WORKS

Knowledge Base Population (KBP). Recently, a lot of research in the text analysis community focuses on KBP from external data sources [25, 29, 30, 46]. There are two major tasks: (1) Entity Discovery and Linking, which extracts the entities from unstructured text and links them to an existing KB, and (2) Slot Filling, i.e., Relation Discovery and Linking, which mines information about the relations and types of the detected entities. However, most of the techniques proposed for KBP treat them as separate tasks [25, 29], and therefore the global coherence among the candidate entities and predicates cannot be utilized. Moreover, declarative approaches to IE and KBP, such as DeepDive [12, 45, 57] and SystemT [10], rely on predefined sets of predicates and rules. Hence, these techniques are not suitable for in-time KBP. QKBfly [44] is proposed as a solution for dynamic KB construction and KBP. Nevertheless, it also relies on a pre-generated dictionary of relational patterns to generate canonicalized facts.

Entity Linking and Relation Linking. Most of the Entity Linking (EL) works focus on three tasks: candidate entity generation [27, 55, 66, 67], candidate entity ranking [33, 54], and non-linkable mention prediction [55, 56]. However, all these works only assign the non-linkable mentions to a NIL entity, without further clustering and processing of the NIL entities. Furthermore, linking relation phrases to predicates in KBs is a relatively new field of study. ReMatch [42] is the first attempt in this direction, which measures the semantic similarity between the relation phrases and the predicates in KBs based on graph distances in WordNet [40]. Moreover, SIBKB [58], AskNow [17], and QKBfly [44] utilize large dictionary resources of textual patterns that denote binary relations between entities (e.g., PATTY [43]) to discover synonymous relation phrases in unstructured text and assist relation linking. Particularly, EARL [16] and Falcon [52] are the first two works that perform both the entity linking task and the relation linking task. Specifically, EARL exploits the connection density between entity nodes in KBs, while Falcon leverages the fundamental principles of English morphology and an extended self-integrated KB, without considering the coherence among the entities and predicates. However, neither of them processes the non-linkable noun phrases.

Open IE. The Open IE methods [13, 47, 53] overcome the limitations of the KBP methods above, but face the issue that neither entities nor predicates in the Open triples are canonicalized. Some recent research works focus on the canonicalization of Open triples by clustering the triples with the same semantic meaning. Galárraga [21] performs clustering on noun phrases over manually-defined feature spaces to obtain equivalent entities, and clusters relation phrases based on rules discovered by AMIE [22]. FAC [64] solves Galárraga's problem with a more efficient graph-based clustering method with pruning and bounding techniques. Furthermore, CESI [61] and SIST [32] conduct the canonicalization of entities and relations jointly, based on external and internal side knowledge. However, all these works only improve the output of Open IE techniques. None of them investigates how to map and link the canonicalized facts to an existing KB for enhancement and population.

8. CONCLUSION

In this paper, we present an end-to-end system, KBPearl, for KB population. KBPearl takes an incomplete KB and a large corpus of text as input, to (1) organize the noisy extraction from Open IE into canonicalized facts; and (2) populate the KB by joint entity and relation linking, utilizing the facts and the side information extracted from the source documents. Specifically, we employ a semantic graph-based approach to capture the knowledge in the source document with global coherence, and to determine the best linking results by finding the densest subgraph effectively and efficiently. Our system is also able to locate the new entities mentioned in the source document. We demonstrate the effectiveness and efficiency of KBPearl against the state-of-the-art techniques, through extensive experiments on real-world datasets.

Acknowledgments

We acknowledge all our lovely teammates and reviewers for their valuable advice on this paper. This work is partially supported by the Hong Kong RGC GRF Project 16202218, CRF Project C6030-18G, AOE Project AoE/E-603/18, the National Science Foundation of China (NSFC) under Grant No. 61729201, the Science and Technology Planning Project of Guangdong Province, China, No. 2015B010110006, Hong Kong ITC grants ITS/044/18FX and ITS/470/18FX, the Didi-HKUST Joint Research Lab Grant, the Microsoft Research Asia Collaborative Research Grant, the WeChat Research Grant, the WeBank Research Grant, and a Huawei PhD Fellowship.

9. REFERENCES

[1] R. Andersen and K. Chellapilla. Finding dense subgraphs with size bounds. In International Workshop on Algorithms and Models for the Web-Graph, pages 25–37. Springer, 2009.
[2] G. Angeli, M. J. J. Premkumar, and C. D. Manning. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354, 2015.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. 2007.
[4] O. D. Balalau, F. Bonchi, T. Chan, F. Gullo, and M. Sozio. Finding subgraphs with maximum total density and limited overlap. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 379–388. ACM, 2015.
[5] D. Bamman, T. Underwood, and N. A. Smith. A bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379, 2014.
[6] K. Bernhard and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer, third edition, 2008.
[7] R. Blanco, B. B. Cambazoglu, P. Mika, and N. Torzec. Entity recommendations in web search. In International Semantic Web Conference, pages 33–48. Springer, 2013.
[8] J. Callan, M. Hoy, C. Yoo, and L. Zhao. Clueweb09 data set, 2009.
[9] A. X. Chang and C. D. Manning. SUTime: A library for recognizing and normalizing time expressions. In LREC, volume 2012, pages 3735–3740, 2012.
[10] L. Chiticariu, R. Krishnamurthy, Y. Li, S. Raghavan, F. R. Reiss, and S. Vaithyanathan. SystemT: an algebraic approach to declarative information extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 128–137. Association for Computational Linguistics, 2010.
[11] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems, pages 121–124. ACM, 2013.
[12] C. De Sa, A. Ratner, C. Ré, J. Shin, F. Wang, S. Wu, and C. Zhang. DeepDive: Declarative knowledge base construction. ACM SIGMOD Record, 45(1):60–67, 2016.
[13] L. Del Corro and R. Gemulla. ClausIE: clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web, pages 355–366. ACM, 2013.
[14] A. Delpeuch. OpenTapioca: Lightweight entity linking for Wikidata. arXiv preprint arXiv:1904.09131, 2019.
[15] M. Dubey. Test set for LC-QuAD 2.0, July 2019.
[16] M. Dubey, D. Banerjee, D. Chaudhuri, and J. Lehmann. EARL: Joint entity and relation linking for question answering over knowledge graphs. In International Semantic Web Conference, pages 108–126. Springer, 2018.
[17] M. Dubey, S. Dasgupta, A. Sharma, K. Höffner, and J. Lehmann. AskNow: A framework for natural language query formalization in SPARQL. In European Semantic Web Conference, pages 300–316. Springer, 2016.
[18] H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl. T-REx: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[19] A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1535–1545. Association for Computational Linguistics, 2011.
[20] P. Ferragina and U. Scaiella. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 1625–1628. ACM, 2010.
[21] L. Galárraga, G. Heitz, K. Murphy, and F. M. Suchanek. Canonicalizing open knowledge bases. In CIKM, pages 1679–1688, 2014.
[22] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In WWW, pages 413–422, 2013.
[23] E. Galbrun, A. Gionis, and N. Tatti. Top-k overlapping densest subgraphs. Data Mining and Knowledge Discovery, 30(5):1134–1165, 2016.
[24] K. Gashteovski, R. Gemulla, and L. Del Corro. MinIE: minimizing facts in open information extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2630–2640, 2017.
[25] J. Getman, J. Ellis, S. Strassel, Z. Song, and J. Tracey. Laying the groundwork for knowledge base population: Nine years of linguistic resources for TAC KBP. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[26] M. Glass and A. Gliozzo. A dataset for web-scale knowledge base population. In European Semantic Web Conference, pages 256–271. Springer, 2018.
[27] S. Guo, M.-W. Chang, and E. Kiciman. To link or not to link? A study on end-to-end tweet entity linking. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1020–1030, 2013.
[28] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics, 2011.
[29] H. Ji and R. Grishman. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1148–1158. Association for Computational Linguistics, 2011.
[30] H. Ji, J. Nothman, B. Hachey, et al. Overview of TAC-KBP2014 entity discovery and linking tasks. In Proc. Text Analysis Conference (TAC2014), pages 1333–1339, 2014.
[31] D. S. Johnson and M. R. Garey. Computers and Intractability: A Guide to the Theory of NP-Completeness, volume 1. WH Freeman, San Francisco, 1979.
[32] X. Lin and L. Chen. Canonicalization of open knowledge bases with side information from the source text. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 950–961. IEEE, 2019.
[33] X. Liu, Y. Li, H. Wu, M. Zhou, F. Wei, and Y. Lu. Entity linking for tweets. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1304–1311, 2013.
[34] E. Loper and S. Bird. NLTK: the natural language toolkit. arXiv preprint cs/0205028, 2002.
[35] W. Lu, Y. Zhou, H. Lu, P. Ma, Z. Zhang, and B. Wei. Boosting collective entity linking via type-guided semantic embedding. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 541–553. Springer, 2017.
[36] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, 2014.
[37] P. N. Mendes, J. Daiber, M. Jakob, and C. Bizer. Evaluating DBpedia Spotlight for the TAC-KBP entity linking task. In Proceedings of the TAC-KBP 2011 Workshop, pages 118–120, 2011.
[38] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM, 2011.
[39] F. Mesquita, M. Cannaviccio, J. Schmidek, P. Mirza, and D. Barbosa. KnowledgeNet: A benchmark dataset for knowledge base population. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 749–758, 2019.
[40] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
[41] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, 2014.
[42] I. O. Mulang, K. Singh, and F. Orlandi. Matching natural language relations to knowledge graph properties for question answering. In Proceedings of the 13th International Conference on Semantic Systems, pages 89–96. ACM, 2017.
[43] N. Nakashole, G. Weikum, and F. Suchanek. PATTY: a taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1135–1145. Association for Computational Linguistics, 2012.
[44] D. B. Nguyen, A. Abujabal, N. K. Tran, M. Theobald, and G. Weikum. Query-driven on-the-fly knowledge base construction. PVLDB, 11(1):66–79, 2017.
[45] F. Niu, C. Zhang, C. Ré, and J. W. Shavlik. DeepDive: Web-scale knowledge-base construction using statistical learning and inference. VLDS, 12:25–28, 2012.
[46] H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.
[47] M. Ponza, L. Del Corro, and G. Weikum. Facts that matter. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1043–1048, 2018.
[48] D. M. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[49] J. R. Raiman and O. M. Raiman. DeepType: multilingual entity linking by neural type system evolution. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[50] H. Rosales-Méndez, B. Poblete, and A. Hogan. Multilingual entity linking: Comparing English and Spanish. In LD4IE@ISWC, pages 62–73, 2017.
[51] M. Rospocher and F. Corcoglioniti. Joint posterior revision of NLP annotations via ontological knowledge. In IJCAI, pages 4316–4322, 2018.
[52] A. Sakor, I. O. Mulang, K. Singh, S. Shekarpour, M. E. Vidal, J. Lehmann, and S. Auer. Old is gold: linguistic driven approach for entity and relation linking of short text. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2336–2346, 2019.
[53] M. Schmitz, R. Bart, S. Soderland, O. Etzioni, et al. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534. Association for Computational Linguistics, 2012.
[54] W. Shen, J. Wang, P. Luo, and M. Wang. LIEGE: link entities in web lists with knowledge base. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1424–1432. ACM, 2012.
[55] W. Shen, J. Wang, P. Luo, and M. Wang. LINDEN: linking named entities with knowledge base via semantic knowledge. In Proceedings of the 21st International Conference on World Wide Web, pages 449–458. ACM, 2012.
[56] W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 68–76. ACM, 2013.
[57] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. PVLDB, 8(11):1310–1321, 2015.
[58] K. Singh, A. S. Radhakrishna, A. Both, S. Shekarpour, I. Lytra, R. Usbeck, A. Vyas, A. Khikmatullaev, D. Punjani, C. Lange, et al. Why reinvent the wheel: Let's build question answering systems together. In WWW, pages 1247–1256, 2018.
[59] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 939–948. ACM, 2010.
[60] R. Usbeck, A.-C. N. Ngomo, B. Haarmann, A. Krithara, M. Röder, and G. Napolitano. 7th open challenge on question answering over linked data (QALD-7). In Semantic Web Evaluation Challenge, pages 59–69. Springer, 2017.
[61] S. Vashishth, P. Jain, and P. Talukdar. CESI: Canonicalizing open knowledge bases using embeddings and side information. In WWW, pages 1317–1327, 2018.
[62] M. Vijaymeena and K. Kavitha. A survey on similarity measures in text mining. Machine Learning and Applications: An International Journal, 3(2):19–28, 2016.
[63] D. Vrandečić. Wikidata: A new platform for collaborative data collection. In WWW, pages 1063–1064, 2012.
[64] T.-H. Wu, Z. Wu, B. Kao, and P. Yin. Towards practical open knowledge base canonicalization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 883–892. ACM, 2018.
[65] J. Xie, S. Kelley, and B. K. Szymanski. Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys (CSUR), 45(4):43, 2013.
[66] W. Zhang, Y.-C. Sim, J. Su, and C.-L. Tan. Entity linking with effective acronym expansion, instance selection and topic modeling. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[67] W. Zhang, J. Su, C. L. Tan, and W. T. Wang. Entity linking leveraging: automatically generated annotation. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1290–1298. Association for Computational Linguistics, 2010.
[68] Y. Zhang, H. Dai, Z. Kozareva, A. J. Smola, and L. Song. Variational reasoning for question answering with knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
