<<

Neural Relation Extraction for Knowledge Base Enrichment

Bayu Distiawan Trisedya1, Gerhard Weikum2, Jianzhong Qi1, Rui Zhang1∗ 1 The University of Melbourne, Australia 2 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany {btrisedya@student, jianzhong.qi@, rui.zhang@}unimelb.edu.au [email protected]

Abstract Input sentence: "New York University is a private university in Manhattan." We study relation extraction for knowledge Unsupervised approach output: base (KB) enrichment. Specifically, we aim hNYU, is, private universityi to extract entities and their relationships from hNYU, is private university in, Manhattani sentences in the form of triples and map the Supervised approach output: elements of the extracted triples to an existing hNYU, instance of, Private Universityi hNYU, located in, Manhattani KB in an end-to-end manner. Previous stud- Canonicalized output: ies focus on the extraction itself and rely on hQ49210, P31, Q902104i Disambiguation (NED) to map hQ49210, P131, Q11299i triples into the KB space. This way, NED er- rors may cause extraction errors that affect the Table 1: Relation extraction example. overall precision and recall. To address this problem, we propose an end-to-end relation Previous studies work on embedding-based extraction model for KB enrichment based on model (Nguyen et al., 2018; Wang et al., 2015) a neural encoder-decoder model. We collect high-quality training data by distant supervi- and entity alignment model (Chen et al., 2017; sion with co-reference resolution and para- Sun et al., 2017; Trisedya et al., 2019) to en- phrase detection. We propose an n-gram based rich a knowledge base. Following the success of attention model that captures multi-word en- the sequence-to-sequence architecture (Bahdanau tity names in a sentence. Our model employs et al., 2015) for generating sentences from struc- jointly learned word and entity embeddings to tured data (Marcheggiani and Perez-Beltrachini, support named entity disambiguation. Finally, 2018; Trisedya et al., 2018), we employ this ar- our model uses a modified beam search and a triple classifier to help generate high-quality chitecture to do the opposite, which is extracting triples. Our model outperforms state-of-the- triples from a sentence. art baselines by 15.51% and 8.38% in terms of In this paper, we study how to enrich a KB by F1 score on two real-world datasets. relation exaction from textual sources. Specif- ically, we aim to extract triples in the form of 1 Introduction hh, r, ti, where h is a head entity, t is a tail en- tity, and r is a relationship between the enti- Knowledge bases (KBs), often in the form of ties. Importantly, as KBs typically have much knowledge graphs (KGs), have become essential better coverage on entities than on relationships, resources in many tasks including Q&A systems, we assume that h and t are existing entities in recommender system, and natural language gener- a KB, r is a predicate that falls in a prede- ation. Large KBs such as DBpedia (Auer et al., fined set of predicates we are interested in, but ¨ 2007), (Vrandecic and Krotzsch, 2014) the relationship hh, r, ti does not exist in the KB and (Suchanek et al., 2007) contain millions yet. We aim to find more relationships between of facts about entities, which are represented in the h and t and add them to the KB. For exam- form of subject-predicate-object triples. However, ple, from the first extracted triples in Table1 we these KBs are far from complete and mandate con- may recognize two entities "NYU" (abbreviation tinuous enrichment and curation. of New York University) and "Private ∗Rui Zhang is the corresponding author. University", which already exist in the KB; also the predicate "instance of" is in the added to the KB. set of predefined predicates we are interested in, We aim to integrate the extraction and the but the relationship of hNYU, instance of, canonicalization tasks by proposing an end- Private Universityi does not exist in the to-end neural learning model to jointly extract KB. We aim to add this relationship to our KB. triples from sentences and map them into This is the typical situation for KB enrichment an existing KB. Our method is based on the (as opposed to constructing a KB from scratch or encoder-decoder framework (Cho et al., 2014) performing relation extraction for other purposes, by treating the task as a translation of a sentence such as Q&A or summarization). into a sequence of elements of triples. For the KB enrichment mandates that the entities and example in Table1, our model aims to translate relationships of the extracted triples are canonical- "New York University is a private ized by mapping them to their proper entity and university in Manhattan" into a se- predicate IDs in a KB. Table1 illustrates an ex- quence of IDs "Q49210 P31 Q902104 ample of triples extracted from a sentence. The Q49210 P131 Q11299", from which we can entities and predicate of the first extracted triple, derive two triples to be added to the KB. including NYU, instance of, and Private A standard encoder-decoder model with atten- University, are mapped to their unique IDs tion (Bahdanau et al., 2015) is, however, unable Q49210, P31, and Q902104, respectively, to to capture the multi-word entity names and ver- comply with the semantic space of the KB. bal or noun phrases that denote predicates. To Previous studies on relation extraction have address this problem, we propose a novel form employed both unsupervised and supervised ap- of n-gram based attention that computes the n- proaches. Unsupervised approaches typically start gram combination of attention weight to capture with a small set of manually defined extraction the verbal or noun phrase context that comple- patterns to detect entity names and phrases about ments the word level attention of the standard at- relationships in an input text. This paradigm is tention model. Our model thus can better cap- known as Open Information Extraction (Open IE) ture the multi-word context of entities and rela- (Banko et al., 2007; Corro and Gemulla, 2013; tionships. Our model harnesses pre-trained word Gashteovski et al., 2017). In this line of ap- and entity embeddings that are jointly learned with proaches, both entities and predicates are captured skip gram (Mikolov et al., 2013) and TransE (Bor- in their surface forms without canonicalization. des et al., 2013). The advantages of our jointly Supervised approaches train statistical and neural learned embeddings are twofold. First, the em- models for inferring the relationship between two beddings capture the relationship between words known entities in a sentence (Mintz et al., 2009; and entities, which is essential for named entity Riedel et al., 2010, 2013; Zeng et al., 2015; Lin disambiguation. Second, the entity embeddings et al., 2016). Most of these studies employ a pre- preserve the relationships between entities, which processing step to recognize the entities. Only help to build a highly accurate classifier to filter few studies have fully integrated the mapping of the invalid extracted triples. To cope with the lack extracted triples onto uniquely identified KB en- of fully labeled training data, we adapt distant su- tities by using logical reasoning on the existing pervision to generate aligned pairs of sentence and KB to disambiguate the extracted entities (e.g., triple as the training data. We augment the process (Suchanek et al., 2009; Sa et al., 2017)). with co-reference resolution (Clark and Manning, 2016) and dictionary-based paraphrase detection Most existing methods thus entail the need for (Ganitkevitch et al., 2013; Grycner and Weikum, Named Entity Disambiguation (NED) (cf. the sur- 2016). The co-reference resolution helps extract vey by Shen et al.(2015)) as a separate process- sentences with implicit entity names, which en- ing step. In addition, the mapping of relationship larges the set of candidate sentences to be aligned phrases onto KB predicates necessitates another with existing triples in a KB. The paraphrase de- mapping step, typically aided by paraphrase dic- tection helps filter sentences that do not express tionaries. This two-stage architecture is inherently any relationships between entities. prone to error propagation across its two stages: NED errors may cause extraction errors (and vice The main contributions of this paper are: versa) that lead to inaccurate relationships being • We propose an end-to-end model for extract- ing and canonicalizing triples to enrich a KB. related to ours is Neural Open IE (Cui et al., 2018), The model reduces error propagation between which proposed an encoder-decoder with attention relation extraction and NED, which existing model to extract triples. However, this work is not approaches are prone to. geared for extracting relations of canonicalized en- • We propose an n-gram based attention model tities. Another line of studies use neural learning to effectively map the multi-word mentions of for (He et al., 2018), but the entities and their relationships into uniquely goal here is to recognize the predicate-argument identified entities and predicates. We propose structure of a single input sentence – as opposed joint learning of word and entity embeddings to extracting relations from a corpus. to capture the relationship between words and All of these methods generate triples where the entities for named entity disambiguation. We head and tail entities and the predicate stay in further propose a modified beam search and a their surface forms. Therefore, different names triple classifier to generate high-quality triples. and phrases for the same entities result in multiple • We evaluate the proposed model over two triples, which would pollute the KG if added this real-world datasets. We adapt distant super- way. The only means to map triples to uniquely vision with co-reference resolution and para- identified entities in a KG is by post-processing phrase detection to obtain high-quality training via (NED) methods (Shen et al., data. The experimental results show that our 2015) or by clustering with subsequent mapping model consistently outperforms a strong base- (Galarraga´ et al., 2014). line for neural relation extraction (Lin et al., 2016) coupled with state-of-the-art NED mod- 2.2 Entity-aware Relation Extraction els (Hoffart et al., 2011; Kolitsas et al., 2018). Inspired by the work of Brin(1998), state-of-the- 2 Related Work art methods employ distant supervision by lever- aging seed facts from an existing KG (Mintz et al., 2.1 Open Information Extraction 2009; Suchanek et al., 2009; Carlson et al., 2010). Banko et al.(2007) introduced the paradigm of These methods learn extraction patterns from seed Open Information Extraction (Open IE) and pro- facts, apply the patterns to extract new fact candi- posed a pipeline that consists of three stages: dates, iterate this principle, and finally use statis- learner, extractor, and assessor. The learner uses tical (e.g., a classifier) for reducing the dependency- information to learn patterns false positive rate. Some of these methods hinge for extraction, in an unsupervised way. The ex- on the assumption that the co-occurrence of a seed tractor generates candidate triples by identifying fact’s entities in the same sentence is an indicator noun phrases as arguments and connecting phrases of expressing a semantic relationship between the as predicates. The assessor assigns a probability to entities. This is a potential source of wrong la- each candidate triple based on statistical evidence. beling. Follow-up studies (Hoffmann et al., 2010; This approach was prone to extracting incorrect, Riedel et al., 2010, 2013; Surdeanu et al., 2012) verbose and uninformative triples. Various follow- overcome this limitation by various means, includ- up studies (Fader et al., 2011; Mausam et al., ing the use of relation-specific lexicons and latent 2012; Angeli et al., 2015; Mausam, 2016) im- factor models. Still, these methods treat entities by proved the accuracy of Open IE, by adding hand- their surface forms and disregard their mapping to crafted patterns or by using distant supervision. existing entities in the KG. Corro and Gemulla(2013) developed ClausIE, a Suchanek et al.(2009) and Sa et al.(2017) used method that analyzes the clauses in a sentence and probabilistic-logical inference to eliminate false derives triples from this structure. Gashteovski positives, based on constraint solving or Monte et al.(2017) developed MinIE to advance ClausIE Carlo sampling over probabilistic graphical mod- by making the resulting triples more concise. els, respectively. These methods integrate entity Stanovsky et al.(2018) proposed a supervised linking (i.e., NED) into their models. However, learner for Open IE by casting relation extrac- both have high computational complexity and rely tion into sequence tagging. A bi-LSTM model on modeling constraints and appropriate priors. is trained to predict the label (entity, predicate, or Recent studies employ neural networks to learn other) of each token of the input. The work most the extraction of triples. Nguyen and Grish- Dataset Collection Neural Relation Extraction Embedding Module Module Module

Joint learning Expected output: skip-gram & ; article TransE

Triple classifier Wikidata Word Entity Embeddings Embeddings Decoder 0.2 0.4 0.1 0.2 0.1 0.1 0.5 0.1 0.4 0.2

0.1 0.5 0.1 0.1 0.2 0.1 0.5 0.1 0.5 0.1 N-gram-based attention Distant 0.2 0.3 0.3 0.3 0.2 0.2 0.3 0.3 0.3 0.3 supervision 0.4 0.2 0.2 0.1 0.1 0.2 0.3 0.3 0.3 0.1 Encoder

Sentence input: New York University is a private university in Manhattan.

Sentence-Triple pairs Sentence input: New York University is a private university in Manhattan. Expected output: Q49210 P31 Q902104 Q49210 P131 Q11299

Sentence input: New York Times Building is a skyscraper in Manhattan Expected output: Q192680 P131 Q11299 ...

Figure 1: Overview of our proposed solution. man(2015) proposed Convolution Networks with hw1, w2, ..., wii as the input, where wi is a token multi-sized window kernel. Zeng et al.(2015) at position i in the sentence. We aim to extract a proposed Piecewise Convolution Neural Networks set of triples O = {o1, o2, ..., oj} from the sen- (PCNN). Lin et al.(2016, 2017) improved this ap- tence, where oj = hhj, rj, tji, hj, tj ∈ E, and proach by proposing PCNN with sentence-level rj ∈ R. Table1 illustrates the input and target attention. This method performed best in exper- output of our problem. imental studies; hence we choose it as the main baseline against which we compare our approach. 3.1 Solution Framework Follow-up studies considered further variations: Figure1 illustrates the overall solution frame- Zhou et al.(2018) proposed hierarchical attention, work. Our framework consists of three compo- Ji et al.(2017) incorporated entity descriptions, nents: data collection module, embedding mod- Miwa and Bansal(2016) incorporated syntactic ule, and neural relation extraction module. features, and Sorokin and Gurevych(2017) used background knowledge for contextualization. In the data collection module (detailed in Sec- tion 3.2), we align known triples in an existing KB None of these neural models is geared for KG with sentences that contain such triples from a text enrichment, as the canonicalization of entities is corpus. The aligned pairs of sentences and triples out of their scope. will later be used as the training data in our neural 3 Proposed Model relation extraction module. This alignment is done by distant supervision. To obtain a large number We start with the problem definition. Let G = of high-quality alignments, we augment the pro- (E,R) be an existing KG where E and R are cess with a co-reference resolution to extract sen- the sets of entities and relationships (predicates) tences with implicit entity names, which enlarges in G, respectively. We consider a sentence S = the set of candidate sentences to be aligned. We further use dictionary based paraphrase detection We map an entity mention in a sentence to the to filter sentences that do not express any relation- corresponding entity entry (i.e., Wikidata ID) in ships between entities. Wikidata via the hyperlink associated to the en- In the embedding module (detailed in Sec- tity mention, which is recorded in Wikidata as the tion 3.3), we propose a joint learning of word url property of the entity entry. Each pair may and entity embeddings by combining skip-gram contain one sentence and multiple triples. We sort (Mikolov et al., 2013) to compute the word em- the order of the triples based on the order of the beddings and TransE (Bordes et al., 2013) to com- predicate paraphrases that indicate the relation- pute the entity embeddings. The objective of the ships between entities in the sentence. We col- joint learning is to capture the similarity of words lect sentence-triple pairs by extracting sentences and entities that helps map the entity names into that contain both head and tail entities of Wikidata the related entity IDs. Moreover, the resulting en- triples. To generate high-quality sentence-triple tity embeddings are used to train a triple classi- pairs, we propose two additional steps: (1) extract- fier that helps filter invalid triples generated by our ing sentences that contain implicit entity names neural relation extraction model. using co-reference resolution, and (2) filtering sen- In the neural relation extraction module (de- tences that do not express any relationships using tailed in Section 3.4), we propose an n-gram based paraphrase detection. We detail these steps below. attention model by expanding the attention mech- Prior to aligning the sentences with triples, anism to the n-gram token of a sentence. The n- in Step (1), we find the implicit entity names gram attention computes the n-gram combination to increase the number of candidate sentences of attention weight to capture the verbal or noun to be aligned. We apply co-reference res- phrase context that complements the word level olution (Clark and Manning, 2016) to each attention of the standard attention model. This paragraph in a Wikipedia article and replace expansion helps our model to better capture the the extracted co-references with the proper en- multi-word context of entities and relationships. tity name. We observe that the first sen- The output of the encoder-decoder model is a tence of a paragraph in a Wikipedia arti- sequence of the entity and predicate IDs where ev- cle may contain a pronoun that refers to the ery three IDs indicate a triple. To generate high- main entity. For example, there is a para- quality triples, we propose two strategies. The first graph in the Barack Obama article that starts strategy uses a modified beam search that com- with a sentence "He was reelected to putes the lexical similarity of the extracted enti- the Illinois Senate in 1998". This ties with the surface form of entity names in the may cause the standard co-reference resolution to input sentence to ensure the correct entity predic- miss the implicit entity names for the rest of the tion. The second strategy uses a triple classifier paragraph. To address this problem, we heuristi- that is trained using the entity embeddings from cally replace the pronouns in the first sentence of a the joint learning to filter the invalid triples. The paragraph if the main entity name of the Wikipedia triple generation process is detailed in Section 3.5 page is not mentioned. For the sentence in the pre- "He" "Barack 3.2 Dataset Collection vious example, we replace with Obama". The intuition is that a Wikipedia article We aim to extract triples from a sentence for contains content of a single entity of interest, and KB enrichment by proposing a supervised rela- that the pronouns mentioned in the first sentence tion extraction model. To train such a model, we of a paragraph mostly relate to the main entity. need a large volume of fully labeled training data in the form of sentence-triple pairs. Following In Step (2), we use a dictionary based para- Sorokin and Gurevych(2017), we use distant su- phrase detection to capture relationships between pervision (Mintz et al., 2009) to align sentences in entities in a sentence. First, we create a dictionary Wikipedia1 with triples in Wikidata2 (Vrandecic by populating predicate paraphrases from three and Krotzsch¨ , 2014). sources including PATTY (Nakashole et al., 2012), POLY (Grycner and Weikum, 2016), and 1 https://dumps.wikimedia.org/enwiki/latest/enwiki- PPDB (Ganitkevitch et al., 2013) that yield 540 latest-pages-articles..bz2 2https://dumps.wikimedia.org/wikidatawiki/entities/latest- predicates and 24, 013 unique paraphrases. For all.ttl.gz example, predicate paraphrases for the relation- #pairs #triples #entities #predicates a joint learning of word and entity embeddings All () 255,654 330,005 279,888 158 Train+val 225,869 291,352 249,272 157 that is effective to capture the similarity between Test (WIKI) 29,785 38,653 38,690 109 words and entities for named entity disambigua- Test (GEO) 1,000 1,095 124 11 tion (Yamada et al., 2016). Note that our method differs from that of Yamada et al.(2016). We use Table 2: Statistics of the dataset. joint learning by combining skip-gram (Mikolov et al., 2013) to compute the word embeddings and ship "place of birth" are {born in, TransE (Bordes et al., 2013) to compute the entity was born in, ...}. Then, we use this embeddings (including the relationship embed- dictionary to filter sentences that do not express dings), while Yamada et al.(2016) use Wikipedia any relationships between entities. We use exact Link-based Measure (WLM) (Milne and Witten, string matching to find verbal or noun phrases in 2008) that does not consider the relationship em- a sentence which is a paraphrases of a predicate beddings. of a triple. For example, for the triple hBarack Our model learns the entity embeddings by min- Obama, place of birth, Honolului, imizing a margin-based objective function J : the sentence "Barack Obama was born E X X  0  in 1961 in Honolulu, Hawaii" will be JE = max 0, γ + f(tr) − f(tr) (1) t ∈T 0 0 retained while the sentence "Barack Obama r r tr ∈Tr visited Honolulu in 2010" will be Tr = {hh, r, ti|hh, r, ti ∈ G} (2) 0  0 0  0 0 removed (the sentence may be retained if there Tr = h , r, t | h ∈ E ∪ h, r, t | t ∈ E (3) is another valid triple hBarack Obama, f(tr) = kh + r − tk (4) visited, Honolului). This helps filter Here, kxk is the L1-Norm of vector x, γ is a mar- noises for the sentence-triple alignment. gin hyperparameter, Tr is the set of valid relation- The collected dataset contains 255,654 ship triples from a KG G, and T 0 is the set of cor- sentence-triple pairs. For each pair, the maximum r rupted relationship triples (recall that E is the set number of triples is four (i.e., a sentence can of entities in G). The corrupted triples are used produce at most four triples). We split the dataset as negative samples, which are created by replac- into train set (80%), dev set (10%) and test set ing the head or tail entity of a valid triple in Tr (10%) (we call it the WIKI test dataset). For with a random entity. We use all triples in Wiki- stress testing (to test the proposed model on a data except those which belong to the testing data different style of text than the training data), we to compute the entity embeddings. also collect another test dataset outside Wikipedia. To establish the interaction between the entity We apply the same procedure to the user reviews and word embeddings, we follow the Anchor of a travel website. First, we collect user reviews Context Model proposed by Yamada et al.(2016). on 100 popular landmarks in Australia. Then, First, we generate a by combining we apply the adapted distant supervision to the the original text and the modified anchor text reviews and collect 1,000 sentence-triple pairs (we of Wikipedia. This is done by replacing the call it the GEO test dataset). Table2 summarizes entity names in a sentence with the related entity the statistics of our datasets. or predicate IDs. For example, the sentence "New York University is a private 3.3 Joint Learning of Word and Entity university in Manhattan" is mod- Embeddings ified into "Q49210 is a Q902104 in Our relation extraction model is based on Q11299". Then, we use the skip-gram method to the encoder-decoder framework which has been compute the word embeddings from the generated widely used in Neural to corpus (the entity IDs in the modified anchor text translate text from one language to another. In our are treated as words in the skip-gram model). setup, we aim to translate a sentence into triples, Given a sequence of n words [w1, w2, ..., wn], and hence the vocabulary of the source input is a The model learns the word embeddings, by set of English words while the vocabulary of the minimizing the following objective function JW : target output is a set of entity and predicate IDs n 1 X X in an existing KG. To compute the embeddings J = log P (w |w ) (5) W T t+j t of the source and target vocabularies, we propose t=1 −c≤j≤c,j6=0 We address the above problem by proposing an 0 > n-gram based attention model. This model com- exp(vwt+j vwt ) P (wt+j|wt) = (6) putes the attention of all possible n-grams of the PW (v0 >v ) i=1 i wt sentence input. The attention weights are com- puted over the n-gram combinations of the word where c is the size of the context window, wt embeddings, and hence the context vector for the denotes the target word, and wt+j is the context 0 decoder is computed as follows. word; vw and vw are the input and output vector representations of word w, and W is the vocab-  |N| |Xn|  ulary size. The overall objective function of the d e X n X n n ct = h ; W  αi xi  (8) joint learning of word and entity embeddings is: n=1 i=1 exp(he>Vnxn) J = JE + JW (7) n i αi = n (9) P|X | e> n n j=1 exp(h V xj ) 3.4 N-gram Based Attention Model Here, cd is the context vector of the decoder at Our proposed relation extraction model integrates t timestep t, he is the last hidden state of the en- the extraction and canonicalization tasks for KB coder, the superscript n indicates the n-gram com- enrichment in an end-to-end manner. To build bination, x is the word embeddings of input sen- such a model, we employ an encoder-decoder tence, |Xn| is the total number of n-gram token model (Cho et al., 2014) to translate a sentence combination, N indicates the maximum value of into a sequence of triples. The encoder encodes n used in the n-gram combinations (N = 3 in a sentence into a vector that is used by the de- our experiments), W and V are learned parameter coder as a context to generate a sequence of triples. matrices, and α is the attention weight. Because we treat the input and output as a se- quence, We use the LSTM networks (Hochreiter Training and Schmidhuber, 1997) in the encoder and the In the training phase, in addition to the sentence- decoder. triple pairs collected using distant supervi- The encoder-decoder with attention model sion (see Section 3.2), we also add pairs of (Bahdanau et al., 2015) has been used in machine hEntity-name, Entity-IDi of all entities translation. However, in the relation extraction in the KB to the training data, e.g., hNew York task, the attention model cannot capture the multi- University, Q49210i. This allows the word entity names. In our preliminary investiga- model to learn the mapping between entity names tion, we found that the attention model yields mis- and entity IDs, especially for the unseen entities. alignment between the word and the entity. The above problem is due to the same words 3.5 Triple Generation in the names of different entities (e.g., the word The output of the encoder-decoder model is a se- University in different university names such quence of the entity and predicate IDs where every as New York University, Washington three tokens indicate a triple. Therefore, to extract University, etc.). During training, the model a triple, we simply group every three tokens of the pays more attention to the word University generated output. However, the greedy approach to differentiate different types of entities of a (i.e., picking the entity with the highest probabil- similar name, e.g., New York University, ity of the last softmax layer of the decoder) may New York Times Building, or New lead the model to extract incorrect entities due to York Life Building, but not the same the similarity between entity embeddings (e.g., the types of entities of different names (e.g., New embeddings of New York City and Chicago York University and Washington may be similar because both are cities in USA). To University). This may cause errors in entity address this problem, we propose two strategies: alignment, especially when predicting the ID of re-ranking the predicted entities using a modified an entity that is not in the training data. Even beam search and filtering invalid triples using a though we add hEntity-name, Entity-IDi triple classifier. pairs as training data (see the Training section), The modified beam search re-ranks top-k (k = the misalignments still take place. 10 in our experiments) entity IDs that are predicted WIKI GEO Model Precision Recall F1 Precision Recall F1 MinIE (+AIDA) 0.3672 0.4856 0.4182 0.3574 0.3901 0.3730 MinIE (+NeuralEL) 0.3511 0.3967 0.3725 0.3644 0.3811 0.3726 ClausIE (+AIDA) 0.3617 0.4728 0.4099 0.3531 0.3951 0.3729 Existing ClausIE (+NeuralEL) 0.3445 0.3786 0.3607 0.3563 0.3791 0.3673 Models CNN (+AIDA) 0.4035 0.3503 0.3750 0.3715 0.3165 0.3418 CNN (+NeuralEL) 0.3689 0.3521 0.3603 0.3781 0.3005 0.3349 Single Attention 0.4591 0.3836 0.4180 0.4010 0.3912 0.3960 Single Attention (+pre-trained) 0.4725 0.4053 0.4363 0.4314 0.4311 0.4312 Single Attention (+beam) 0.6056 0.5231 0.5613 0.5869 0.4851 0.5312 Encoder- Single Attention (+triple classifier) 0.7378 0.5013 0.5970 0.6704 0.5301 0.5921 Decoder Transformer 0.4628 0.3897 0.4231 0.4575 0.4620 0.4597 Models Transformer (+pre-trained) 0.4748 0.4091 0.4395 0.4841 0.4831 0.4836 Transformer (+beam) 0.5829 0.5025 0.5397 0.6181 0.6161 0.6171 Transformer (+triple classifier) 0.7307 0.4866 0.5842 0.7124 0.5761 0.6370 N-gram Attention 0.7014 0.6432 0.6710 0.6029 0.6033 0.6031 N-gram Attention (+pre-trained) 0.7157 0.6634 0.6886 0.6581 0.6631 0.6606 Proposed N-gram Attention (+beam) 0.7424 0.6845 0.7123 0.6816 0.6861 0.6838 N-gram Attention (+triple classifier) 0.8471 0.6762 0.7521 0.7705 0.6771 0.7208

Table 3: Experiments result. by the decoder by computing the edit distance be- with a learning rate of 0.0002. tween the entity names (obtained from the KB) and every n-gram token of the input sentence. The 4.2 Models intuition is that the entity name should be men- 3 tioned in the sentence so that the entity with the We compare our proposed model with three ex- highest similarity will be chosen as the output. isting models including CNN (the state-of-the- Our triple classifier is trained with entity em- art supervised approach by Lin et al.(2016)), beddings from the joint learning (see Section 3.3). MiniE (the state-of-the-art unsupervised approach Triple classification is one of the metrics to evalu- by Gashteovski et al.(2017)), and ClausIE by ate the quality of entity embeddings (Socher et al., Corro and Gemulla(2013). To map the extracted 2013). We build a classifier to determine the valid- entities by these models, we use two state-of-the- ity of a triple hh, r, ti. We train a binary classifier art NED systems including AIDA (Hoffart et al., based on the plausibility score (h + r − t) (the 2011) and NeuralEL (Kolitsas et al., 2018). The score to compute the entity embeddings). We cre- precision (tested on our test dataset) of AIDA and ate negative samples by corrupting the valid triples NeuralEL are 70% and 61% respectively. To map (i.e., replacing the head or tail entity by a random the extracted predicates (relationships) of the un- entity). The triple classifier is effective to filter in- supervised approaches output, we use the dictio- valid triple such as hNew York University, nary based paraphrase detection. We use the same capital of, Manhattani. dictionary that is used to collect the dataset (i.e., the combination of three paraphrase dictionaries 4 Experiments including PATTY (Nakashole et al., 2012), POLY (Grycner and Weikum, 2016), and PPDB (Gan- We evaluate our model on two real datasets includ- itkevitch et al., 2013)). We replace the extracted ing WIKI and GEO test datasets (see Section 3.2). predicate with the correct predicate ID if one of We use precision, recall, and F1 score as the eval- the paraphrases of the correct predicate (i.e., the uation metrics. gold standard) appear in the extracted predicate. 4.1 Hyperparameters Otherwise, we replace the extracted predicate with "NA" to indicate an unrecognized predicate. We We use grid search to find the best hyper- also compare our N-gram Attention model with parameters for the networks. We use 512 hidden two encoder-decoder based models including the units for both the encoder and the decoder. We use Single Attention model (Bahdanau et al., 2015) 64 dimensions of pre-trained word and entity em- and Transformer model (Vaswani et al., 2017). beddings (see Section 3.3). We use a 0.5 dropout rate for regularization on both the encoder and the 3The code and the dataset are made available at decoder. We use Adam (Kingma and Ba, 2015) http://www.ruizhang.info/GKB/gkb.htm 4.3 Results Discussion We further perform manual error analysis. We Table3 shows that the end-to-end models outper- found that the incorrect output of our model is form the existing model. In particular, our pro- caused by the same entity name of two differ- posed n-gram attention model achieves the best ent entities (e.g., the name of Michael Jordan results in terms of precision, recall, and F1 score. that refers to the American basketball player or Our proposed model outperforms the best existing the English footballer). The modified beam search model (MinIE) by 33.39% and 34.78% in terms cannot disambiguate those entities as it only con- of F1 score on the WIKI and GEO test dataset re- siders the lexical similarity. We consider using spectively. These results are expected since the context-based similarity as future work. existing models are affected by the error propa- gation of the NED. As expected, the combination 5 Conclusions of the existing models with AIDA achieves higher We proposed an end-to-end relation extraction F1 scores than the combination with NeuralEL as model for KB enrichment that integrates the ex- AIDA achieves a higher precision than NeuralEL. traction and canonicalization tasks. Our model To further show the effect of error propagation, thus reduces the error propagation between rela- we set up an experiment without the canonical- tion extraction and NED that existing approaches ization task (i.e., the objective is predicting a re- are prone to. To obtain high-quality training data, lationship between known entities). We remove we adapt distant supervision and augment it with the NED pre-processing step by allowing the CNN co-reference resolution and paraphrase detection. model to access the correct entities. Meanwhile, We propose an n-gram based attention model that we provide the correct entities to the decoder of better captures the multi-word entity names in a our proposed model. In this setup, our proposed sentence. Moreover, we propose a modified beam model achieves 86.34% and 79.11%, while CNN search and a triple classification that helps the achieves 81.92% and 75.82% in precision over the model to generate high-quality triples. WIKI and GEO test datasets, respectively. Experimental results show that our proposed model outperforms the existing models by 33.39% Our proposed n-gram attention model outper- and 34.78% in terms of F1 score on the WIKI forms the end-to-end models by 15.51% and and GEO test dataset respectively. These re- 8.38% in terms of F1 score on the WIKI and GEO sults confirm that our model reduces the error test datasets, respectively. The Transformer model propagation between NED and relation extraction. also only yields similar performance to that of Our proposed n-gram attention model outperforms the Single Attention model, which is worse than the other encoder-decoder models by 15.51% and ours. These results indicate that our model cap- 8.38% in terms of F1 score on the two real-world tures multi-word entity name (in both datasets, datasets. These results confirm that our model bet- 82.9% of the entities have multi-word entity name) ter captures the multi-word entity names in a sen- in the input sentence better than the other models. tence. In the future, we plan to explore context- Table3 also shows that the pre-trained embed- based similarity to complement the lexical simi- dings improve the performance of the model in all larity to improve the overall performance. measures. Moreover, the pre-trained embeddings Acknowledgments help the model to converge faster. In our experi- ments, the models that use the pre-trained embed- Bayu Distiawan Trisedya is supported by the In- dings converge in 20 epochs on average, while the donesian Endowment Fund for Education (LPDP). models that do not use the pre-trained embeddings This work is done while Bayu Distiawan Trisedya converge in 30 − 40 epochs. Our triple classifier is visiting the Max Planck Institute for In- combined with the modified beam search boost the formatics. This work is supported by Aus- performance of the model. The modified beam tralian Research Council (ARC) Discovery Project search provides a high recall by extracting the cor- DP180102050, Google Faculty Research Award, rect entities based on the surface form in the input and the National Science Foundation of China sentence while the triple classifier provides a high (Project No. 61872070 and No. 61402155). precision by filtering the invalid triples. References Luciano Del Corro and Rainer Gemulla. 2013. Clausie: clause-based open information extraction. In Pro- Gabor Angeli, Melvin Jose Johnson Premkumar, and ceedings of International Conference on World Wide Christopher D. Manning. 2015. Leveraging linguis- Web, pages 355–366. tic structure for open domain information extrac- tion. In Proceedings of Association for Computa- Lei Cui, Furu Wei, and Ming Zhou. 2018. Neural open tional Linguistics, pages 344–354. information extraction. In Proceedings of Associa- tion for Computational Linguistics, pages 407–413. Soren¨ Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. Anthony Fader, Stephen Soderland, and Oren Etzioni. 2007. Dbpedia: A nucleus for a web of open data. 2011. Identifying relations for open information ex- In Proceedings of International Con- traction. In Proceedings of Empirical Methods in ference, pages 722–735. Natural Language Processing, pages 1535–1545. Luis Galarraga,´ Geremy Heitz, Kevin Murphy, and Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- Fabian M. Suchanek. 2014. Canonicalizing open gio. 2015. Neural machine translation by jointly knowledge bases. In Proceedings of International learning to align and translate. In Proceedings of Conference on Information and Knowledge Man- the International Conference on Learning Represen- agement, pages 1679–1688. tations. Juri Ganitkevitch, Benjamin Van Durme, and Chris Michele Banko, Michael J. Cafarella, Stephen Soder- Callison-Burch. 2013. Ppdb: The paraphrase land, Matthew Broadhead, and Oren Etzioni. 2007. . In Proceedings of North American Chap- Open information extraction from the web. In Pro- ter of the Association for Computational Linguistics: ceedings of International Joint Conference on Artif- Human Language Technologies, pages 758–764. ical intelligence, pages 2670–2676. Kiril Gashteovski, Rainer Gemulla, and Luciano Del Corro. 2017. Minie: Minimizing facts in open in- Antoine Bordes, Nicolas Usunier, Alberto Garcia- formation extraction. In Proceedings of Empiri- Duran, Jason Weston, and Oksana Yakhnenko. cal Methods in Natural Language Processing, pages 2013. Translating embeddings for modeling multi- 2620–2630. relational data. In Proceedings of International Conference on Neural Information Processing Sys- Adam Grycner and Gerhard Weikum. 2016. Poly: tems, pages 2787–2795. Mining relational paraphrases from multilingual sentences. In Proceedings of Empirical Methods in Sergey Brin. 1998. Extracting patterns and relations Natural Language Processing, pages 2183–2192. from the . In Proceedings of The World Wide Web and International Work- Luheng He, Kenton Lee, Omer Levy, and Luke Zettle- shop, pages 172–183. moyer. 2018. Jointly predicting predicates and argu- ments in neural semantic role labeling. In Proceed- Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr ings of Association for Computational Linguistics, Settles, Estevam R. Hruschka Jr., and Tom M. pages 364–369. Mitchell. 2010. Toward an architecture for never- Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. ending language learning. In Proceedings of AAAI Long short-term memory. Neural Computation, Conference on Artificial Intelligence, pages 1306– 9(8):1735–1780. 1313. Johannes Hoffart et al. 2011. Robust disambiguation Muhao Chen, Yingtao Tian, Mohan Yang, and Carlo of named entities in text. In Proceedings of Empiri- Zaniolo. 2017. Multilingual em- cal Methods in Natural Language Processing, pages beddings for cross-lingual knowledge alignment. In 782–792. Proceedings of International Joint Conference on Artificial Intelligence, pages 1511–1517. Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. 2010. Learning 5000 relational extractors. In Kyunghyun Cho, Bart van Merrienboer, Caglar Gul- Proceedings of Association for Computational Lin- cehre, Dzmitry Bahdanau, Fethi Bougares, Holger guistics, pages 286–295. Schwenk, and Yoshua Bengio. 2014. Learning Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. phrase representations using rnn encoder–decoder 2017. Distant supervision for relation extraction for statistical machine translation. In Proceedings with sentence-level attention and entity descriptions. of Empirical Methods in Natural Language Process- In Proceedings of AAAI Conference on Artificial In- ing, pages 1724–1734. telligence, pages 3060–3066. Kevin Clark and Christopher D. Manning. 2016. Deep Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: reinforcement learning for mention-ranking corefer- A method for stochastic optimization. In Proceed- ence models. In Proceedings of Empirical Methods ings of International Conference on Learning Rep- in Natural Language Processing, pages 2256–2262. resentations. Nikolaos Kolitsas, Octavian-Eugen Ganea, and Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Thomas Hofmann. 2018. End-to-end neural en- Nguyen, and Dinh Q. Phung. 2018. A novel embed- tity linking. In Proceedings of Conference on ding model for knowledge base completion based Computational Natural Language Learning, pages on convolutional neural network. In Proceedings 519–529. of North American Chapter of the Association for Computational Linguistics: Human Language Tech- Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. nologies, volume 2, pages 327–333. Neural relation extraction with multi-lingual atten- tion. In Proceedings of Association for Computa- Thien Huu Nguyen and Ralph Grishman. 2015. Rela- tional Linguistics, volume 1, pages 34–43. tion extraction: Perspective from convolutional neu- ral networks. In Proceedings of Workshop on Vector Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, Space Modeling for Natural Language Processing, and Maosong Sun. 2016. Neural relation extraction pages 39–48. with selective attention over instances. In Proceed- ings of Association for Computational Linguistics, Sebastian Riedel, Limin Yao, and Andrew McCallum. volume 1, pages 2124–2133. 2010. Modeling relations and their mentions with- out labeled text. In Proceedings of European Con- Diego Marcheggiani and Laura Perez-Beltrachini. ference on and Knowledge Dis- 2018. Deep graph convolutional encoders for struc- covery in Databases, pages 148–163. tured data to text generation. Proceedings of Inter- national Conference on Natural Language Genera- Sebastian Riedel, Limin Yao, Andrew McCallum, and tion, pages 1–9. Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Pro- Mausam. 2016. Open information extraction systems ceedings of North American Chapter of the Associ- and downstream applications. In Proceedings of ation for Computational Linguistics: Human Lan- International Joint Conference on Artificial Intelli- guage Technologies, pages 74–84. gence, pages 4074–4077. Christopher De Sa, Alexander Ratner, Christopher Re,´ Mausam, Michael Schmitz, Stephen Soderland, Robert Jaeho Shin, Feiran Wang, Sen Wu, and Ce Zhang. Bart, and Oren Etzioni. 2012. Open language learn- 2017. Incremental knowledge base construction ing for information extraction. In Proceedings of using deepdive. Very Large Data Bases Journal, Empirical Methods in Natural Language Process- 26(1):81–105. ing, pages 523–534. Wei Shen, Jianyong Wang, and Jiawei Han. 2015. En- Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. tity linking with a knowledge base: Issues, tech- Corrado, and Jeffrey Dean. 2013. Distributed rep- niques, and solutions. IEEE Trans. Knowl. Data resentations of words and phrases and their com- Eng., 27(2):443–460. positionality. In Proceedings of International Con- Richard Socher, Danqi Chen, Christopher D. Manning, ference on Neural Information Processing Systems, and Andrew Y. Ng. 2013. Reasoning with neural pages 3111–3119. tensor networks for knowledge base completion. In Proceedings of International Conference on Neural David Milne and Ian H. Witten. 2008. An effective, Information Processing Systems, pages 926–934. low-cost measure of semantic relatedness obtained from wikipedia links. In Proceedings of AAAI Work- Daniil Sorokin and Iryna Gurevych. 2017. Context- shop on Wikipedia and Artificial Intelligence, pages aware representations for knowledge base relation 25–30. extraction. In Proceedings of Empirical Methods in Natural Language Processing, pages 1784–1789. Mike Mintz, Steven Bills, Rion Snow, and Daniel Ju- rafsky. 2009. Distant supervision for relation ex- Gabriel Stanovsky, Julian Michael, Ido Dagan, and traction without labeled data. In Proceedings of the Luke Zettlemoyer. 2018. Supervised open infor- Joint Conference of Association for Computational mation extraction. In Proceedings of North Amer- Linguistics and International Joint Conference on ican Chapter of the Association for Computational Natural Language Processing of the AFNLP, pages Linguistics: Human Language Technologies, pages 1003–1011. 885–895. Makoto Miwa and Mohit Bansal. 2016. End-to-end re- Fabian M. Suchanek, Gjergji Kasneci, and Gerhard lation extraction using lstms on sequences and tree Weikum. 2007. Yago: a core of semantic knowl- structures. In Proceedings of Association for Com- edge. In Proceedings of International Conference putational Linguistics. on World Wide Web, pages 697–706. Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek, Mauro Sozio, and Gerhard Fabian M. Suchanek. 2012. Patty: A taxonomy of Weikum. 2009. SOFIE: a self-organizing frame- relational patterns with semantic types. In Proceed- work for information extraction. In Proceedings of ings of Empirical Methods in Natural Language International Conference on World Wide Web, pages Processing, pages 1135–1145. 631–640. Zequn Sun, Wei Hu, and Chengkai Li. 2017. Cross-lingual entity alignment via joint attribute- preserving embedding. Proceedings of Interna- tional Semantic Web Conference, pages 628–644. Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Pro- ceedings of Empirical Methods in Natural Language Processing, pages 455–465. Bayu Distiawan Trisedya, Jianzhong Qi, and Rui Zhang. 2019. Entity alignment between knowledge graphs using attribute embeddings. In Proceedings of AAAI Conference on Artificial Intelligence. Bayu Distiawan Trisedya, Jianzhong Qi, Rui Zhang, and Wei Wang. 2018. Gtr-lstm: A triple encoder for sentence generation from rdf data. In Proceedings of Association for Computational Linguistics, pages 1627–1637. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Neural Information Processing Systems, pages 5998–6008. Denny Vrandecic and Markus Krotzsch.¨ 2014. Wiki- data: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85. Quan Wang, Bin Wang, and Li Guo. 2015. Knowl- edge base completion using embeddings and rules. In Proceedings of International Joint Conference on Artificial Intelligence, pages 1859–1865. Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint learning of the embedding of words and entities for named entity disambiguation. In Proceedings of Conference on Computational Natural Language Learning, pages 250–259. Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Pro- ceedings of Empirical Methods in Natural Language Processing, pages 1753–1762. Peng Zhou, Jiaming Xu, Zhenyu Qi, Hongyun Bao, Zhineng Chen, and Bo Xu. 2018. Distant supervi- sion for relation extraction with hierarchical selec- tive attention. Neural Networks, 108:240–247.