A Soft-label Method for Noise-tolerant Distantly Supervised Relation Extraction

Tianyu Liu, Kexiang Wang, Baobao Chang and Zhifang Sui
Key Laboratory of Computational Linguistics, Ministry of Education,
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
{tianyu0421, wkx, chbb, szf}@pku.edu.cn

Abstract

Distant-supervised relation extraction inevitably suffers from wrong labeling problems because it heuristically labels relational facts with knowledge bases. Previous sentence-level denoising models do not achieve satisfying performance because they use hard labels, which are determined by distant supervision and immutable during training. To this end, we introduce an entity-pair-level denoising method which exploits semantic information from correctly labeled entity pairs to correct wrong labels dynamically during training. We propose a joint score function which combines the relational scores based on the entity-pair representation and the confidence of the hard label to obtain a new label, namely a soft label, for a given entity pair. During training, soft labels instead of hard labels serve as gold labels. Experiments on the benchmark dataset show that our method dramatically reduces noisy instances and outperforms the state-of-the-art systems.

[Figure 1: An example of soft-label correction on the Nationality relation. We intend to use syntactic/semantic information of correctly labeled entity pairs (blue) to correct the false positive and false negative instances (orange) during training.]

1 Introduction

Relation Extraction (RE) aims to obtain relational facts from plain text. Traditional supervised RE systems suffer from a lack of manually labeled data. Mintz et al. (2009) propose distant supervision, which exploits relational facts in knowledge bases (KBs). Distant supervision automatically generates training examples by aligning entity mentions in plain text with those in a KB and labeling entity pairs with their relations in the KB. If there is no relation link between a certain entity pair in the KB, it is labeled as a negative instance (NA). However, the automatic labeling inevitably introduces wrong labels because the relations of entity pairs might be missing from KBs or mislabeled.

Multi-instance learning (MIL) was proposed by Riedel et al. (2010) to combat this noise. The method divides the training set into multiple bags of entity pairs (shown in Fig. 1) and labels the bags with the relations of the entity pairs in the KB. Each bag consists of sentences mentioning both the head and tail entities. Much effort has been made to reduce the influence of noisy sentences within the bag, including methods based on the at-least-one assumption (Hoffmann et al., 2011; Ritter et al., 2013; Zeng et al., 2015) and attention mechanisms over instances (Lin et al., 2016; Ji et al., 2017).
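To make the bag construction concrete, the following minimal Python sketch groups distant-supervised sentences into entity-pair bags. It is our own illustration under stated assumptions, not code from the paper; the names build_bags, aligned_mentions and kb_relations are hypothetical.

    from collections import defaultdict

    def build_bags(aligned_mentions, kb_relations):
        """Group sentences into entity-pair bags with distant-supervised labels.

        aligned_mentions: iterable of (head, tail, sentence) triples obtained by
            matching entity mentions in plain text against KB entities.
        kb_relations: dict mapping (head, tail) -> relation name; pairs absent
            from the KB receive the negative label NA.
        """
        bags = defaultdict(list)
        for head, tail, sentence in aligned_mentions:
            bags[(head, tail)].append(sentence)
        # Each bag inherits the (possibly wrong) hard label from the KB.
        return {pair: {"sentences": sents, "label": kb_relations.get(pair, "NA")}
                for pair, sents in bags.items()}

Note that the hard label is fixed here once and for all; the soft-label method introduced below replaces it with a label that can change across training epochs.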
However, the sentence-level denoising methods cannot fully address the wrong labeling problem, largely because they use a hard-label method in which the labels of entity pairs are immutable during training, no matter whether they are correct or not. As shown in Fig. 1, due to the absence of (Jan Eliasson[1], Sweden) from the Nationality relation in the KB, the entity pair is mislabeled as NA. However, we find that the sentences in the bag of (Jan Eliasson, Sweden) share a similar semantic pattern, "X of Y", with correctly labeled instances (blue). In the false positive instance, Sebastian Roch is indeed from France, but the syntactic pattern of the sentence in the bag differs greatly from those of the correctly labeled instances. In fact, the reliability of a distant-supervised (DS) label can be determined by the syntactic/semantic similarity between a given instance and the potential correctly labeled instances. The soft-label method intends to utilize these similarities to correct wrong DS labels dynamically during the training stage, which means the same bag may have different soft labels in different epochs of training. The basis of the soft-label method is the dominance of correctly labeled instances. Fortunately, Xu et al. (2013) show that correctly labeled instances account for 94.4% (including true negatives) of the distant-supervised New York Times corpus (the benchmark dataset).

To this end, we introduce a soft-label method to correct wrong labels at the entity-pair level during training by exploiting semantic/syntactic information from correctly labeled instances. In our model, the representation of a given entity pair is a weighted combination of the related sentences, which are encoded by a piecewise convolutional neural network (PCNN) (Zeng et al., 2015). Besides, we propose a joint score function to obtain soft labels during training by taking both the confidence of DS labels and the entity-pair representations into consideration. Our contributions are three-fold:

• To the best of our knowledge, we are the first to propose an entity-pair-level noise-tolerant method, while previous works only focused on sentence-level noise.

• We propose a simple but effective method, called the soft-label method, to dynamically correct wrong labels during training. A case study shows our corrections are of high accuracy.

• We evaluate our model on the benchmark dataset and achieve substantial improvements compared with the state-of-the-art systems.

[1] Jan Eliasson is a Swedish diplomat.

2 Methodology

The multi-instance learning (MIL) framework splits the training set M into multiple entity-pair bags {⟨h_1, t_1⟩, ⟨h_2, t_2⟩, ..., ⟨h_n, t_n⟩}. Each bag ⟨h_i, t_i⟩ contains sentences {x_1, x_2, ..., x_c} which mention both the head entity h_i and the tail entity t_i. The representation s of bag ⟨h_i, t_i⟩ is a weighted combination of the related sentence vectors {x_1, x_2, ..., x_c}, which are encoded by a CNN. Finally, we use the soft-label score function to correct wrong labels of entity-pair bags while computing the probabilities of each relation type.

2.1 Sentence Encoder

We get the representation of a sentence x = {w_1, w_2, ..., w_m} by concatenating word embeddings {w_1, w_2, ..., w_m} and position embeddings (Zeng et al., 2014) {p_1, p_2, ..., p_m}, where w_i ∈ R^{d_w} and p_i ∈ R^{d_p}, so each token representation lies in R^d with d = d_w + d_p.

The convolution layer utilizes a sliding window of size l. We define q_i ∈ R^{l×d} as the concatenation of the token representations within the i-th window:

    q_i = w_{i-l+1:i} \quad (1 \le i \le m + l - 1)    (1)

The convolution matrix is denoted by W_c ∈ R^{d_c×(l·d)}, where d_c is the sentence embedding size. The i-th filter of the convolutional layer is computed as:

    f_i = [W_c q + b]_i    (2)
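As a concrete illustration of Eqs. (1)-(2), here is a minimal numpy sketch of the convolution layer. The function name and the zero-padding scheme are our own assumptions, chosen so that every window w_{i-l+1:i} with 1 ≤ i ≤ m + l − 1 is defined.

    import numpy as np

    def convolve(word_emb, pos_emb, W_c, b, l=3):
        # word_emb: (m, d_w) word embeddings; pos_emb: (m, d_p) position
        # embeddings relative to the head and tail entities.
        x = np.concatenate([word_emb, pos_emb], axis=1)    # (m, d), d = d_w + d_p
        m, d = x.shape
        # Zero-pad l-1 positions on each side so all m+l-1 windows of Eq. (1) exist.
        padded = np.vstack([np.zeros((l - 1, d)), x, np.zeros((l - 1, d))])
        windows = np.stack([padded[i:i + l].ravel()        # q_i flattened to l*d
                            for i in range(m + l - 1)])    # (m+l-1, l*d)
        # Eq. (2): each row of W_c (shape (d_c, l*d)) is one convolutional filter;
        # entry [j, i] of the result is the response of filter f_i at window j.
        return windows @ W_c.T + b                         # (m+l-1, d_c)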
Afterwards, piecewise max-pooling (Zeng et al., 2015) is used to divide each convolutional filter f_i into three parts f_i^1, f_i^2, f_i^3 by the head and tail entities. For example, the sentence "Barack Obama was born in Honolulu in 1961" is divided into "Barack Obama", "was born in Honolulu" and "in 1961". We perform max-pooling on these three parts separately, and the i-th element of the sentence vector x ∈ R^{3d_c} is defined as the concatenation of them:

    x_i = [\max(f_i^1); \max(f_i^2); \max(f_i^3)]    (3)

2.2 Sentence-Level Weight Distribution

The representation of entity pair ⟨h_i, t_i⟩ is defined as a weighted combination of the sentences in the bag.

At-least-one: The at-least-one assumption is a down-sampling method which assumes that at least one sentence in the bag expresses the relation between the two entities, and selects the most likely sentence in the bag for training and prediction. To be more specific, the weight of the selected sentence is 1 while those of the other sentences in the bag are all 0.

Selective Attention: Lin et al. (2016) propose a selective attention mechanism to reduce the weights of noisy instances within the entity-pair bag:

    s = \sum_i \alpha_i x_i, \quad \alpha_i = \frac{\exp(x_i A r)}{\sum_k \exp(x_k A r)}    (4)

where α_i is the weight of sentence vector x_i, A is a diagonal matrix, and r is the relation query vector.

[Figure 2: Precision/Recall curves of our model and previous state-of-the-art systems: Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIMLRE (Surdeanu et al., 2012).]

2.3 Soft-label Adjustment

The key of our method is to derive a soft label as the gold label for each bag dynamically during training, which is not necessarily the same as the distant-supervised (DS) label.
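The soft-label adjustment operates on the bag representation s defined above. To tie Sections 2.1-2.2 together, the following numpy sketch implements piecewise max-pooling (Eq. 3) and selective attention (Eq. 4). It is a sketch under our own assumptions: the function names are ours, the diagonal matrix A is stored as its diagonal vector, and we assume 0 < head_pos < tail_pos < number of windows so that no segment is empty.

    import numpy as np

    def piecewise_max_pool(f, head_pos, tail_pos):
        # f: (m+l-1, d_c) convolutional feature map of one sentence.
        # Split at the head and tail entity positions (three segments).
        parts = np.split(f, [head_pos, tail_pos])
        # Eq. (3): concatenate the per-filter maxima of the three segments.
        return np.concatenate([part.max(axis=0) for part in parts])  # (3*d_c,)

    def bag_representation(sentence_vecs, A_diag, r):
        # sentence_vecs: (c, 3*d_c) stacked PCNN vectors x_1..x_c of one bag.
        # A_diag: diagonal of the attention matrix A; r: relation query vector.
        scores = sentence_vecs @ (A_diag * r)      # x_i A r for each sentence
        weights = np.exp(scores - scores.max())    # numerically stable softmax
        alpha = weights / weights.sum()            # Eq. (4) attention weights
        return alpha @ sentence_vecs               # s = sum_i alpha_i * x_i

Under the at-least-one assumption, alpha would instead be a one-hot vector selecting the most likely sentence; the soft-label adjustment then replaces the fixed DS label of the bag whose representation s is computed here.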