
A Soft-label Method for Noise-tolerant Distantly Supervised Relation Extraction

Tianyu Liu, Kexiang Wang, Baobao Chang and Zhifang Sui
Key Laboratory of Computational Linguistics, Ministry of Education,
School of Electronics Engineering and Computer Science, Peking University, Beijing, China
{tianyu0421, wkx, chbb, szf}@pku.edu.cn

Abstract

Distant-supervised relation extraction inevitably suffers from wrong labeling problems because it heuristically labels relational facts with knowledge bases. Previous denoise models cannot achieve satisfying performances because they use hard labels, which are determined by distant supervision and immutable during training. To this end, we introduce an entity-pair level denoise method which exploits semantic information from correctly labeled entity pairs to correct wrong labels dynamically during training. We propose a joint score function which combines the relational scores based on the entity-pair representation and the confidence of the hard label to obtain a new label, namely a soft label, for a certain entity pair. During training, soft labels instead of hard labels serve as gold labels. Experiments on the benchmark dataset show that our method dramatically reduces noisy instances and outperforms the state-of-the-art systems.

[Figure 1: An example of soft-label correction on the Nationality relation. We intend to use syntactic/semantic information of correctly labeled entity pairs (blue) to correct the false positive and false negative instances (orange) during training.]

1 Introduction

Relation Extraction (RE) aims to obtain relational facts from plain text. Traditional supervised RE systems suffer from a lack of manually labeled data. Mintz et al. (2009) propose distant supervision, which exploits relational facts in knowledge bases (KBs). Distant supervision automatically generates training examples by aligning entity mentions in plain text with those in the KB and labeling entity pairs with their relations in the KB. If there is no relation link between a certain entity pair in the KB, it will be labeled as a negative instance (NA). However, the automatic labeling is inevitably accompanied by wrong labels because the relations of entity pairs might be missing from KBs or mislabeled.

Multi-instance learning (MIL) is proposed by Riedel et al. (2010) to combat the noise. The method divides the training set into multiple bags of entity pairs (shown in Fig. 1) and labels the bags with the relations of entity pairs in the KB. Each bag consists of sentences mentioning both head and tail entities. Much effort has been made in reducing the influence of noisy sentences within the bag, including methods based on the at-least-one assumption (Hoffmann et al., 2011; Ritter et al., 2013; Zeng et al., 2015) and attention mechanisms over instances (Lin et al., 2016; Ji et al., 2017).

However, the sentence-level denoise methods can't fully address the wrong labeling problem, largely because they use a hard-label method in which the labels of entity pairs are immutable during training, no matter whether they are correct or not.

As shown in Fig. 1, due to the absence of (Jan Eliasson¹, Sweden) from the Nationality relation in the KB, the entity pair is mislabeled as NA. However, we find that the sentences in the bag of (Jan Eliasson, Sweden) share the semantic pattern "X of Y" with correctly labeled instances (blue). In the false positive instance, Sebastian Roch is indeed from France, but the syntactic pattern of the sentence in the bag differs greatly from those of correctly labeled instances. Actually, the reliability of a distant-supervised (DS) label can be determined by the syntactic/semantic similarity between a certain instance and the potential correctly labeled instances. The soft-label method intends to utilize the corresponding similarities to correct wrong DS labels dynamically in the training stage, which means the same bag may have different soft labels in different epochs of training. The basis of the soft-label method is the dominance of correctly labeled instances. Fortunately, Xu et al. (2013) prove that correctly labeled instances account for 94.4% (including true negatives) of the distant-supervised New York Times corpus (the benchmark dataset).

To this end, we introduce a soft-label method to correct wrong labels at the entity-pair level during training by exploiting semantic/syntactic information from correctly labeled instances. In our model, the representation of a certain entity pair is a weighted combination of related sentences which are encoded by a piecewise convolutional neural network (PCNN) (Zeng et al., 2015). Besides, we propose a joint score function to obtain soft labels during training by taking both the confidence of DS labels and the entity-pair representations into consideration. Our contributions are three-fold:

• To the best of our knowledge, we are the first to propose an entity-pair level noise-tolerant method, while previous works only focused on sentence-level noise.

• We propose a simple but effective method, called the soft-label method, to dynamically correct wrong labels during training. Case study shows our corrections are of high accuracy.

• We evaluate our model on the benchmark dataset and achieve substantial improvement compared with the state-of-the-art systems.

¹Jan Eliasson is a Swedish diplomat.

2 Methodology

The multi-instance learning (MIL) framework splits the training set M into multiple entity-pair bags {⟨h_1, t_1⟩, ⟨h_2, t_2⟩, ..., ⟨h_n, t_n⟩}. Each bag ⟨h_i, t_i⟩ contains sentences {x_1, x_2, ..., x_c} which mention both the head entity h_i and the tail entity t_i. The representation s of bag ⟨h_i, t_i⟩ is a weighted combination of the related sentence vectors {x_1, x_2, ..., x_c}, which are encoded by a CNN. Finally, we use a soft-label score function to correct wrong labels of entity-pair bags while computing probabilities for each relation type.

2.1 Sentence Encoder

We get the representation of a certain sentence x_i = {w_1, w_2, ..., w_m} by concatenating word embeddings {w_1, w_2, ..., w_m} and position embeddings (Zeng et al., 2014) {p_1, p_2, ..., p_m}, where the word embedding w_i ∈ R^{d_w}, the position embedding p_i ∈ R^{d_p}, and each token vector lies in R^d (d = d_w + d_p).

The convolution layer utilizes a sliding window of size l. We define q_i ∈ R^{l×d} as the concatenation of the words within the i-th window:

    q_i = w_{i−l+1:i}   (1 ≤ i ≤ m + l − 1)    (1)

The convolution matrix is denoted by W_c ∈ R^{d_c×(l×d)}, where d_c is the sentence embedding size. The i-th filter of the convolutional layer is computed as:

    f_i = [W_c q + b]_i    (2)

Afterwards, piecewise max-pooling (Zeng et al., 2015) is used to divide each convolutional filter f_i into three parts f_i^1, f_i^2, f_i^3 by the head and tail entities. For example, the sentence "Barack Obama was born in Honolulu in 1961" is divided into 'Barack Obama', 'was born in Honolulu' and 'in 1961'. We perform max-pooling on these three parts separately, and the i-th element of the sentence vector x ∈ R^{3d_c} is defined as the concatenation of them:

    x_i = [max(f_i^1); max(f_i^2); max(f_i^3)]    (3)
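To make the encoder concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3) on toy inputs. The function name, zero-padding of out-of-range windows, the tanh nonlinearity, and the simplified window-to-entity alignment of the three pieces are our own illustrative choices under the paper's definitions, not the authors' implementation.

```python
import numpy as np

# Toy dimensions matching Table 1: window l=3, word dim 50, position dim 5,
# filter dim 230. Two position embeddings (distance to head, distance to
# tail) are used in practice, so d = d_w + 2 * d_p in this sketch.
d_w, d_p, d_c, l = 50, 5, 230, 3
d = d_w + 2 * d_p

def pcnn_encode(tokens, W_c, b, head_idx, tail_idx):
    """Convolution (Eqs. 1-2) followed by piecewise max-pooling (Eq. 3).

    tokens: (m, d) matrix of concatenated word + position embeddings.
    W_c:    (d_c, l * d) convolution matrix; b: (d_c,) bias.
    Assumes 0 < head_idx < tail_idx < m - 1 so all three pieces are non-empty.
    """
    m = tokens.shape[0]
    padded = np.vstack([np.zeros((l - 1, d)), tokens, np.zeros((l - 1, d))])
    # Eq. 1: q_i concatenates the l tokens of the i-th (zero-padded) window.
    # Eq. 2: f_i = [W_c q + b]_i, evaluated at every window position.
    f = np.stack([W_c @ padded[i:i + l].reshape(-1) + b
                  for i in range(m + l - 1)], axis=1)        # (d_c, m+l-1)
    # Eq. 3: split each filter's outputs at the entity positions into three
    # pieces, max-pool each piece separately, then concatenate.
    lo, hi = sorted((head_idx, tail_idx))
    pieces = (f[:, :lo + 1], f[:, lo + 1:hi + 1], f[:, hi + 1:])
    return np.tanh(np.concatenate([p.max(axis=1) for p in pieces]))

# Usage on a random 12-token "sentence" with entities at positions 2 and 8.
rng = np.random.default_rng(0)
x = pcnn_encode(rng.normal(size=(12, d)), rng.normal(size=(d_c, l * d)),
                np.zeros(d_c), head_idx=2, tail_idx=8)
print(x.shape)  # (690,) = 3 * d_c
```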

2.2 Sentence-Level Weight Distribution

The representation of entity pair ⟨h_i, t_i⟩ is defined as a weighted combination of the sentences in the bag.

At-least-one: The at-least-one assumption is a down-sampling method which assumes that at least one sentence in the bag expresses the relation between the two entities, and selects the most likely sentence in the bag for training and prediction. More specifically, the weight of the selected sentence is 1 while those of the other sentences in the bag are all 0.

Selective attention: Lin et al. (2016) propose a selective attention mechanism to reduce the weights of noisy instances within the entity-pair bag:

    s = Σ_i α_i x_i,   α_i = exp(x_i A r) / Σ_k exp(x_k A r)    (4)

where α_i is the weight of sentence vector x_i, and A and r are the diagonal matrix and relation query parameters, respectively.
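A minimal sketch of Eq. (4), reusing the sentence vectors from the encoder sketch above; `A_diag` and `r` stand in for the trainable diagonal matrix A and relation query r, and are random placeholders here.

```python
import numpy as np

def bag_representation(X, A_diag, r):
    """Selective attention (Eq. 4): s = sum_i alpha_i x_i with
    alpha_i = softmax_i(x_i A r), A diagonal and r a relation query.
    Replacing alpha with a one-hot vector over the most likely sentence
    recovers the at-least-one scheme described above."""
    scores = X @ (A_diag * r)              # x_i A r, using A's diagonal only
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ X                       # weighted sum over the bag

# Toy bag: 4 sentence vectors of dimension 3 * d_c = 690.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 690))
s = bag_representation(X, A_diag=rng.normal(size=690), r=rng.normal(size=690))
print(s.shape)  # (690,)
```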

2.3 Soft-label Adjustment

The key of our method is to derive a soft label as the gold label for each bag dynamically during training, which is not necessarily the same as the distant-supervised (DS) label. We still use DS labels while testing. The soft label is determined dynamically, which means the same bag may have different soft labels in different training epochs. We propose the following joint score function to determine the soft label r_i for entity pair ⟨h_i, t_i⟩:

    r_i = argmax(o + max(o) · (A ⊙ L_i))    (5)

where o, A, L_i ∈ R^{d_r} (d_r is the number of predefined relations). The one-hot vector L_i indicates the DS label of ⟨h_i, t_i⟩. The relation confidence vector A represents the reliability of DS labels: each element of A is a decimal between 0 and 1 which indicates the confidence of the corresponding DS-labeled relation type. The ⊙ operation denotes element-wise product. o is the vector of relational scores based on the entity-pair representation s_i of ⟨h_i, t_i⟩, and max(o) is a scaling constant which restricts the effect of the DS label. The score of the t-th relation type, o_t, is calculated from the trained relation matrix M and bias b:

    o_t = exp((M s + b)_t) / Σ_k exp((M s + b)_k)    (6)

While training, we use an entity-pair level cross-entropy loss function with soft labels as gold labels:

    J(θ) = Σ_{i=1}^{n} log p(r_i | s_i; θ)    (7)

In the testing stage, we still use the DS label l_i of a certain entity pair ⟨h_i, t_i⟩ as the gold label:

    G(θ) = Σ_{i=1}^{n} log p(l_i | s_i; θ)    (8)
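To see how Eqs. (5)-(6) trade the model's own scores against the hard label, here is a minimal Python sketch of one soft-label decision. The function name and toy score vectors are ours; the confidence values follow the heuristic setting reported later in Section 3.3.

```python
import numpy as np

def soft_label(o, ds_label, conf):
    """Eq. (5): r_i = argmax(o + max(o) * (A ⊙ L_i)).

    o:        softmax relation scores for the bag (Eq. 6).
    ds_label: index of the distant-supervision (hard) label.
    conf:     confidence vector A; max(o) rescales the DS bonus so it
              stays comparable to the scores themselves.
    """
    L = np.zeros_like(o)
    L[ds_label] = 1.0
    return int(np.argmax(o + o.max() * conf * L))

d_r = 53                  # relation types in the benchmark, NA at index 0
conf = np.full(d_r, 0.7)  # heuristic setting from Section 3.3
conf[0] = 0.9             # NA labels are trusted more

# A bag whose scores strongly favor relation 7 overrides its DS label NA:
o = np.full(d_r, 0.001); o[7], o[0] = 0.9, 0.05; o /= o.sum()
print(soft_label(o, ds_label=0, conf=conf))   # -> 7 (label corrected)

# A bag with no strong evidence keeps its DS label:
o = np.full(d_r, 1.0 / d_r)
print(soft_label(o, ds_label=0, conf=conf))   # -> 0 (NA retained)
```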

3 Experiments

In this section, we first introduce the dataset and evaluation metrics used in our experiments. Then, we demonstrate the parameter settings. Besides, we compare the performance of our method with state-of-the-art feature-based and neural network baselines. Case study shows that our soft-label corrections are of high accuracy.

3.1 Dataset and Evaluation Metrics

We evaluate our model on the benchmark dataset proposed by Mintz et al. (2009), which has also been used by Riedel et al. (2010), Hoffmann et al. (2011), Zeng et al. (2015) and Lin et al. (2016). The dataset uses Freebase (Bollacker et al., 2008) as the distant-supervised knowledge base and the New York Times (NYT) corpus as the text resource. Sentences in the NYT of the years 2005-2006 are used as the training set while sentences of 2007 are used as the testing set. There are 53 possible relations, including NA, which indicates no relation. The training data contains 522,611 sentences, 281,270 entity pairs and 18,252 relational facts. The testing data contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts.

Similar to previous work, we report both aggregate precision/recall curves and top-N precision (P@N).

Window size   Word dimension   Position dimension   Filter dimension   Batch size   Learning rate   Dropout
l = 3         d_w = 50         d_p = 5              d_c = 230          B = 160      λ = 0.001       p = 0.5

Table 1: Parameter settings of our experiments.

Settings       One                          Two                          All
P@N (%)        100    200    300    Avg     100    200    300    Avg     100    200    300    Avg
ONE            73.3   64.8   56.8   65.0    70.3   67.2   63.1   66.9    72.3   69.7   64.1   68.7
+soft-label    77.0   72.5   67.7   72.4    80.0   74.5   69.7   74.7    84.0   81.0   74.0   79.7
ATT            73.3   69.2   60.8   67.8    77.2   71.6   66.1   71.6    76.2   73.1   67.4   72.2
+soft-label    84.0   75.5   68.3   75.9    86.0   77.0   73.3   78.8    87.0   84.5   77.0   82.8

Table 2: Top-N precision (P@N) for relation extraction on entity pairs with different numbers of sentences. Following Lin et al. (2016), the One, Two and All test settings randomly select one/two/all sentences from the bags of entity pairs in the testing set which have more than one sentence, and predict relations on the sampled bags.
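A small sketch of this evaluation protocol and of the P@N metric, assuming each test bag is a list of sentences and predictions come pre-ranked by score; the helper names `sample_bag` and `precision_at_n` are illustrative, not from any released evaluation script.

```python
import random

def sample_bag(sentences, setting, rng):
    """One/Two/All settings of Lin et al. (2016): from each test bag with
    more than one sentence, keep one, two, or all randomly chosen
    sentences before predicting the bag's relation."""
    assert len(sentences) > 1
    k = {"One": 1, "Two": 2, "All": len(sentences)}[setting]
    return rng.sample(sentences, k)

def precision_at_n(ranked, n):
    """P@N: fraction of the n top-scored extracted facts that are correct.
    ranked: list of (score, is_correct) pairs sorted by score, descending."""
    return sum(ok for _, ok in ranked[:n]) / n

# Toy usage: 300 ranked predictions, roughly 70% of them correct.
rng = random.Random(0)
ranked = sorted(((rng.random(), rng.random() < 0.7) for _ in range(300)),
                key=lambda p: -p[0])
print(sample_bag(["s1", "s2", "s3"], "Two", rng))
print(round(precision_at_n(ranked, 100), 2))
```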

3.2 Comparison with Previous Work

Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIMLRE (Surdeanu et al., 2012) are feature-based models. PCNN-ONE (Zeng et al., 2015) and PCNN-ATT (Lin et al., 2016) are piecewise convolutional neural network (PCNN) models based on the at-least-one assumption and selective attention, respectively, as introduced in Section 2.2. All the results of the compared models come from the data reported in their papers.

3.3 Experimental Settings

We use cross-validation to determine the parameters of our model. The soft-label method uses PCNN-ONE/PCNN-ATT to represent the bags of entity pairs, and we don't tune the parameters of PCNN-ONE/PCNN-ATT, for fair comparison. So we use the same pre-trained word embeddings and CNN encoder parameters as those of Lin et al. (2016). Detailed parameter settings are shown in Table 1. Moreover, we use the Adam optimizer. Besides, to avoid the negative effects of dominant NA instances at the beginning of training, the soft-label method is adopted after 3000 steps of parameter updates. The confidence vector A is heuristically set as [0.9, 0.7, ..., 0.7] (the confidence of NA is 0.9 while the confidences of all other relations are 0.7).

[Figure 2: Precision/Recall curves of our model and previous state-of-the-art systems. Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIMLRE (Surdeanu et al., 2012) are feature-based models. ONE (Zeng et al., 2015) and ATT (Lin et al., 2016) are neural network models based on the at-least-one assumption and selective attention, respectively.]

3.4 Precision/Recall Curves

We have the following observations from Figure 2: (1) For both the ATT and ONE configurations, the soft-label method achieves higher precision than the baselines when recall is greater than 0.05. After manual evaluation, we find that most wrong instances below 0.05 recall are wrongly labeled entity pairs in the test set. (2) Even the weaker baseline PCNN-ONE gets a slightly better performance with soft labels than PCNN-ATT. (3) When recall is between 0.05 and 0.15, the curve of our model ATT+soft-label is relatively stable, which demonstrates that the soft-label method can obtain relatively stable performance in extracting relational facts.

3.5 Top-N Precision

Table 2 shows the top-N precision (P@N) of the state-of-the-art systems and our model. We can see that: (1) For both the PCNN-ONE and PCNN-ATT models, the soft-label method improves the precision by over 10% in all test settings, which demonstrates the effectiveness of our model. (2) Even a weaker baseline (PCNN-ONE) with the soft-label method achieves higher precision than a strong model (PCNN-ATT). This shows that the entity-pair level denoise model performs much better than models which only focus on sentence-level noise.

False positive: Place lived → Place of death
    Fernand Nault, one of Canada's foremost dance figures, died in Montreal on Tuesday.
False positive: Place lived → NA
    ..., a daughter of representative ..., and ... of San Francisco, was married yesterday to ....
False negative: NA → Nationality
    By spring the renowned chef Gordon Ramsay of England should be in ... hotels here.
False negative: NA → Work in
    ..., said Billy Cox, a spokesman for the United States Department of Agriculture.

Table 3: Some examples of soft-label corrections during training.

3.6 Case Study

Some examples of soft-label corrections during training are shown in Table 3. We can see that the soft-label method recognizes both false positives and false negatives during training and corrects the wrong labels successfully. The two sentences above are mislabeled as Place lived because the triple facts (Fernand Nault, Place lived, Montreal) and (Alexandra Pelosi, Place lived, San Francisco) exist in Freebase. However, the two sentences fail to express the Place lived relation. Our model automatically corrects them by soft-label adjustment. The two sentences below show that our model can also change false negative (NA) examples caused by missing facts in Freebase into correct ones. Besides, our model has a strong ability to distinguish different relational patterns, even for similar relations like Place lived, Place of birth and Place of death.

4 Error Analysis

We randomly select 200 instances of soft-label corrections during training for PCNN-ONE and PCNN-ATT respectively and check them manually. The accuracy of the soft-label corrections for PCNN-ONE is 88.5% (177/200) while that for PCNN-ATT is 92% (184/200). We notice that the accuracy of PCNN-ATT+soft-label is slightly higher than that of PCNN-ONE+soft-label, which matches our expectation. As explained in Section 2.2, PCNN-ATT has better bag representations than PCNN-ONE because it can reduce the effect of noisy instances within the bag. The soft label of a certain bag is determined by its bag representation and the confidence of the corresponding DS label, so the accuracy of soft-label corrections for PCNN-ATT benefits from the better bag representations.

Although most of the soft-label corrections are of high accuracy (90.25% overall), there are still several wrong corrections. Table 4 lists two typical wrong corrections during training.

Case 1: Place of birth → Nationality
    Marcus Samuelsson began ... when he was visiting his native ....
    Marcus Samuelsson, chef, born in Ethiopia and raised in Sweden ....
Case 2: Location contains → NA
    ..., he is from neighboring towns in Georgia ... (such as Blairsville and Young Harris).

Table 4: Two typical wrong corrections of soft-label adjustment during training.

Wrong corrections like Case 1 fail to distinguish similar relations between entities (both Nationality and Place of birth are relations between people and locations) because of their similar sentence patterns. However, wrong corrections like Case 1 are rare (5/39) in our experiments; the soft-label method can still distinguish most similar relations, as shown in Section 3.6. In Case 2, the factual relation Location contains is mistaken as NA, partially because the relational pattern of this sentence is somewhat different from the regular Location contains pattern. Additionally, the soft-label method has a tendency to label ambiguous facts as NA because negative instances (NA) are dominant in the corpus. However, most bags which are soft-labeled as NA are still well labeled in our experiments.

We argue that the minor wrong corrections of relational facts during training don't affect the overall performance much, because distant supervision doesn't lack instances of relational facts thanks to its strong ability to automatically label large amounts of web text.

5 Conclusion and Future Work

This paper proposes a noise-tolerant method to combat wrong labels in distant-supervised relation extraction with soft labels. Our model focuses on entity-pair level noise while previous models only dealt with sentence-level noise. Our model achieves significant improvement over baselines on the benchmark dataset. Case study shows that the soft-label corrections are of high accuracy.

In the future, we plan to develop a new measurement for the reliability of a certain distantly supervised label by evaluating the similarity between the instance and the potential correctly labeled instances, instead of using a heuristically set confidence vector. In addition, we intend to find a more suitable sentence encoder than the piecewise CNN for our soft-label method.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. This work is supported by the National Key Basic Research Program of China (No. 2014CB340504) and the National Natural Science Foundation of China (No. 61375074, 61273318). The contact authors are Zhifang Sui and Baobao Chang.

References

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 541–550. Association for Computational Linguistics.

Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3060–3066.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL, volume 1, pages 2124–2133.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1003–1011. Association for Computational Linguistics.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Alan Ritter, Luke Zettlemoyer, Oren Etzioni, et al. 2013. Modeling missing data in distant supervision for information extraction. Transactions of the Association for Computational Linguistics, 1:367–378.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 455–465. Association for Computational Linguistics.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013. Filling knowledge base gaps for distant supervision of relation extraction. In ACL (2), pages 665–670.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In COLING, pages 2335–2344.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, pages 17–21.
