Concurrent Learning of Semantic Relations
Georgios Balikas, Kelkoo Group, Grenoble, France ([email protected])
Gaël Dias, University of Caen Normandy, France ([email protected])
Rumen Moraliyski, Kodar Ltd, Plovdiv, Bulgaria ([email protected])
Massih-Reza Amini, University of Grenoble Alps, France ([email protected])

arXiv:1807.10076v3 [cs.CL] 30 Jul 2018

Abstract

Discovering whether words are semantically related and identifying the specific semantic relation that holds between them is of crucial importance for NLP, as it is essential for tasks like query expansion in IR. Within this context, different methodologies have been proposed that either exclusively focus on a single lexical relation (e.g. hypernymy vs. random) or learn specific classifiers capable of identifying multiple semantic relations (e.g. hypernymy vs. synonymy vs. random). In this paper, we propose another way to look at the problem that relies on the multi-task learning paradigm. In particular, we want to study whether the learning process of a given semantic relation (e.g. hypernymy) can be improved by the concurrent learning of another semantic relation (e.g. co-hyponymy). Within this context, we particularly examine the benefits of semi-supervised learning, where the training of a prediction function is performed over few labeled data jointly with many unlabeled ones. Preliminary results based on simple learning strategies and state-of-the-art distributional feature representations show that concurrent learning can lead to improvements in a vast majority of tested situations.

1 Introduction

The ability to automatically identify semantic relations is an important issue for Information Retrieval (IR) and Natural Language Processing (NLP) applications such as question answering (Dong et al., 2017), query expansion (Kathuria et al., 2017), or text summarization (Gambhir and Gupta, 2017). Semantic relations embody a large number of symmetric and asymmetric linguistic phenomena such as synonymy (bike ↔ bicycle), co-hyponymy (bike ↔ scooter), hypernymy (bike → tandem) or meronymy (bike → chain), but more can be enumerated (Vylomova et al., 2016).

Most approaches focus on modeling a single semantic relation and consist in deciding whether a given relation r holds between a pair of words (x, y) or not. Within this context, the vast majority of efforts (Snow et al., 2004; Roller et al., 2014; Shwartz et al., 2016; Nguyen et al., 2017a) concentrate on hypernymy, which is the key organization principle of semantic memory, but studies exist on antonymy (Nguyen et al., 2017b), meronymy (Glavas and Ponzetto, 2017) and co-hyponymy (Weeds et al., 2014). Another research direction consists in dealing with multiple semantic relations at a time and can be defined as deciding which semantic relation $r_i$ (if any) holds between a pair of words (x, y). This multi-class problem is challenging, as it is known that distinguishing between different semantic relations (e.g. synonymy and hypernymy) is difficult (Shwartz et al., 2016). Within the CogALex-V shared task¹, which aims at tackling synonymy, antonymy, hypernymy and meronymy as a multi-class problem, the best performing systems are proposed by (Shwartz and Dagan, 2016) and (Attia et al., 2016).

¹ https://sites.google.com/site/cogalex2016/home/shared-task

In this paper, we propose another way to look at the problem of semantic relation identification, based on the following findings. First, symmetric similarity measures, which capture synonymy (Kiela et al., 2015), have been shown to play an important role in hypernymy detection (Santus et al., 2017). Second, (Yu et al., 2015) show that learning term embeddings that take into account co-hyponymy similarity improves supervised hypernymy identification. As a consequence, we propose to study whether the learning process of a given semantic relation can be improved by the concurrent learning of another relation, where the semantic relations are either synonymy, co-hyponymy or hypernymy. For that purpose, we propose a multi-task learning strategy using a hard parameter sharing neural network model that takes as input a learning word pair (x, y) encoded as a feature vector representing the concatenation² of the respective word embeddings of x and y, noted $\vec{x} \oplus \vec{y}$. The intuition behind our experiments is that if the tasks are correlated, the neural network should improve its generalization ability by taking into account the shared information.

² Best configuration reported in (Shwartz et al., 2016) for standard non path-based supervised learning.
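The paper gives no implementation details at this point, so the following is only a minimal sketch of what a hard parameter sharing model of this kind could look like. It assumes PyTorch; the class name, the hidden size of 128, the 300-dimensional embeddings and the binary heads are all illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Hard parameter sharing: one shared encoder, one head per task.

    The input is the concatenation of the two word embeddings (x ⊕ y).
    Layer sizes are illustrative, not the paper's.
    """
    def __init__(self, emb_dim=300, hidden=128, n_tasks=2):
        super().__init__()
        # Shared layers: every task backpropagates through these weights.
        self.shared = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One binary head per semantic relation (e.g. hypernymy, co-hyponymy).
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2) for _ in range(n_tasks)]
        )

    def forward(self, x_emb, y_emb, task_id):
        pair = torch.cat([x_emb, y_emb], dim=-1)  # x ⊕ y
        return self.heads[task_id](self.shared(pair))

# Training would alternate mini-batches between tasks, so that the
# shared layers receive supervision from both semantic relations.
model = HardSharingMTL()
logits = model(torch.randn(4, 300), torch.randn(4, 300), task_id=0)
```

The point of hard parameter sharing is that the shared encoder receives gradients from every task, which is what would let a correlated auxiliary relation act as a regularizer for the target relation.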
In parallel, we propose to study the generalization ability of the multi-task learning model based on a limited set of labeled word pairs and a large number of unlabeled samples, i.e. following a semi-supervised paradigm. As far as we know, most related works rely on the existence of a huge number of validated word pairs present in knowledge databases (e.g. WordNet) to perform the supervised learning process. However, such resources may not be available for specific languages or domains. Moreover, it is unlikely that human cognition and its generalization capacity rely on an equivalent number of positive examples. As such, semi-supervised learning offers a much more interesting framework, where unlabeled word pairs can be obtained massively³ through selected lexico-syntactic patterns (Hearst, 1992) or paraphrase alignments (Dias et al., 2010). To test our hypotheses, we propose a self-learning strategy where confidently tagged unlabeled word pairs are iteratively added to the labeled dataset.

³ Even though with some error rate.
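The self-learning loop is only described at this level of generality. One plausible reading is sketched below; the function name, the 0.9 confidence threshold and the iteration cap are assumptions, and `clf` stands for any classifier with a scikit-learn-style `fit`/`predict_proba` interface.

```python
import numpy as np

def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Iteratively move confidently predicted unlabeled pairs into the
    labeled set. `threshold` is an assumed hyperparameter, not a value
    from the paper."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        conf = proba.max(axis=1)
        keep = conf >= threshold           # confidently tagged pairs
        if not keep.any():
            break                          # nothing confident enough: stop
        # Add the confident pairs with their predicted (pseudo) labels.
        X_lab = np.vstack([X_lab, X_unlab[keep]])
        y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
        X_unlab = X_unlab[~keep]
    return clf
```

The early stop when no prediction clears the threshold is one way to keep the noisy pseudo-labels (the "error rate" noted above) from flooding the labeled set.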
Preliminary results based on simple (semi-supervised) multi-task learning models with state-of-the-art word pair representations (i.e. the concatenation of GloVe (Pennington et al., 2014) word embeddings) over the gold standard dataset ROOT9 (Santus et al., 2016) and the RUMEN dataset proposed in this paper show that classification improvements can be obtained for a wide range of tested configurations.

2 Related Work

Whether semantic relation identification has been tackled as a one-class or a multi-class problem, two main approaches have been proposed to capture the semantic links between two words (x, y): pattern-based and distributional. Pattern-based (a.k.a. path-based) methods base their decisions on the analysis of the lexico-syntactic patterns (e.g. "X such as Y") that connect the joint occurrences of x and y. Within this context, earlier works were proposed by (Hearst, 1992) (unsupervised) and (Snow et al., 2004) (supervised) to detect hypernymy. However, this approach suffers from sparse coverage and favors precision over recall. To overcome these limitations, recent one-class studies on hypernymy (Shwartz et al., 2016) and antonymy (Nguyen et al., 2017b), as well as multi-class approaches (Shwartz and Dagan, 2016), have focused on representing dependency patterns as continuous vectors using long short-term memory (LSTM) networks. Within this context, successful results have been evidenced, but (Shwartz et al., 2016; Nguyen et al., 2017b) also show that combining pattern-based methods with the distributional approach greatly improves performance.

In distributional methods, the decision whether x stands in a semantic relation with y is based on the distributional representation of these words following the distributional hypothesis (Harris, 1954), i.e. on the separate contexts of x and y. Earlier works developed symmetric (Dias et al., 2010) and asymmetric (Kotlerman et al., 2010) similarity measures based on discrete representation vectors, followed by numerous supervised learning strategies for a wide range of semantic relations (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014), where word pairs are encoded as the concatenation of the constituent word representations ($\vec{x} \oplus \vec{y}$) or their vector difference ($\vec{x} - \vec{y}$); a toy sketch of both encodings is given at the end of this section. More recently, attention has focused on identifying semantic relations using neural language embeddings, as such semantic spaces encode linguistic regularities (Mikolov et al., 2013). Within this context, (Vylomova et al., 2016) proposed an exhaustive study for a wide range of semantic relations and showed that, under suitable supervised training, high performance can be obtained. However, (Vylomova et al., 2016) also showed that some relations, such as hypernymy, are more difficult to model than others. As a consequence, new proposals have appeared that tune word embeddings for this specific task, such that hypernyms and hyponyms should be close to each other in the semantic space (Yu et al., 2015; Nguyen et al., 2017a).

In this paper, we propose an attempt to deal with semantic relation identification based on a multi-task strategy, as opposed to previous one-class and multi-class approaches. Our main scope is to analyze whether a link exists between the learning processes of related semantic relations. The closest approach to ours is proposed by (Attia et al., 2016), which develops a multi-task convolutional neural network for multi-class semantic relation classification supported by relatedness classification. As such, it can be seen as a domain adaptation problem. Within the scope of our paper, we aim at studying semantic inter-relationships at a much finer grain and understanding the cognitive links that may exist between synonymy, co-hyponymy and hypernymy, that form the back-

that hold between words can benefit classification models across similar tasks. For instance, learning that bike is the hypernym of mountain bike should help while classifying mountain bike and tandem bicycle as co-hyponyms, as it is likely that tandem bicycle shares some relation with bike. To test this hypothesis, we propose to use a multi-task learning approach. Multi-task learning (Caruana, 1998) has been empirically validated and has shown to be effective in a variety of NLP tasks ranging from sentiment analysis and part-of-speech tagging to text parsing (Braud et al., 2016; Bingel and Søgaard, 2017).
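As referenced above, here is a toy sketch of the two standard pair encodings from the distributional literature. The `glove` dictionary and the `pair_features` helper are stand-ins for illustration only; a real setup would load pretrained GloVe vectors rather than random ones.

```python
import numpy as np

# Stand-in for a pretrained GloVe lookup: word -> 300-d vector.
rng = np.random.default_rng(0)
glove = {"bike": rng.standard_normal(300), "tandem": rng.standard_normal(300)}

def pair_features(x, y, mode="concat"):
    """Encode a word pair as the concatenation (x ⊕ y) or as the
    vector difference (x − y), the two encodings cited above."""
    vx, vy = glove[x], glove[y]
    return np.concatenate([vx, vy]) if mode == "concat" else vx - vy

concat_feats = pair_features("bike", "tandem", mode="concat")  # shape (600,)
diff_feats = pair_features("bike", "tandem", mode="diff")      # shape (300,)
```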