Relational Word Embeddings

Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert
School of Computer Science and Informatics, Cardiff University, United Kingdom
{camachocolladosj,espinosa-ankel,schockaerts1}@cardiff.ac.uk

Abstract

While word embeddings have been shown to implicitly encode various forms of attributional knowledge, the extent to which they capture relational information is far more limited. In previous work, this limitation has been addressed by incorporating relational knowledge from external knowledge bases when learning the word embedding. Such strategies may not be optimal, however, as they are limited by the coverage of available resources and conflate similarity with other forms of relatedness. As an alternative, in this paper we propose to encode relational knowledge in a separate word embedding, which is aimed to be complementary to a given standard word embedding. This relational word embedding is still learned from co-occurrence statistics, and can thus be used even when no external knowledge base is available. Our analysis shows that relational word vectors do indeed capture information that is complementary to what is encoded in standard word embeddings.

1 Introduction

Word embeddings are paramount to the success of current natural language processing (NLP) methods. Apart from the fact that they provide a convenient mechanism for encoding textual information in neural network models, their importance mainly stems from the remarkable amount of linguistic and semantic information that they capture. For instance, the vector representation of the word Paris implicitly encodes that this word is a noun, and more specifically a capital city, and that it describes a location in France. This information arises because word embeddings are learned from co-occurrence counts, and properties such as being a capital city are reflected in such statistics. However, the extent to which relational knowledge (e.g. Trump was the successor of Obama) can be learned in this way is limited.

Previous work has addressed this by incorporating external knowledge graphs (Xu et al., 2014; Celikyilmaz et al., 2015) or relations extracted from text (Chen et al., 2016). However, the success of such approaches depends on the amount of available relational knowledge. Moreover, they only consider well-defined discrete relation types (e.g. is the capital of, or is a part of), whereas the appeal of vector space representations largely comes from their ability to capture subtle aspects of meaning that go beyond what can be expressed symbolically. For instance, the relationship between popcorn and cinema is intuitively clear, but it is more subtle than the assertion that "popcorn is located at cinema", which is how ConceptNet (Speer et al., 2017), for example, encodes this relationship[1].

In fact, regardless of how a word embedding is learned, if its primary aim is to capture similarity, there are inherent limitations on the kinds of relations it can capture. For instance, such word embeddings can only encode similarity-preserving relations (i.e. similar entities have to be related to similar entities), and it is often difficult to encode that w is in a particular relationship while preventing the inference that words with similar vectors to w are also in this relationship; e.g. Bouraoui et al. (2018) found that both (Berlin, Germany) and (Moscow, Germany) were predicted to be instances of the capital-of relation due to the similarity of the word vectors for Berlin and Moscow. Furthermore, while the ability to capture word analogies (e.g. king - man + woman ≈ queen) emerged as a successful illustration of how word embeddings can encode some types of relational information (Mikolov et al., 2013b), the generalization of this interesting property has proven to be less successful than initially anticipated (Levy et al., 2014; Linzen, 2016; Rogers et al., 2017).

[1] http://conceptnet.io/c/en/popcorn

This suggests that relational information has to be encoded separately from standard similarity-centric word embeddings. One appealing strategy is to represent relational information by learning, for each pair of related words, a vector that encodes how the words are related. This strategy was first adopted by Turney (2005), and has recently been revisited by a number of authors (Washio and Kato, 2018a; Jameel et al., 2018; Espinosa Anke and Schockaert, 2018; Washio and Kato, 2018b; Joshi et al., 2019). However, in many applications, word vectors are easier to deal with than vector representations of word pairs.

The research question we consider in this paper is whether it is possible to learn word vectors that capture relational information. Our aim is for such relational word vectors to be complementary to standard word vectors. To make relational information available to NLP models, it then suffices to use a standard architecture and replace normal word vectors by concatenations of standard and relational word vectors. In particular, we show that such relational word vectors can be learned directly from a given set of relation vectors.

2 Related Work

Relation Vectors. A number of approaches have been proposed that are aimed at learning relation vectors for a given set of word pairs (a, b), based on sentences in which these word pairs co-occur. For instance, Turney (2005) introduced a method called Latent Relational Analysis (LRA), which relies on first identifying a set of sufficiently frequent lexical patterns and then constructs a matrix which encodes, for each considered word pair (a, b), how frequently each pattern P appears in between a and b in sentences that contain both words. Relation vectors are then obtained using singular value decomposition. More recently, Jameel et al. (2018) proposed an approach inspired by the GloVe word embedding model (Pennington et al., 2014) to learn relation vectors based on co-occurrence statistics between the target word pair (a, b) and other words. Along similar lines, Espinosa Anke and Schockaert (2018) learn relation vectors based on the distribution of words occurring in sentences that contain a and b, by averaging the word vectors of these co-occurring words. Then, a conditional autoencoder is used to obtain lower-dimensional relation vectors.

Taking a slightly different approach, Washio and Kato (2018a) train a neural network to predict dependency paths from a given word pair. Their approach uses standard word vectors as input, hence relational information is encoded implicitly in the weights of the neural network, rather than as relation vectors (although the output of this neural network, for a given word pair, can still be seen as a relation vector). An advantage of this approach, compared to methods that explicitly construct relation vectors, is that evidence obtained for one word is essentially shared with similar words (i.e. words whose standard word vector is similar). Among others, this means that their approach can in principle model relational knowledge for word pairs that never co-occur in the same sentence. A related approach, presented in (Washio and Kato, 2018b), uses lexical patterns, as in the LRA method, and trains a neural network to predict vector encodings of these patterns from two given word vectors. In this case, the word vectors are updated together with the neural network and an LSTM to encode the patterns. Finally, a similar approach is taken by the Pair2Vec method proposed in (Joshi et al., 2019), where the focus is on learning relation vectors that can be used for cross-sentence attention mechanisms in tasks such as question answering and natural language inference.

Despite the fact that such methods learn word vectors from which relation vectors can be predicted, it is unclear to what extent these word vectors themselves capture relational knowledge. In particular, the aforementioned methods have thus far only been evaluated in settings that rely on the predicted relation vectors. Since these predictions are made by relatively sophisticated neural network architectures, it is possible that most of the relational knowledge is still captured in the weights of these networks, rather than in the word vectors. Another problem with these existing approaches is that they are computationally very expensive to train; e.g. the Pair2Vec model is reported to need 7-10 days of training on unspecified hardware[2]. In contrast, the approach we propose in this paper is computationally much simpler, while resulting in relational word vectors that encode relational information more accurately than those of the Pair2Vec model in lexical semantics tasks, as we will see in Section 5.

[2] github.com/mandarjoshi90/pair2vec

Knowledge-Enhanced Word Embeddings. Several authors have tried to improve word embeddings by incorporating external knowledge bases. For example, some authors have proposed models which combine the loss function of a word embedding model, to ensure that word vectors are predictive of their context words, with the loss function of a knowledge graph embedding model, to encourage the word vectors to additionally be predictive of a given set of relational facts (Xu et al., 2014; Celikyilmaz et al., 2015; Chen et al., 2016). Other authors have used knowledge bases in a more restricted way, by taking the fact that two words are linked to each other in a given knowledge graph as evidence that their word vectors should be similar (Faruqui et al., 2015; Speer et al., 2017). Finally, there has also been work that uses lexicons to learn word embeddings which are specialized towards certain types of lexical knowledge, such as hypernymy (Nguyen et al., 2017; Vulic and Mrksic, 2018), antonymy (Liu et al., 2015; Ono et al., 2015) or a combination of various linguistic constraints (Mrkšić et al., 2017).

Our method differs in two important ways from these existing approaches. First, rather than relying on an external knowledge base, or other forms of supervision, as in e.g. (Chen et al., 2016), our method is completely unsupervised, as our only input consists of a text corpus. Second, whereas existing work has focused on methods for improving word embeddings, our aim is to learn vector representations that are complementary to standard word embeddings.

3 Model Description

We aim to learn representations that are complementary to standard word vectors and are specialized towards relational knowledge. To differentiate them from standard word vectors, they will be referred to as relational word vectors. We write e_w for the relational word vector representation of w. The main idea of our method is to first learn, for each pair of closely related words w and v, a relation vector r_wv that captures how these words are related, which we discuss in Section 3.1. In Section 3.2 we then explain how we learn relational word vectors from these relation vectors.

3.1 Unsupervised Relation Vector Learning

Our goal here is to learn relation vectors for closely related words. For both the selection of the vocabulary and the method to learn relation vectors, we mainly follow the initialization method of Camacho-Collados et al. (2019, RELATIVE_init), except for an important difference explained below regarding the symmetry of the relations. Other relation embedding methods could be used as well, e.g., (Jameel et al., 2018; Washio and Kato, 2018b; Espinosa Anke and Schockaert, 2018; Joshi et al., 2019), but this method has the advantage of being highly efficient. In the following we describe this procedure for learning relation vectors: we first explain how a set of potentially related word pairs is selected, and then focus on how relation vectors r_wv for these word pairs can be learned.

Selecting Related Word Pairs. Starting from a vocabulary V containing the words of interest (e.g. all sufficiently frequent words), as a first step we need to choose a set R ⊆ V × V of potentially related words. For each of the word pairs in R we will then learn a relation vector, as explained below. To select this set R, we only consider word pairs that co-occur in the same sentence in a given reference corpus. For all such word pairs, we then compute their strength of relatedness following Levy et al. (2015a) by using a smoothed version of pointwise mutual information (PMI), where we use 0.5 as exponent factor. In particular, for each word w ∈ V, the set R contains all sufficiently frequently co-occurring pairs (w, v) for which v is within the top-100 most closely related words to w, according to the following score:

    $\mathrm{PMI}_{0.5}(w, v) = \log\left(\frac{n_{wv} \cdot s_{**}}{n_{w*} \cdot s_{v*}}\right)$    (1)

where n_wv is the harmonically weighted[3] number of times the words w and v occur in the same sentence within a distance of at most 10 words, and:

    $n_{w*} = \sum_{u \in V} n_{wu}, \quad s_{v*} = n_{v*}^{0.5}, \quad s_{**} = \sum_{u \in V} s_{u*}$

This smoothed variant of PMI has the advantage of being less biased towards infrequent (and thus typically less informative) words.

[3] A co-occurrence in which there are k words in between w and v receives a weight of 1/(k+1).
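The selection step only needs co-occurrence counts and Equation (1). The sketch below illustrates it in Python under the assumption that the harmonically weighted counts have already been collected from the corpus; the data layout, the function name and the use of the weighted counts for the frequency threshold are our own choices for the sketch, not taken from the released implementation.

```python
import math

def select_related_pairs(cooc, top_k=100, min_count=25, alpha=0.5):
    """Build the set R of potentially related word pairs (Section 3.1).

    cooc: dict mapping a word w to a dict {v: n_wv}, where n_wv is the
    harmonically weighted co-occurrence count of w and v within a
    10-word window (a gap of k words contributes 1/(k+1)).
    """
    n_w = {w: sum(neighbours.values()) for w, neighbours in cooc.items()}  # n_{w*}
    s_v = {v: n ** alpha for v, n in n_w.items()}                          # s_{v*} = n_{v*}^0.5
    s_all = sum(s_v.values())                                              # s_{**}

    def pmi(w, v, n_wv):
        # Smoothed PMI of Equation (1)
        return math.log((n_wv * s_all) / (n_w[w] * s_v[v]))

    pairs = set()
    for w, neighbours in cooc.items():
        scored = [(pmi(w, v, c), v) for v, c in neighbours.items()
                  if v in s_v and c >= min_count]
        # keep the top-100 most closely related words per target word
        for _, v in sorted(scored, reverse=True)[:top_k]:
            pairs.add((w, v))
    return pairs
```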

Learning Relation Vectors. In this paper, we will rely on word vector averaging for learning relation vectors, which has the advantage of being much faster than other existing approaches, and thus allows us to consider a higher number of word pairs (or a larger corpus) within a fixed time budget. Word vector averaging has moreover proven surprisingly effective for learning relation vectors (Weston et al., 2013; Hashimoto et al., 2015; Fan et al., 2015; Espinosa Anke and Schockaert, 2018), as well as in related tasks such as sentence embedding (Wieting et al., 2016).

Specifically, to construct the relation vector r_wv capturing the relationship between the words w and v we proceed as follows. First, we compute a bag-of-words representation {(w_1, f_1), ..., (w_n, f_n)}, where f_i is the number of times the word w_i occurs in between the words w and v in any given sentence in the corpus. The relation vector r_wv is then essentially computed as a weighted average:

    $r_{wv} = \mathrm{norm}\Big(\sum_{i=1}^{n} f_i \cdot w_i\Big)$    (2)

where we write w_i for the vector representation of w_i in some given pre-trained word embedding, and $\mathrm{norm}(v) = \frac{v}{\|v\|}$.

In contrast to other approaches, we do not differentiate between sentences where w occurs before v and sentences where v occurs before w. This means that our relation vectors are symmetric, in the sense that r_wv = r_vw. This has the advantage of alleviating sparsity issues. While the directionality of many relations is important, the direction can often be recovered from other information we have about the words w and v. For instance, knowing that w and v are in a capital-of relationship, it is trivial to derive that "v is the capital of w", rather than the other way around, if we also know that w is a country.
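A minimal sketch of Equation (2), assuming the in-between word counts f_i have already been extracted for a given pair and that pre-trained word vectors are available as NumPy arrays; the helper name is illustrative rather than the paper's code.

```python
import numpy as np

def relation_vector(between_counts, word_vectors):
    """Relation vector r_wv of Equation (2): a normalised weighted average
    of the vectors of the words seen between w and v (counted in either
    order, so the result is symmetric: r_wv = r_vw).

    between_counts: dict {word w_i: count f_i}
    word_vectors:   dict {word: np.ndarray} of pre-trained vectors
    """
    dim = len(next(iter(word_vectors.values())))
    acc = np.zeros(dim)
    for wi, fi in between_counts.items():
        if wi in word_vectors:
            acc += fi * word_vectors[wi]
    norm = np.linalg.norm(acc)
    return acc / norm if norm > 0 else acc
```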

3.2 Learning Relational Word Vectors

The relation vectors r_wv capture relational information about the word pairs in R. The relational word vectors will be induced from these relation vectors by encoding the requirement that e_w and e_v should be predictive of r_wv, for each (w, v) ∈ R. To this end, we use a simple neural network with one hidden layer[4], whose input is given by (e_w + e_v) ⊕ (e_w ⊙ e_v), where we write ⊕ for vector concatenation and ⊙ for component-wise multiplication. Note that the input needs to be symmetric, given that our relation vectors are symmetric, which makes vector addition and component-wise multiplication two straightforward encoding choices. Figure 1 depicts an overview of the architecture of our model.

Figure 1: Relational word embedding architecture. At the bottom of the figure, the input layer is constructed from the relational word embeddings e_w and e_v, which are the vectors to be learnt. As shown at the top, we aim to predict the target relation vector r_wv.

The network is defined as follows:

    $i_{wv} = (e_w + e_v) \oplus (e_w \odot e_v)$
    $h_{wv} = f(X i_{wv} + a)$    (3)
    $o_{wv} = f(Y h_{wv} + b)$

for some activation function f. We train this network to predict the relation vector r_wv, by minimizing the following loss:

    $L = \sum_{(w,v) \in R} \| o_{wv} - r_{wv} \|^2$    (4)

The relational word vectors e_w can be initialized using standard word embeddings trained on the same corpus.

[4] More complex architectures could be used, e.g., (Joshi et al., 2019), but in this case we decided to use a simple architecture, as the main aim of this paper is to encode all relational information into the word vectors, not in the network itself.
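The model of Equations (3) and (4) is small enough to write down directly. The sketch below is a PyTorch rendering under the choices stated in the paper (ReLU as f, a 600-dimensional hidden layer, embeddings initialised from 300-dimensional vectors trained on the same corpus); the class name, batching and training-loop details are our own assumptions, not the released code.

```python
import torch
import torch.nn as nn

class RelationalWordEmbedding(nn.Module):
    def __init__(self, pretrained, rel_dim=300, hidden_dim=600):
        super().__init__()
        # e_w: trainable, initialised with standard word vectors (e.g. FastText)
        self.emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
        in_dim = 2 * pretrained.size(1)               # (e_w + e_v) ⊕ (e_w ⊙ e_v)
        self.hidden = nn.Linear(in_dim, hidden_dim)   # X, a in Eq. (3)
        self.out = nn.Linear(hidden_dim, rel_dim)     # Y, b in Eq. (3)
        self.f = nn.ReLU()                            # activation f

    def forward(self, w_idx, v_idx):
        e_w, e_v = self.emb(w_idx), self.emb(v_idx)
        i_wv = torch.cat([e_w + e_v, e_w * e_v], dim=-1)     # symmetric input
        return self.f(self.out(self.f(self.hidden(i_wv))))  # o_wv

def train_step(model, optimizer, w_idx, v_idx, r_wv):
    # One gradient step on the squared-error loss of Eq. (4)
    optimizer.zero_grad()
    loss = ((model(w_idx, v_idx) - r_wv) ** 2).sum()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, the updated embedding matrix provides the relational word vectors e_w used in the experiments below; the network weights themselves are not needed further.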
4 Experimental Setting

In what follows, we detail the resources and training details that we used to obtain the relational word vectors.

Corpus and Word Embeddings. We followed the setting of Joshi et al. (2019) and used the English Wikipedia[5] as input corpus. Multiwords (e.g. Manchester United) were grouped together as a single token by following the same approach described in Mikolov et al. (2013a). As word embeddings, we used 300-dimensional FastText vectors (Bojanowski et al., 2017) trained on Wikipedia with standard hyperparameters. These embeddings are used as input to construct the relation vectors r_wv[6] (see Section 3.1), which are in turn used to learn relational word embeddings e_w (see Section 3.2). The FastText vectors are additionally used as our baseline word embedding model.

Word pair vocabulary. As our core vocabulary V, we selected the 100,000 most frequent words from Wikipedia. To construct the set of word pairs R, for each word from V we selected the 100 most closely related words (cf. Section 3.1), considering only word pairs that co-occur at least 25 times in the same sentence throughout the Wikipedia corpus. This process yielded relation vectors for 974,250 word pairs.

Training. To learn our relational word embeddings we use the model described in Section 3.2. The embedding layer is initialized with the standard FastText 300-dimensional vectors trained on Wikipedia. The method was implemented in PyTorch, employing standard hyperparameters, using ReLU as the non-linear activation function f (Equation 3). The hidden layer of the model was fixed to the same dimensionality as the embedding layer (i.e. 600). The stopping criterion was decided based on a small development set, by setting aside 1% of the relation vectors. Code to reproduce our experiments, as well as pre-trained models and details of the implementation such as other network hyperparameters, are available at https://github.com/pedrada88/rwe.

[5] Tokenized and lowercased dump of January 2018.
[6] We based our implementation to learn relation vectors on the code available at https://github.com/pedrada88/relative

5 Experimental Results

A natural way to assess the quality of word vectors is to test them in lexical semantics tasks. However, it should be noted that relational word vectors behave differently from standard word vectors, and we should not expect the relational word vectors to be meaningful in unsupervised tasks such as semantic relatedness (Turney and Pantel, 2010). In particular, note that a high similarity between e_w and e_v should mean that relationships which hold for w have a high probability of holding for v as well. Words which are related, but not synonymous, may thus have very dissimilar relational word vectors. Therefore, we test our proposed models on a number of different supervised tasks for which accurately capturing relational information is crucial to improve performance.

Comparison systems. Standard FastText vectors, which were used to construct the relation vectors, are used as our main baseline. In addition, we also compare with the word embeddings that were learned by the Pair2Vec system[7] (see Section 2). We furthermore report the results of two methods which leverage knowledge bases to enrich FastText word embeddings: Retrofitting (Faruqui et al., 2015) and Attract-Repel (Mrkšić et al., 2017). Retrofitting exploits semantic relations from a knowledge base to re-arrange word vectors of related words such that they become closer to each other, whereas Attract-Repel makes use of different linguistic constraints to move word vectors closer together or further apart depending on the constraint. For Retrofitting we make use of WordNet (Fellbaum, 1998) as the input knowledge base, while for Attract-Repel we use the default configuration with all constraints from PPDB (Pavlick et al., 2015), WordNet and BabelNet (Navigli and Ponzetto, 2012). All comparison systems are 300-dimensional and trained on the same Wikipedia corpus.

[7] We used the pre-trained model of its official repository.

5.1 Relation Classification

Given a pre-defined set of relation types and a pair of words, the relation classification task consists in selecting the relation type that best describes the relationship between the two words. As test sets we used DiffVec (Vylomova et al., 2016) and BLESS[8] (Baroni and Lenci, 2011). The DiffVec dataset includes 12,458 word pairs, covering fifteen relation types including hypernymy, cause-purpose or verb-noun derivations. On the other hand, BLESS includes semantic relations such as hypernymy, meronymy, and co-hyponymy[9]. BLESS includes a train-test partition, with 13,258 and 6,629 word pairs, respectively. This task is treated as a multi-class classification problem.

[8] http://clic.cimec.unitn.it/distsem
[9] Note that both datasets exhibit overlap in a number of relations, as some instances from DiffVec were taken from BLESS.

                                                          DiffVec                          BLESS
Encoding     Model           Reference                    Acc.   F1     Prec.  Rec.       Acc.   F1     Prec.  Rec.
Mult+Avg     RWE             (This paper)                 85.3   64.2   65.1   64.5       94.3   92.8   93.0   92.6
             Pair2Vec        (Joshi et al., 2019)         85.0   64.0   65.0   64.5       91.2   89.3   88.9   89.7
             FastText        (Bojanowski et al., 2017)    84.2   61.4   62.6   61.9       92.8   90.4   90.7   90.2
             Retrofitting†   (Faruqui et al., 2015)       86.1*  64.6*  66.6*  64.5*      90.6   88.3   88.1   88.6
             Attract-Repel†  (Mrkšić et al., 2017)        86.0*  64.6*  66.0*  65.2*      91.2   89.0   88.8   89.3
Mult+Conc    Pair2Vec        (Joshi et al., 2019)         84.8   64.1   65.7   64.4       90.9   88.8   88.6   89.1
             FastText        (Bojanowski et al., 2017)    84.3   61.3   62.4   61.8       92.9   90.6   90.8   90.4
Diff (only)  FastText        (Bojanowski et al., 2017)    81.9   57.3   59.3   57.8       88.5   85.4   85.7   85.4

Table 1: Accuracy and macro-averaged F-Measure, precision and recall on BLESS and DiffVec. Models marked with † use external resources. The results with * indicate that WordNet was used for both the development of the model and the construction of the dataset. All models concatenate their encoded representations with the baseline vector difference of standard FastText word embeddings.
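As a concrete illustration of the pair encodings compared in Table 1 (described in the paragraphs below), the following sketch builds the Mult+Avg features, i.e. the FastText vector difference concatenated with the sum and component-wise product of the relational word vectors, and feeds them to a linear SVM. The variable names and the scikit-learn classifier are our own choices for the sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mult_avg_features(w1, w2, fasttext, rwe):
    """Mult+Avg encoding for a word pair (w1, w2):
    FastText difference, plus e_w1 + e_w2 and e_w1 ⊙ e_w2, concatenated."""
    diff = fasttext[w2] - fasttext[w1]     # standard vector-difference baseline
    add = rwe[w1] + rwe[w2]                # symmetric additive term
    mult = rwe[w1] * rwe[w2]               # component-wise multiplicative term
    return np.concatenate([diff, add, mult])

def train_relation_classifier(train_pairs, train_labels, fasttext, rwe):
    # Linear SVM over the pair encodings, as in the relation classification setup
    X = np.stack([mult_avg_features(a, b, fasttext, rwe) for a, b in train_pairs])
    return LinearSVC().fit(X, train_labels)
```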

As a baseline model (Diff), we consider the usual representation of word pairs in terms of their vector differences (Fu et al., 2014; Roller et al., 2014; Weeds et al., 2014), using FastText word embeddings. Since our goal is to show the complementarity of relational word embeddings with standard word vectors, for our method we concatenate the difference w_j - w_i with the vectors e_i + e_j and e_i ⊙ e_j (referred to as the Mult+Avg setting; our method is referred to as RWE). We use a similar representation for the other methods, simply replacing the relational word vectors by the corresponding vectors (but keeping the FastText vector difference). We also consider a variant in which the FastText vector difference is concatenated with w_i + w_j and w_i ⊙ w_j, which offers a more direct comparison with the other methods. This goes in line with recent work showing that adding complementary features on top of the vector differences, e.g. multiplicative features (Vu and Shwartz, 2018), helps improve performance. Finally, for completeness, we also include variants where the average e_i + e_j is replaced by the concatenation e_i ⊕ e_j (referred to as Mult+Conc), which is the encoding considered in Joshi et al. (2019).

For these experiments we train a linear SVM classifier directly on the word pair encoding, performing 10-fold cross-validation in the case of DiffVec, and using the train-test splits of BLESS.

Results. Table 1 shows the results of our relational word vectors, the standard FastText embeddings and the other baselines on the two relation classification datasets (i.e. BLESS and DiffVec). Our model consistently outperforms the FastText embeddings baseline and the comparison systems, with the only exception being the precision score for DiffVec. Despite being completely unsupervised, it is also surprising that our model manages to outperform the knowledge-enhanced embeddings of Retrofitting and Attract-Repel on the BLESS dataset. For DiffVec, let us recall that both these approaches have the unfair advantage of having had WordNet as source knowledge base, used both to construct the test set and to enhance the word embeddings. In general, the improvement of RWE over standard word embeddings suggests that our vectors capture relations in a way that is compatible with standard word vectors (which will be further discussed in Section 6.2).

5.2 Lexical Feature Modelling

Standard word embedding models tend to capture word similarity rather well (Baroni et al., 2014; Levy et al., 2015a). However, even though other kinds of lexical properties may also be encoded (Gupta et al., 2015), they are not explicitly modeled. Based on the hypothesis that relational word embeddings should allow us to model such properties in a more consistent and transparent fashion, we select the well-known McRae Feature Norms benchmark (McRae et al., 2005) as testbed. This dataset[10] is composed of 541 words (or concepts), each of them associated with one or more features. For example, 'a bear is an animal', or 'a bowl is round'. As for the specifics of our evaluation, given that some features are only associated with a few words, we follow the setting of Rubinstein et al. (2015) and consider the eight features with the largest number of associated words. We carry out this evaluation by treating the task as a multi-class classification problem, where the labels are the word features. As in the previous task, we use a linear SVM classifier and perform 3-fold cross-validation. For each input word, the word embedding of the corresponding model is fed to the classifier, concatenated with its baseline FastText embedding.

[10] Downloaded from https://sites.google.com/site/kenmcraelab/norms-data

Given that the McRae Feature Norms benchmark is focused on nouns, we complement this experiment with a specific evaluation on verbs. To this end, we use the verb set of QVEC[11] (Tsvetkov et al., 2015), a dataset specifically aimed at measuring the degree to which word vectors capture semantic properties, which has been shown to strongly correlate with performance in downstream tasks such as text categorization and sentiment analysis. QVEC was proposed as an intrinsic evaluation benchmark for estimating the quality of word vectors, and in particular whether (and how much) they predict lexical properties, such as words belonging to one of the fifteen verb supersenses contained in WordNet (Miller, 1995). As is customary in the literature, we compute Pearson correlation with respect to these predefined semantic properties, and measure how well a given set of word vectors is able to predict them, with higher being better. For this task we compare the 300-dimensional word embeddings of all models (without concatenating them with standard word embeddings), as the evaluation measure only assures a fair comparison for word embedding models of the same dimensionality.

[11] https://github.com/ytsvetko/qvec

                McRae Feature Norms                                                               QVEC
Model           Overall  metal  is small  is large  animal  is edible  wood  is round  is long
RWE             55.2     73.6   46.7      45.9      89.2    61.5       38.5  39.0      46.8      55.4
Pair2Vec        55.0     71.9   49.2      43.3      88.9    68.3       37.7  35.0      45.5      52.7
Retrofitting†   50.6     72.3   44.0      39.1      90.6    75.7       15.4  22.9      44.4      56.8*
Attract-Repel†  50.4     73.2   44.4      33.3      88.9    71.8       31.1  24.2      35.9      55.9*
FastText        54.6     72.7   48.4      45.2      87.5    63.2       33.3  39.0      47.8      54.6

Table 2: Results on the McRae Feature Norms dataset (macro F-Score) and QVEC (correlation score). Models marked with † use external resources. The results with * indicate that WordNet was used for both the development of the model and the construction of the dataset.

Results. Table 2 shows the results on the McRae Feature Norms dataset[12] and QVEC. In the case of the McRae Feature Norms dataset, our relational word embeddings achieve the best overall results, although there is some variation for the individual features. These results suggest that attributional information is encoded well in our relational word embeddings. Interestingly, our results also suggest that Retrofitting and Attract-Repel, which use pairs of related words during training, may be too naïve to capture the complex relationships proposed in these benchmarks. In fact, they perform considerably lower than the baseline FastText model. On the other hand, Pair2Vec, which we recall is the most similar to our model, yields slightly better results than the FastText baseline, but still worse than our relational word embedding model. This is especially remarkable considering the much lower computational cost of our model.

As far as the QVEC results are concerned, our method is only outperformed by Retrofitting and Attract-Repel. Nevertheless, the difference is minimal, which is surprising given that these methods leverage the same WordNet resource which is used for the evaluation.

[12] Both metal and wood correspond to made of relations.

6 Analysis

To complement the evaluation of our relational word vectors on lexical semantics tasks, in this section we provide a qualitative analysis of their intrinsic properties.

6.1 Word Embeddings: Nearest Neighbours

First, we provide an analysis based on the nearest neighbours of selected words in the vector space. Table 3 shows nearest neighbours of our relational word vectors and the standard FastText embeddings[13]. The table shows that our model captures some subtle properties which are not normally encoded in knowledge bases. For example, geometric shapes are clustered together around the sphere vector, unlike in FastText, where more loosely related words such as "dimension" are found. This trend can easily be observed as well in the philology and assassination cases.

[13] Recall from Section 4 that both were trained on Wikipedia with the same dimensionality, i.e., 300.

         SPHERE                    PHILOLOGY                   ASSASSINATION              DIVERSITY
         RWE        FastText       RWE          FastText       RWE          FastText      RWE            FastText
         rectangle  spheres        metaphysics  philological   riot         assassinate   connectedness  cultural diversity
         conic      spherical      pedagogy     philologist    premeditate  attempt       openness       diverse
         hexagon    dimension      docent       literature     bombing      attempts      creativity     genetic diversity

         INTERSECT                    BEHAVIOUR                   CAPABILITY                  EXECUTE
         RWE            FastText       RWE          FastText       RWE          FastText       RWE            FastText
         tracks         intersection   aggressive   behaviour      refueling    capabilities   murder         execution
         northbound     bisect         detrimental  behavioural    miniaturize  capable        interrogation  executed
         northwesterly  intersectional distasteful  misbehaviour   positioning  survivability  incarcerate    summarily executed

Table 3: Nearest neighbours for selected words in our relational word embeddings (RWE) and FastText embeddings.

In the bottom row, we show cases where relational information is somewhat confused with collocationality, leading to undesired clusters, such as intersect being close in the space to "tracks", or behaviour to "aggressive" or "detrimental". These examples thus point towards a clear direction for future work, in terms of explicitly differentiating collocations from other relationships.

6.2 Word Relation Encoding

The ability to capture analogies has proven to be one of the strongest selling points of word embedding research. Simple vector arithmetic, or pairwise similarities (Levy et al., 2014), can be used to capture a surprisingly high number of semantic and syntactic relations. We are thus interested in exploring semantic clusters as they emerge when encoding relations using our relational word vectors. Recall from Section 3.2 that relations are encoded using addition and point-wise multiplication of word vectors.

Table 4 shows, for a small number of selected word pairs, the top nearest neighbours that were unique to our 300-dimensional relational word vectors. Specifically, these pairs were not found among the top 50 nearest neighbours for the FastText word vectors of the same dimensionality, using the standard vector difference encoding. Similarly, we also show the top nearest neighbours that were unique to the FastText word vector difference encoding. As can be observed, our relational word embeddings can capture interesting relationships which go beyond what is purely captured by similarity. For instance, for the pair "innocent-naive" our model includes similar relations such as vain-selfish, honest-hearted or cruel-selfish as nearest neighbours, whereas the nearest neighbours of the standard FastText embeddings are harder to interpret.

Interestingly, even though not explicitly encoded in our model, the table shows some examples that highlight one property that arises often, which is the ability of our model to capture co-hyponyms as relations, e.g., wrist-knee and anger-despair as nearest neighbours of "shoulder-ankle" and "shock-grief", respectively. Finally, one last advantage that we highlight is the fact that our model seems to perform implicit disambiguation by balancing a word's meaning with its paired word. For example, the "oct-feb" relation vector correctly brings together other month abbreviations in our space, whereas in the FastText model its closest neighbour is 'doppler-wheels', a relation which is clearly related to another sense of oct, namely its use as an acronym for 'optical coherence tomography' (a type of x-ray procedure that uses the doppler effect principle).

6.3 Lexical Memorization

One of the main problems of word embedding models performing lexical inference (e.g. hypernymy) is lexical memorization. Levy et al. (2015b) found that the high performance of supervised distributional models in hypernymy detection tasks was due to a memorization, in the training set, of what they refer to as prototypical hypernyms. These prototypical hypernyms are general categories which are likely to be hypernyms (as they occur frequently in the training set) regardless of the hyponym. For instance, these models could equally predict the pairs dog-animal and screen-animal as hyponym-hypernym pairs. To measure the extent to which our model is prone to this problem, we perform a controlled experiment on the lexical split of the HyperLex dataset (Vulić et al., 2017). This lexical split does not contain any word overlap between training and test, and therefore constitutes a reliable setting to measure the generalization capability of embedding models in a controlled setting (Shwartz et al., 2016). In HyperLex, each pair is provided with a score which measures the strength of the hypernymy relation.

         INNOCENT-NAIVE                       POLES-SWEDES                                    SHOULDER-ANKLE
         RWE             FastText             RWE                   FastText                  RWE         FastText
         vain-selfish    murder-young         lithuanians-germans   polish-swedish            wrist-knee  oblique-ligament
         honest-hearted  imprisonment-term    germans-lithuanians   poland-sweden             thigh-knee  pick-ankle injury
         cruel-selfish   conspiracy-minded    russians-lithuanians  czechoslovakia-sweden     neck-knee   suffer-ankle injury

         SHOCK-GRIEF                          STRENGTHEN-TROPICAL CYCLONE                          OCT-FEB
         RWE             FastText             RWE                         FastText                 RWE      FastText
         anger-despair   overcome-sorrow      intensify-tropical cyclone  name-tropical cyclones   aug-nov  doppler-wheels
         anger-sorrow    overcome-despair     weaken-tropical storm       bias-tropical cyclones   sep-nov  scanner-read
         anger-sadness   moment-sadness       intensify-tropical storm    scheme-tropical cyclones nov-sep  ultrasound-baby

Table 4: Three nearest neighbours for selected word pairs using our relational word vectors' relation encoding (RWE) and the standard vector difference encoding of FastText word embeddings. In each column only the word pairs which were among the top 50 nearest neighbours of the given model but not of the other are listed. Relations which include one word from the original pair were not considered.
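For reference, neighbours like those in Table 4 can be retrieved by encoding every candidate pair with the symmetric relation encoding of Section 3.2 and ranking by similarity. The sketch below does this with cosine similarity over a brute-force candidate pool; both of these choices are our own assumptions for illustration, as the paper does not prescribe a specific retrieval procedure.

```python
import numpy as np

def pair_encoding(w, v, emb):
    # Symmetric relation encoding from Section 3.2: (e_w + e_v) ⊕ (e_w ⊙ e_v)
    return np.concatenate([emb[w] + emb[v], emb[w] * emb[v]])

def nearest_pairs(query_pair, candidate_pairs, emb, k=3):
    """Return the k candidate pairs whose encodings are closest (by cosine)
    to the encoding of query_pair."""
    q = pair_encoding(*query_pair, emb)
    q = q / np.linalg.norm(q)
    scored = []
    for pair in candidate_pairs:
        c = pair_encoding(*pair, emb)
        scored.append((float(q @ c) / np.linalg.norm(c), pair))
    return [pair for _, pair in sorted(scored, reverse=True)[:k]]
```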

For these experiments we considered the same experimental setting as described in Section 4. In this case we only considered the portion of the HyperLex training and test sets covered in our vocabulary[14], and used an SVM regression model over the word-based encoded representations. Table 5 shows the results for this experiment. Even though the results are low overall (noting e.g. that results for the random split are in some cases above 50%, as reported in the literature), our model can clearly generalize better than the other models. Interestingly, methods such as Retrofitting and Attract-Repel perform worse than the FastText vectors. This can be attributed to the fact that these models have been mainly tuned towards similarity, which is a feature that loses relevance in this setting. Likewise, the relation-based embeddings of Pair2Vec do not help, probably due to the high capacity of their model, which makes the word embeddings less informative.

[14] Recall from Section 4 that this vocabulary is shared by all comparison systems.

Encoding     Model           r      ρ
Mult+Avg     RWE             38.8   38.4
             Pair2Vec        28.3   26.5
             FastText        37.2   35.8
             Retrofitting†   29.5*  28.9*
             Attract-Repel†  29.7*  28.9*
Mult+Conc    Pair2Vec        29.8   30.0
             FastText        35.7   33.3
Diff (only)  FastText        29.9   30.1

Table 5: Pearson (r) and Spearman (ρ) correlation on a subset of the HyperLex lexical split. Models marked with † use external resources. All models concatenate their encoded representations with the baseline vector difference of standard FastText word embeddings.

7 Conclusions

We have introduced the notion of relational word vectors, and presented an unsupervised method for learning such representations. Parting ways with previous approaches, where relational information was either encoded in terms of relation vectors (which are highly expressive but can be more difficult to use in applications), represented by transforming standard word vectors (which capture relational information only in a limited way), or obtained by taking advantage of external knowledge repositories, we proposed to learn an unsupervised word embedding model that is tailored specifically towards modelling relations. Our model is intended to capture knowledge which is complementary to that of standard similarity-centric embeddings, and can thus be used in combination with them.

We tested the complementarity of our relational word vectors with standard FastText word embeddings on several lexical semantic tasks capturing different levels of relational knowledge. The evaluation indicates that our proposed method indeed results in representations that capture relational knowledge in a more nuanced way. For future work, we would be interested in further exploring the behavior of neural architectures for NLP tasks which would intuitively benefit from having access to relational information, e.g., text classification (Espinosa Anke and Schockaert, 2018; Camacho-Collados et al., 2019) and other language understanding tasks such as natural language inference or reading comprehension, in the line of Joshi et al. (2019).

Acknowledgments

Jose Camacho-Collados and Steven Schockaert were supported by ERC Starting Grant 637277.

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS Workshop, pages 1–10.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5.

Zied Bouraoui, Shoaib Jameel, and Steven Schockaert. 2018. Relation induction in word embeddings revisited. In Proceedings of COLING, pages 1627–1637.

Jose Camacho-Collados, Luis Espinosa-Anke, Shoaib Jameel, and Steven Schockaert. 2019. A latent variable model for learning distributional relation vectors. In Proceedings of IJCAI.

Asli Celikyilmaz, Dilek Hakkani-Tür, Panupong Pasupat, and Ruhi Sarikaya. 2015. Enriching word embeddings using knowledge graph for semantic tagging in conversational dialog systems. In AAAI Spring Symposium.

Jiaqiang Chen, Niket Tandon, Charles Darwis Hariman, and Gerard de Melo. 2016. Webbrain: Joint neural learning of large-scale commonsense knowledge. In Proceedings of ISWC, pages 102–118.

Luis Espinosa Anke and Steven Schockaert. 2018. SeVeN: Augmenting word embeddings with unsupervised relation vectors. In Proceedings of COLING, pages 2653–2665.

Miao Fan, Kai Cao, Yifan He, and Ralph Grishman. 2015. Jointly embedding relations and mentions for knowledge population. In Proceedings of RANLP, pages 186–191.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL, pages 1606–1615.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Database. MIT Press, Cambridge, MA.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1199–1209.

Abhijeet Gupta, Gemma Boleda, Marco Baroni, and Sebastian Padó. 2015. Distributional vectors encode referential attributes. In Proceedings of EMNLP, pages 12–21.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2015. Task-oriented learning of word embeddings for semantic relation classification. In Proceedings of CoNLL, pages 268–278.

Shoaib Jameel, Zied Bouraoui, and Steven Schockaert. 2018. Unsupervised learning of distributional relation vectors. In Proceedings of ACL, pages 23–33.

Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In Proceedings of NAACL.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015a. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Omer Levy, Yoav Goldberg, and Israel Ramat-Gan. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL, pages 171–180.

Omer Levy, Steffen Remus, Chris Biemann, Ido Dagan, and Israel Ramat-Gan. 2015b. Do supervised distributional methods really learn lexical inference relations? In Proceedings of NAACL.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL, pages 1501–1511.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association of Computational Linguistics, 5(1):309–324.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Kim Anh Nguyen, Maximilian Köper, Sabine Schulte im Walde, and Ngoc Thang Vu. 2017. Hierarchical embeddings for hypernymy detection and directionality. In Proceedings of EMNLP, pages 233–243.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of NAACL-HLT, pages 984–989.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Anna Rogers, Aleksandr Drozd, and Bofang Li. 2017. The (too many) problems of analogical reasoning with word vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 135–148.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING, pages 1025–1036.

Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. 2015. How well do distributional models capture different types of semantic knowledge? In Proceedings of ACL, pages 726–730.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2389–2398.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of AAAI, pages 4444–4451.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of EMNLP, pages 2049–2054.

Peter D. Turney. 2005. Measuring semantic similarity by latent relational analysis. In Proceedings of IJCAI, pages 1136–1141.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Tu Vu and Vered Shwartz. 2018. Integrating multiplicative features into supervised distributional methods for lexical entailment. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 160–166.

Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017. HyperLex: A large-scale evaluation of graded lexical entailment. Computational Linguistics, 43(4):781–835.

Ivan Vulić and Nikola Mrkšić. 2018. Specialising word vectors for lexical entailment. In Proceedings of NAACL-HLT, pages 1134–1145.

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In Proceedings of ACL, pages 1671–1682.

Koki Washio and Tsuneaki Kato. 2018a. Filling missing paths: Modeling co-occurrences of word pairs and dependency paths for recognizing lexical semantic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1123–1133.

Koki Washio and Tsuneaki Kato. 2018b. Neural latent relational analysis to capture lexical semantic relations in a vector space. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 594–600.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of the 25th International Conference on Computational Linguistics, pages 2249–2259.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. In Proceedings of EMNLP, pages 1366–1371.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proceedings of ICLR.

C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, and T.-Y. Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM, pages 1219–1228.