Relational Word Embeddings

Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert
School of Computer Science and Informatics, Cardiff University, United Kingdom
{camachocolladosj,espinosa-ankel,schockaerts1}@cardiff.ac.uk

Abstract

While word embeddings have been shown to implicitly encode various forms of attributional knowledge, the extent to which they capture relational information is far more limited. In previous work, this limitation has been addressed by incorporating relational knowledge from external knowledge bases when learning the word embedding. Such strategies may not be optimal, however, as they are limited by the coverage of available resources and conflate similarity with other forms of relatedness. As an alternative, in this paper we propose to encode relational knowledge in a separate word embedding, which is aimed to be complementary to a given standard word embedding. This relational word embedding is still learned from co-occurrence statistics, and can thus be used even when no external knowledge base is available. Our analysis shows that relational word vectors do indeed capture information that is complementary to what is encoded in standard word embeddings.

1 Introduction

Word embeddings are paramount to the success of current natural language processing (NLP) methods. Apart from the fact that they provide a convenient mechanism for encoding textual information in neural network models, their importance mainly stems from the remarkable amount of linguistic and semantic information that they capture. For instance, the vector representation of the word Paris implicitly encodes that this word is a noun, and more specifically a capital city, and that it describes a location in France. This information arises because word embeddings are learned from co-occurrence counts, and properties such as being a capital city are reflected in such statistics. However, the extent to which relational knowledge (e.g. Trump was the successor of Obama) can be learned in this way is limited.

Previous work has addressed this by incorporating external knowledge graphs (Xu et al., 2014; Celikyilmaz et al., 2015) or relations extracted from text (Chen et al., 2016). However, the success of such approaches depends on the amount of available relational knowledge. Moreover, they only consider well-defined discrete relation types (e.g. is the capital of, or is a part of), whereas the appeal of vector space representations largely comes from their ability to capture subtle aspects of meaning that go beyond what can be expressed symbolically. For instance, the relationship between popcorn and cinema is intuitively clear, but it is more subtle than the assertion that "popcorn is located at cinema", which is how ConceptNet (Speer et al., 2017), for example, encodes this relationship[1].

In fact, regardless of how a word embedding is learned, if its primary aim is to capture similarity, there are inherent limitations on the kinds of relations it can capture. For instance, such word embeddings can only encode similarity-preserving relations (i.e. similar entities have to be related to similar entities), and it is often difficult to encode that w is in a particular relationship while preventing the inference that words with similar vectors to w are also in this relationship; e.g. Bouraoui et al. (2018) found that both (Berlin, Germany) and (Moscow, Germany) were predicted to be instances of the capital-of relation due to the similarity of the word vectors for Berlin and Moscow. Furthermore, while the ability to capture word analogies (e.g. king - man + woman ≈ queen) emerged as a successful illustration of how word embeddings can encode some types of relational information (Mikolov et al., 2013b), the generalization of this interesting property has proven to be less successful than initially anticipated (Levy et al., 2014; Linzen, 2016; Rogers et al., 2017).

[1] http://conceptnet.io/c/en/popcorn

This suggests that relational information has to be encoded separately from standard similarity-centric word embeddings. One appealing strategy is to represent relational information by learning, for each pair of related words, a vector that encodes how the words are related. This strategy was first adopted by Turney (2005), and has recently been revisited by a number of authors (Washio and Kato, 2018a; Jameel et al., 2018; Espinosa Anke and Schockaert, 2018; Washio and Kato, 2018b; Joshi et al., 2019). However, in many applications, word vectors are easier to deal with than vector representations of word pairs.

The research question we consider in this paper is whether it is possible to learn word vectors that capture relational information. Our aim is for such relational word vectors to be complementary to standard word vectors. To make relational information available to NLP models, it then suffices to use a standard architecture and replace normal word vectors by concatenations of standard and relational word vectors. In particular, we show that such relational word vectors can be learned directly from a given set of relation vectors.

2 Related Work

Relation Vectors. A number of approaches have been proposed that are aimed at learning relation vectors for a given set of word pairs (a, b), based on sentences in which these word pairs co-occur. For instance, Turney (2005) introduced a method called Latent Relational Analysis (LRA), which relies on first identifying a set of sufficiently frequent lexical patterns and then constructs a matrix which encodes, for each considered word pair (a, b), how frequently each pattern P appears in between a and b in sentences that contain both words. Relation vectors are then obtained using singular value decomposition. More recently, Jameel et al. (2018) proposed an approach inspired by the GloVe word embedding model (Pennington et al., 2014) to learn relation vectors based on co-occurrence statistics between the target word pair (a, b) and other words. Along similar lines, Espinosa Anke and Schockaert (2018) learn relation vectors based on the distribution of words occurring in sentences that contain a and b, by averaging the word vectors of these co-occurring words. Then, a conditional autoencoder is used to obtain lower-dimensional relation vectors.

Taking a slightly different approach, Washio and Kato (2018a) train a neural network to predict dependency paths from a given word pair. Their approach uses standard word vectors as input, hence relational information is encoded implicitly in the weights of the neural network, rather than as relation vectors (although the output of this neural network, for a given word pair, can still be seen as a relation vector). An advantage of this approach, compared to methods that explicitly construct relation vectors, is that evidence obtained for one word is essentially shared with similar words (i.e. words whose standard word vector is similar). Among others, this means that their approach can in principle model relational knowledge for word pairs that never co-occur in the same sentence. A related approach, presented in (Washio and Kato, 2018b), uses lexical patterns, as in the LRA method, and trains a neural network to predict vector encodings of these patterns from two given word vectors. In this case, the word vectors are updated together with the neural network and an LSTM to encode the patterns. Finally, a similar approach is taken by the Pair2Vec method proposed in (Joshi et al., 2019), where the focus is on learning relation vectors that can be used for cross-sentence attention mechanisms in tasks such as question answering and natural language inference.

Despite the fact that such methods learn word vectors from which relation vectors can be predicted, it is unclear to what extent these word vectors themselves capture relational knowledge. In particular, the aforementioned methods have thus far only been evaluated in settings that rely on the predicted relation vectors. Since these predictions are made by relatively sophisticated neural network architectures, it is possible that most of the relational knowledge is still captured in the weights of these networks, rather than in the word vectors. Another problem with these existing approaches is that they are computationally very expensive to train; e.g. the Pair2Vec model is reported to need 7-10 days of training on unspecified hardware[2]. In contrast, the approach we propose in this paper is computationally much simpler, while resulting in relational word vectors that encode relational information more accurately than those of the Pair2Vec model in lexical semantics tasks, as we will see in Section 5.

[2] github.com/mandarjoshi90/pair2vec

Knowledge-Enhanced Word Embeddings. Several authors have tried to improve word embeddings by incorporating external knowledge bases. For example, some authors have proposed models which combine the loss function of a word embedding model, to ensure that word vectors are predictive of their context words, with the loss function of a knowledge graph embedding model, to encourage the word vectors to additionally be predictive of a given set of relational facts (Xu et al., 2014; Celikyilmaz et al., 2015; Chen et al., 2016). Other authors have used knowledge bases in a more restricted way, by taking the fact that two words are linked to each other in a given knowledge graph as evidence that their word vectors should be similar (Faruqui et al., 2015; Speer et al., 2017). Finally, there has also been work that uses lexicons to learn word embeddings which are specialized towards certain types of lexical knowledge, such as hypernymy (Nguyen et al., 2017; Vulic and Mrksic, 2018), antonymy (Liu et al., 2015; Ono et al., 2015) or a combination of various linguistic constraints (Mrkšić et al., 2017).

Our method differs in two important ways from these existing approaches. First, rather than relying on an external knowledge base, or other forms of supervision, as in e.g. (Chen et al., 2016), our method is completely unsupervised, as our only input consists of a text corpus. Second, whereas existing work has focused on methods for improving word embeddings, our aim is to learn vector representations that are complementary to standard word embeddings.

3 Model Description

We aim to learn representations that are complementary to standard word vectors and are specialized towards relational knowledge. To differentiate them from standard word vectors, they will be referred to as relational word vectors. We write e_w for the relational word vector representation of w. The main idea of our method is to first learn, for each pair of closely related words w and v, a relation vector r_wv that captures how these words are related, which we discuss in Section 3.1. In Section 3.2 we then explain how we learn relational word vectors from these relation vectors.

3.1 Unsupervised Relation Vector Learning

Our goal here is to learn relation vectors for closely related words. For both the selection of the vocabulary and the method to learn relation vectors, we mainly follow the initialization method of Camacho-Collados et al. (2019, RELATIVE_init), except for an important difference explained below regarding the symmetry of the relations. Other relation embedding methods could be used as well, e.g., (Jameel et al., 2018; Washio and Kato, 2018b; Espinosa Anke and Schockaert, 2018; Joshi et al., 2019), but this method has the advantage of being highly efficient. In the following we describe this procedure for learning relation vectors: we first explain how a set of potentially related word pairs is selected, and then focus on how relation vectors r_wv for these word pairs can be learned.

Selecting Related Word Pairs. Starting from a vocabulary V containing the words of interest (e.g. all sufficiently frequent words), as a first step we need to choose a set R ⊆ V × V of potentially related words. For each of the word pairs in R we will then learn a relation vector, as explained below. To select this set R, we only consider word pairs that co-occur in the same sentence in a given reference corpus. For all such word pairs, we then compute their strength of relatedness following Levy et al. (2015a) by using a smoothed version of pointwise mutual information (PMI), where we use 0.5 as exponent factor. In particular, for each word w ∈ V, the set R contains all sufficiently frequently co-occurring pairs (w, v) for which v is within the top-100 most closely related words to w, according to the following score:

    $\mathrm{PMI}_{0.5}(w, v) = \log\left(\frac{n_{wv} \cdot s_{**}}{n_{w*} \cdot s_{v*}}\right)$    (1)

where n_wv is the harmonically weighted[3] number of times the words w and v occur in the same sentence within a distance of at most 10 words, and:

    $n_{w*} = \sum_{u \in V} n_{wu}, \quad s_{v*} = n_{v*}^{0.5}, \quad s_{**} = \sum_{u \in V} s_{u*}$

This smoothed variant of PMI has the advantage of being less biased towards infrequent (and thus typically less informative) words.

[3] A co-occurrence in which there are k words in between w and v receives a weight of 1/(k+1).
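The selection step only needs co-occurrence counts and Equation (1). The sketch below illustrates it in Python under the assumption that the harmonically weighted counts have already been collected from the corpus; the data layout, the function name and the use of the weighted counts for the frequency threshold are our own choices for the sketch, not taken from the released implementation.

```python
import math

def select_related_pairs(cooc, top_k=100, min_count=25, alpha=0.5):
    """Build the set R of potentially related word pairs (Section 3.1).

    cooc: dict mapping a word w to a dict {v: n_wv}, where n_wv is the
    harmonically weighted co-occurrence count of w and v within a
    10-word window (a gap of k words contributes 1/(k+1)).
    """
    n_w = {w: sum(neighbours.values()) for w, neighbours in cooc.items()}  # n_{w*}
    s_v = {v: n ** alpha for v, n in n_w.items()}                          # s_{v*} = n_{v*}^0.5
    s_all = sum(s_v.values())                                              # s_{**}

    def pmi(w, v, n_wv):
        # Smoothed PMI of Equation (1)
        return math.log((n_wv * s_all) / (n_w[w] * s_v[v]))

    pairs = set()
    for w, neighbours in cooc.items():
        scored = [(pmi(w, v, c), v) for v, c in neighbours.items()
                  if v in s_v and c >= min_count]
        # keep the top-100 most closely related words per target word
        for _, v in sorted(scored, reverse=True)[:top_k]:
            pairs.add((w, v))
    return pairs
```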

Learning Relation Vectors. In this paper, we will rely on word vector averaging for learning relation vectors, which has the advantage of being much faster than other existing approaches, and thus allows us to consider a higher number of word pairs (or a larger corpus) within a fixed time budget. Word vector averaging has moreover proven surprisingly effective for learning relation vectors (Weston et al., 2013; Hashimoto et al., 2015; Fan et al., 2015; Espinosa Anke and Schockaert, 2018), as well as in related tasks such as sentence embedding (Wieting et al., 2016).

Specifically, to construct the relation vector r_wv capturing the relationship between the words w and v we proceed as follows. First, we compute a bag-of-words representation {(w_1, f_1), ..., (w_n, f_n)}, where f_i is the number of times the word w_i occurs in between the words w and v in any given sentence in the corpus. The relation vector r_wv is then essentially computed as a weighted average:

    $r_{wv} = \mathrm{norm}\Big(\sum_{i=1}^{n} f_i \cdot w_i\Big)$    (2)

where we write w_i for the vector representation of w_i in some given pre-trained word embedding, and $\mathrm{norm}(v) = \frac{v}{\|v\|}$.

In contrast to other approaches, we do not differentiate between sentences where w occurs before v and sentences where v occurs before w. This means that our relation vectors are symmetric, in the sense that r_wv = r_vw. This has the advantage of alleviating sparsity issues. While the directionality of many relations is important, the direction can often be recovered from other information we have about the words w and v. For instance, knowing that w and v are in a capital-of relationship, it is trivial to derive that "v is the capital of w", rather than the other way around, if we also know that w is a country.
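A minimal sketch of Equation (2), assuming the in-between word counts f_i have already been extracted for a given pair and that pre-trained word vectors are available as NumPy arrays; the helper name is illustrative rather than the paper's code.

```python
import numpy as np

def relation_vector(between_counts, word_vectors):
    """Relation vector r_wv of Equation (2): a normalised weighted average
    of the vectors of the words seen between w and v (counted in either
    order, so the result is symmetric: r_wv = r_vw).

    between_counts: dict {word w_i: count f_i}
    word_vectors:   dict {word: np.ndarray} of pre-trained vectors
    """
    dim = len(next(iter(word_vectors.values())))
    acc = np.zeros(dim)
    for wi, fi in between_counts.items():
        if wi in word_vectors:
            acc += fi * word_vectors[wi]
    norm = np.linalg.norm(acc)
    return acc / norm if norm > 0 else acc
```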

3.2 Learning Relational Word Vectors

The relation vectors r_wv capture relational information about the word pairs in R. The relational word vectors will be induced from these relation vectors by encoding the requirement that e_w and e_v should be predictive of r_wv, for each (w, v) ∈ R. To this end, we use a simple neural network with one hidden layer[4], whose input is given by (e_w + e_v) ⊕ (e_w ⊙ e_v), where we write ⊕ for vector concatenation and ⊙ for component-wise multiplication. Note that the input needs to be symmetric, given that our relation vectors are symmetric, which makes vector addition and component-wise multiplication two straightforward encoding choices. Figure 1 depicts an overview of the architecture of our model.

Figure 1: Relational word embedding architecture. At the bottom of the figure, the input layer is constructed from the relational word embeddings e_w and e_v, which are the vectors to be learnt. As shown at the top, we aim to predict the target relation vector r_wv.

The network is defined as follows:

    $i_{wv} = (e_w + e_v) \oplus (e_w \odot e_v)$
    $h_{wv} = f(X i_{wv} + a)$    (3)
    $o_{wv} = f(Y h_{wv} + b)$

for some activation function f. We train this network to predict the relation vector r_wv, by minimizing the following loss:

    $L = \sum_{(w,v) \in R} \| o_{wv} - r_{wv} \|^2$    (4)

The relational word vectors e_w can be initialized using standard word embeddings trained on the same corpus.

[4] More complex architectures could be used, e.g., (Joshi et al., 2019), but in this case we decided to use a simple architecture, as the main aim of this paper is to encode all relational information into the word vectors, not in the network itself.
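The model of Equations (3) and (4) is small enough to write down directly. The sketch below is a PyTorch rendering under the choices stated in the paper (ReLU as f, a 600-dimensional hidden layer, embeddings initialised from 300-dimensional vectors trained on the same corpus); the class name, batching and training-loop details are our own assumptions, not the released code.

```python
import torch
import torch.nn as nn

class RelationalWordEmbedding(nn.Module):
    def __init__(self, pretrained, rel_dim=300, hidden_dim=600):
        super().__init__()
        # e_w: trainable, initialised with standard word vectors (e.g. FastText)
        self.emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
        in_dim = 2 * pretrained.size(1)               # (e_w + e_v) ⊕ (e_w ⊙ e_v)
        self.hidden = nn.Linear(in_dim, hidden_dim)   # X, a in Eq. (3)
        self.out = nn.Linear(hidden_dim, rel_dim)     # Y, b in Eq. (3)
        self.f = nn.ReLU()                            # activation f

    def forward(self, w_idx, v_idx):
        e_w, e_v = self.emb(w_idx), self.emb(v_idx)
        i_wv = torch.cat([e_w + e_v, e_w * e_v], dim=-1)     # symmetric input
        return self.f(self.out(self.f(self.hidden(i_wv))))  # o_wv

def train_step(model, optimizer, w_idx, v_idx, r_wv):
    # One gradient step on the squared-error loss of Eq. (4)
    optimizer.zero_grad()
    loss = ((model(w_idx, v_idx) - r_wv) ** 2).sum()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, the updated embedding matrix provides the relational word vectors e_w used in the experiments below; the network weights themselves are not needed further.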
4 Experimental Setting

In what follows, we detail the resources and training details that we used to obtain the relational word vectors.

Corpus and Word Embeddings. We followed the setting of Joshi et al. (2019) and used the English Wikipedia[5] as input corpus. Multiwords (e.g. Manchester United) were grouped together as a single token by following the same approach described in Mikolov et al. (2013a). As word embeddings, we used 300-dimensional FastText vectors (Bojanowski et al., 2017) trained on Wikipedia with standard hyperparameters. These embeddings are used as input to construct the relation vectors r_wv[6] (see Section 3.1), which are in turn used to learn relational word embeddings e_w (see Section 3.2). The FastText vectors are additionally used as our baseline word embedding model.

Word pair vocabulary. As our core vocabulary V, we selected the 100,000 most frequent words from Wikipedia. To construct the set of word pairs R, for each word from V we selected the 100 most closely related words (cf. Section 3.1), considering only word pairs that co-occur at least 25 times in the same sentence throughout the Wikipedia corpus. This process yielded relation vectors for 974,250 word pairs.

Training. To learn our relational word embeddings we use the model described in Section 3.2. The embedding layer is initialized with the standard FastText 300-dimensional vectors trained on Wikipedia. The method was implemented in PyTorch, employing standard hyperparameters, using ReLU as the non-linear activation function f (Equation 3). The hidden layer of the model was fixed to the same dimensionality as the embedding layer (i.e. 600). The stopping criterion was decided based on a small development set, by setting aside 1% of the relation vectors. Code to reproduce our experiments, as well as pre-trained models and details of the implementation such as other network hyperparameters, are available at https://github.com/pedrada88/rwe.

[5] Tokenized and lowercased dump of January 2018.
[6] We based our implementation to learn relation vectors on the code available at https://github.com/pedrada88/relative

5 Experimental Results

A natural way to assess the quality of word vectors is to test them in lexical semantics tasks. However, it should be noted that relational word vectors behave differently from standard word vectors, and we should not expect the relational word vectors to be meaningful in unsupervised tasks such as semantic relatedness (Turney and Pantel, 2010). In particular, note that a high similarity between e_w and e_v should mean that relationships which hold for w have a high probability of holding for v as well. Words which are related, but not synonymous, may thus have very dissimilar relational word vectors. Therefore, we test our proposed models on a number of different supervised tasks for which accurately capturing relational information is crucial to improve performance.

Comparison systems. Standard FastText vectors, which were used to construct the relation vectors, are used as our main baseline. In addition, we also compare with the word embeddings that were learned by the Pair2Vec system[7] (see Section 2). We furthermore report the results of two methods which leverage knowledge bases to enrich FastText word embeddings: Retrofitting (Faruqui et al., 2015) and Attract-Repel (Mrkšić et al., 2017). Retrofitting exploits semantic relations from a knowledge base to re-arrange word vectors of related words such that they become closer to each other, whereas Attract-Repel makes use of different linguistic constraints to move word vectors closer together or further apart depending on the constraint. For Retrofitting we make use of WordNet (Fellbaum, 1998) as the input knowledge base, while for Attract-Repel we use the default configuration with all constraints from PPDB (Pavlick et al., 2015), WordNet and BabelNet (Navigli and Ponzetto, 2012). All comparison systems are 300-dimensional and trained on the same Wikipedia corpus.

[7] We used the pre-trained model of its official repository.

5.1 Relation Classification

Given a pre-defined set of relation types and a pair of words, the relation classification task consists in selecting the relation type that best describes the relationship between the two words. As test sets we used DiffVec (Vylomova et al., 2016) and BLESS[8] (Baroni and Lenci, 2011). The DiffVec dataset includes 12,458 word pairs, covering fifteen relation types including hypernymy, cause-purpose or verb-noun derivations. On the other hand, BLESS includes semantic relations such as hypernymy, meronymy, and co-hyponymy[9]. BLESS includes a train-test partition, with 13,258 and 6,629 word pairs, respectively. This task is treated as a multi-class classification problem.

[8] http://clic.cimec.unitn.it/distsem
[9] Note that both datasets exhibit overlap in a number of relations, as some instances from DiffVec were taken from BLESS.

                                                          DiffVec                          BLESS
Encoding     Model           Reference                    Acc.   F1     Prec.  Rec.       Acc.   F1     Prec.  Rec.
Mult+Avg     RWE             (This paper)                 85.3   64.2   65.1   64.5       94.3   92.8   93.0   92.6
             Pair2Vec        (Joshi et al., 2019)         85.0   64.0   65.0   64.5       91.2   89.3   88.9   89.7
             FastText        (Bojanowski et al., 2017)    84.2   61.4   62.6   61.9       92.8   90.4   90.7   90.2
             Retrofitting†   (Faruqui et al., 2015)       86.1*  64.6*  66.6*  64.5*      90.6   88.3   88.1   88.6
             Attract-Repel†  (Mrkšić et al., 2017)        86.0*  64.6*  66.0*  65.2*      91.2   89.0   88.8   89.3
Mult+Conc    Pair2Vec        (Joshi et al., 2019)         84.8   64.1   65.7   64.4       90.9   88.8   88.6   89.1
             FastText        (Bojanowski et al., 2017)    84.3   61.3   62.4   61.8       92.9   90.6   90.8   90.4
Diff (only)  FastText        (Bojanowski et al., 2017)    81.9   57.3   59.3   57.8       88.5   85.4   85.7   85.4

Table 1: Accuracy and macro-averaged F-Measure, precision and recall on BLESS and DiffVec. Models marked with † use external resources. The results with * indicate that WordNet was used for both the development of the model and the construction of the dataset. All models concatenate their encoded representations with the baseline vector difference of standard FastText word embeddings.
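As a concrete illustration of the pair encodings compared in Table 1 (described in the paragraphs below), the following sketch builds the Mult+Avg features, i.e. the FastText vector difference concatenated with the sum and component-wise product of the relational word vectors, and feeds them to a linear SVM. The variable names and the scikit-learn classifier are our own choices for the sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mult_avg_features(w1, w2, fasttext, rwe):
    """Mult+Avg encoding for a word pair (w1, w2):
    FastText difference, plus e_w1 + e_w2 and e_w1 ⊙ e_w2, concatenated."""
    diff = fasttext[w2] - fasttext[w1]     # standard vector-difference baseline
    add = rwe[w1] + rwe[w2]                # symmetric additive term
    mult = rwe[w1] * rwe[w2]               # component-wise multiplicative term
    return np.concatenate([diff, add, mult])

def train_relation_classifier(train_pairs, train_labels, fasttext, rwe):
    # Linear SVM over the pair encodings, as in the relation classification setup
    X = np.stack([mult_avg_features(a, b, fasttext, rwe) for a, b in train_pairs])
    return LinearSVC().fit(X, train_labels)
```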

As a baseline model (Diff), we consider the usual representation of word pairs in terms of their vector differences (Fu et al., 2014; Roller et al., 2014; Weeds et al., 2014), using FastText word embeddings. Since our goal is to show the complementarity of relational word embeddings with standard word vectors, for our method we concatenate the difference w_j - w_i with the vectors e_i + e_j and e_i ⊙ e_j (referred to as the Mult+Avg setting; our method is referred to as RWE). We use a similar representation for the other methods, simply replacing the relational word vectors by the corresponding vectors (but keeping the FastText vector difference). We also consider a variant in which the FastText vector difference is concatenated with w_i + w_j and w_i ⊙ w_j, which offers a more direct comparison with the other methods. This goes in line with recent work showing that adding complementary features on top of the vector differences, e.g. multiplicative features (Vu and Shwartz, 2018), helps improve performance. Finally, for completeness, we also include variants where the average e_i + e_j is replaced by the concatenation e_i ⊕ e_j (referred to as Mult+Conc), which is the encoding considered in Joshi et al. (2019).

For these experiments we train a linear SVM classifier directly on the word pair encoding, performing 10-fold cross-validation in the case of DiffVec, and using the train-test splits of BLESS.

Results. Table 1 shows the results of our relational word vectors, the standard FastText embeddings and the other baselines on the two relation classification datasets (i.e. BLESS and DiffVec). Our model consistently outperforms the FastText embeddings baseline and the comparison systems, with the only exception being the precision score for DiffVec. Despite being completely unsupervised, it is also surprising that our model manages to outperform the knowledge-enhanced embeddings of Retrofitting and Attract-Repel on the BLESS dataset. For DiffVec, let us recall that both these approaches have the unfair advantage of having had WordNet as source knowledge base, used both to construct the test set and to enhance the word embeddings. In general, the improvement of RWE over standard word embeddings suggests that our vectors capture relations in a way that is compatible with standard word vectors (which will be further discussed in Section 6.2).

5.2 Lexical Feature Modelling

Standard word embedding models tend to capture word similarity rather well (Baroni et al., 2014; Levy et al., 2015a). However, even though other kinds of lexical properties may also be encoded (Gupta et al., 2015), they are not explicitly modeled. Based on the hypothesis that relational word embeddings should allow us to model such properties in a more consistent and transparent fashion, we select the well-known McRae Feature Norms benchmark (McRae et al., 2005) as testbed. This dataset[10] is composed of 541 words (or concepts), each of them associated with one or more features. For example, 'a bear is an animal', or 'a bowl is round'. As for the specifics of our evaluation, given that some features are only associated with a few words, we follow the setting of Rubinstein et al. (2015) and consider the eight features with the largest number of associated words. We carry out this evaluation by treating the task as a multi-class classification problem, where the labels are the word features. As in the previous task, we use a linear SVM classifier and perform 3-fold cross-validation. For each input word, the word embedding of the corresponding model is fed to the classifier, concatenated with its baseline FastText embedding.

[10] Downloaded from https://sites.google.com/site/kenmcraelab/norms-data

Given that the McRae Feature Norms benchmark is focused on nouns, we complement this experiment with a specific evaluation on verbs. To this end, we use the verb set of QVEC[11] (Tsvetkov et al., 2015), a dataset specifically aimed at measuring the degree to which word vectors capture semantic properties, which has been shown to strongly correlate with performance in downstream tasks such as text categorization and sentiment analysis. QVEC was proposed as an intrinsic evaluation benchmark for estimating the quality of word vectors, and in particular whether (and how much) they predict lexical properties, such as words belonging to one of the fifteen verb supersenses contained in WordNet (Miller, 1995). As is customary in the literature, we compute Pearson correlation with respect to these predefined semantic properties, and measure how well a given set of word vectors is able to predict them, with higher being better. For this task we compare the 300-dimensional word embeddings of all models (without concatenating them with standard word embeddings), as the evaluation measure only assures a fair comparison for word embedding models of the same dimensionality.

[11] https://github.com/ytsvetko/qvec

                McRae Feature Norms                                                               QVEC
Model           Overall  metal  is small  is large  animal  is edible  wood  is round  is long
RWE             55.2     73.6   46.7      45.9      89.2    61.5       38.5  39.0      46.8      55.4
Pair2Vec        55.0     71.9   49.2      43.3      88.9    68.3       37.7  35.0      45.5      52.7
Retrofitting†   50.6     72.3   44.0      39.1      90.6    75.7       15.4  22.9      44.4      56.8*
Attract-Repel†  50.4     73.2   44.4      33.3      88.9    71.8       31.1  24.2      35.9      55.9*
FastText        54.6     72.7   48.4      45.2      87.5    63.2       33.3  39.0      47.8      54.6

Table 2: Results on the McRae Feature Norms dataset (macro F-Score) and QVEC (correlation score). Models marked with † use external resources. The results with * indicate that WordNet was used for both the development of the model and the construction of the dataset.

Results. Table 2 shows the results on the McRae Feature Norms dataset[12] and QVEC. In the case of the McRae Feature Norms dataset, our relational word embeddings achieve the best overall results, although there is some variation for the individual features. These results suggest that attributional information is encoded well in our relational word embeddings. Interestingly, our results also suggest that Retrofitting and Attract-Repel, which use pairs of related words during training, may be too naïve to capture the complex relationships proposed in these benchmarks. In fact, they perform considerably lower than the baseline FastText model. On the other hand, Pair2Vec, which we recall is the most similar to our model, yields slightly better results than the FastText baseline, but still worse than our relational word embedding model. This is especially remarkable considering the much lower computational cost of our model.

As far as the QVEC results are concerned, our method is only outperformed by Retrofitting and Attract-Repel. Nevertheless, the difference is minimal, which is surprising given that these methods leverage the same WordNet resource which is used for the evaluation.

[12] Both metal and wood correspond to made of relations.

6 Analysis

To complement the evaluation of our relational word vectors on lexical semantics tasks, in this section we provide a qualitative analysis of their intrinsic properties.

6.1 Word Embeddings: Nearest Neighbours

First, we provide an analysis based on the nearest neighbours of selected words in the vector space. Table 3 shows nearest neighbours of our relational word vectors and the standard FastText embeddings[13]. The table shows that our model captures some subtle properties which are not normally encoded in knowledge bases. For example, geometric shapes are clustered together around the sphere vector, unlike in FastText, where more loosely related words such as "dimension" are found. This trend can easily be observed as well in the philology and assassination cases.

[13] Recall from Section 4 that both were trained on Wikipedia with the same dimensionality, i.e., 300.

         SPHERE                    PHILOLOGY                   ASSASSINATION              DIVERSITY
         RWE        FastText       RWE          FastText       RWE          FastText      RWE            FastText
         rectangle  spheres        metaphysics  philological   riot         assassinate   connectedness  cultural diversity
         conic      spherical      pedagogy     philologist    premeditate  attempt       openness       diverse
         hexagon    dimension      docent       literature     bombing      attempts      creativity     genetic diversity

         INTERSECT                    BEHAVIOUR                   CAPABILITY                  EXECUTE
         RWE            FastText       RWE          FastText       RWE          FastText       RWE            FastText
         tracks         intersection   aggressive   behaviour      refueling    capabilities   murder         execution
         northbound     bisect         detrimental  behavioural    miniaturize  capable        interrogation  executed
         northwesterly  intersectional distasteful  misbehaviour   positioning  survivability  incarcerate    summarily executed

Table 3: Nearest neighbours for selected words in our relational word embeddings (RWE) and FastText embeddings.

In the bottom row, we show cases where relational information is somewhat confused with collocationality, leading to undesired clusters, such as intersect being close in the space to "tracks", or behaviour to "aggressive" or "detrimental". These examples thus point towards a clear direction for future work, in terms of explicitly differentiating collocations from other relationships.

6.2 Word Relation Encoding

The ability to capture analogies has proven to be one of the strongest selling points of word embedding research. Simple vector arithmetic, or pairwise similarities (Levy et al., 2014), can be used to capture a surprisingly high number of semantic and syntactic relations. We are thus interested in exploring semantic clusters as they emerge when encoding relations using our relational word vectors. Recall from Section 3.2 that relations are encoded using addition and point-wise multiplication of word vectors.

Table 4 shows, for a small number of selected word pairs, the top nearest neighbours that were unique to our 300-dimensional relational word vectors. Specifically, these pairs were not found among the top 50 nearest neighbours for the FastText word vectors of the same dimensionality, using the standard vector difference encoding. Similarly, we also show the top nearest neighbours that were unique to the FastText word vector difference encoding. As can be observed, our relational word embeddings can capture interesting relationships which go beyond what is purely captured by similarity. For instance, for the pair "innocent-naive" our model includes similar relations such as vain-selfish, honest-hearted or cruel-selfish as nearest neighbours, whereas the nearest neighbours of the standard FastText embeddings are harder to interpret.

Interestingly, even though not explicitly encoded in our model, the table shows some examples that highlight one property that arises often, which is the ability of our model to capture co-hyponyms as relations, e.g., wrist-knee and anger-despair as nearest neighbours of "shoulder-ankle" and "shock-grief", respectively. Finally, one last advantage that we highlight is the fact that our model seems to perform implicit disambiguation by balancing a word's meaning with its paired word. For example, the "oct-feb" relation vector correctly brings together other month abbreviations in our space, whereas in the FastText model its closest neighbour is 'doppler-wheels', a relation which is clearly related to another sense of oct, namely its use as an acronym for 'optical coherence tomography' (a type of x-ray procedure that uses the doppler effect principle).

6.3 Lexical Memorization

One of the main problems of word embedding models performing lexical inference (e.g. hypernymy) is lexical memorization. Levy et al. (2015b) found that the high performance of supervised distributional models in hypernymy detection tasks was due to a memorization, in the training set, of what they refer to as prototypical hypernyms. These prototypical hypernyms are general categories which are likely to be hypernyms (as they occur frequently in the training set) regardless of the hyponym. For instance, these models could equally predict the pairs dog-animal and screen-animal as hyponym-hypernym pairs. To measure the extent to which our model is prone to this problem, we perform a controlled experiment on the lexical split of the HyperLex dataset (Vulić et al., 2017). This lexical split does not contain any word overlap between training and test, and therefore constitutes a reliable setting to measure the generalization capability of embedding models in a controlled setting (Shwartz et al., 2016). In HyperLex, each pair is provided with a score which measures the strength of the hypernymy relation.

         INNOCENT-NAIVE                       POLES-SWEDES                                    SHOULDER-ANKLE
         RWE             FastText             RWE                   FastText                  RWE         FastText
         vain-selfish    murder-young         lithuanians-germans   polish-swedish            wrist-knee  oblique-ligament
         honest-hearted  imprisonment-term    germans-lithuanians   poland-sweden             thigh-knee  pick-ankle injury
         cruel-selfish   conspiracy-minded    russians-lithuanians  czechoslovakia-sweden     neck-knee   suffer-ankle injury

         SHOCK-GRIEF                          STRENGTHEN-TROPICAL CYCLONE                          OCT-FEB
         RWE             FastText             RWE                         FastText                 RWE      FastText
         anger-despair   overcome-sorrow      intensify-tropical cyclone  name-tropical cyclones   aug-nov  doppler-wheels
         anger-sorrow    overcome-despair     weaken-tropical storm       bias-tropical cyclones   sep-nov  scanner-read
         anger-sadness   moment-sadness       intensify-tropical storm    scheme-tropical cyclones nov-sep  ultrasound-baby

Table 4: Three nearest neighbours for selected word pairs using our relational word vectors' relation encoding (RWE) and the standard vector difference encoding of FastText word embeddings. In each column only the word pairs which were among the top 50 nearest neighbours of the given model but not of the other are listed. Relations which include one word from the original pair were not considered.
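For reference, neighbours like those in Table 4 can be retrieved by encoding every candidate pair with the symmetric relation encoding of Section 3.2 and ranking by similarity. The sketch below does this with cosine similarity over a brute-force candidate pool; both of these choices are our own assumptions for illustration, as the paper does not prescribe a specific retrieval procedure.

```python
import numpy as np

def pair_encoding(w, v, emb):
    # Symmetric relation encoding from Section 3.2: (e_w + e_v) ⊕ (e_w ⊙ e_v)
    return np.concatenate([emb[w] + emb[v], emb[w] * emb[v]])

def nearest_pairs(query_pair, candidate_pairs, emb, k=3):
    """Return the k candidate pairs whose encodings are closest (by cosine)
    to the encoding of query_pair."""
    q = pair_encoding(*query_pair, emb)
    q = q / np.linalg.norm(q)
    scored = []
    for pair in candidate_pairs:
        c = pair_encoding(*pair, emb)
        scored.append((float(q @ c) / np.linalg.norm(c), pair))
    return [pair for _, pair in sorted(scored, reverse=True)[:k]]
```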

For these experiments we considered the same experimental setting as described in Section 4. In this case we only considered the portion of the HyperLex training and test sets covered in our vocabulary[14], and used an SVM regression model over the word-based encoded representations. Table 5 shows the results for this experiment. Even though the results are low overall (noting e.g. that results for the random split are in some cases above 50%, as reported in the literature), our model can clearly generalize better than the other models. Interestingly, methods such as Retrofitting and Attract-Repel perform worse than the FastText vectors. This can be attributed to the fact that these models have been mainly tuned towards similarity, which is a feature that loses relevance in this setting. Likewise, the relation-based embeddings of Pair2Vec do not help, probably due to the high capacity of their model, which makes the word embeddings less informative.

[14] Recall from Section 4 that this vocabulary is shared by all comparison systems.

Encoding     Model           r      ρ
Mult+Avg     RWE             38.8   38.4
             Pair2Vec        28.3   26.5
             FastText        37.2   35.8
             Retrofitting†   29.5*  28.9*
             Attract-Repel†  29.7*  28.9*
Mult+Conc    Pair2Vec        29.8   30.0
             FastText        35.7   33.3
Diff (only)  FastText        29.9   30.1

Table 5: Pearson (r) and Spearman (ρ) correlation on a subset of the HyperLex lexical split. Models marked with † use external resources. All models concatenate their encoded representations with the baseline vector difference of standard FastText word embeddings.

7 Conclusions

We have introduced the notion of relational word vectors, and presented an unsupervised method for learning such representations. Parting ways with previous approaches, where relational information was either encoded in terms of relation vectors (which are highly expressive but can be more difficult to use in applications), represented by transforming standard word vectors (which capture relational information only in a limited way), or obtained by taking advantage of external knowledge repositories, we proposed to learn an unsupervised word embedding model that is tailored specifically towards modelling relations. Our model is intended to capture knowledge which is complementary to that of standard similarity-centric embeddings, and can thus be used in combination with them.

We tested the complementarity of our relational word vectors with standard FastText word embeddings on several lexical semantic tasks capturing different levels of relational knowledge. The evaluation indicates that our proposed method indeed results in representations that capture relational knowledge in a more nuanced way. For future work, we would be interested in further exploring the behavior of neural architectures for NLP tasks which would intuitively benefit from having access to relational information, e.g., text classification (Espinosa Anke and Schockaert, 2018; Camacho-Collados et al., 2019) and other language understanding tasks such as natural language inference or reading comprehension, in the line of Joshi et al. (2019).

Acknowledgments

Jose Camacho-Collados and Steven Schockaert were supported by ERC Starting Grant 637277.

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS Workshop, pages 1–10.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5.

Zied Bouraoui, Shoaib Jameel, and Steven Schockaert. 2018. Relation induction in word embeddings revisited. In Proceedings of COLING, pages 1627–1637.

Jose Camacho-Collados, Luis Espinosa-Anke, Shoaib Jameel, and Steven Schockaert. 2019. A latent variable model for learning distributional relation vectors. In Proceedings of IJCAI.

Asli Celikyilmaz, Dilek Hakkani-Tür, Panupong Pasupat, and Ruhi Sarikaya. 2015. Enriching word embeddings using knowledge graph for semantic tagging in conversational dialog systems. In AAAI Spring Symposium.

Jiaqiang Chen, Niket Tandon, Charles Darwis Hariman, and Gerard de Melo. 2016. Webbrain: Joint neural learning of large-scale commonsense knowledge. In Proceedings of ISWC, pages 102–118.

Luis Espinosa Anke and Steven Schockaert. 2018. SeVeN: Augmenting word embeddings with unsupervised relation vectors. In Proceedings of COLING, pages 2653–2665.

Miao Fan, Kai Cao, Yifan He, and Ralph Grishman. 2015. Jointly embedding relations and mentions for knowledge population. In Proceedings of RANLP, pages 186–191.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard H. Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL, pages 1606–1615.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Database. MIT Press, Cambridge, MA.

Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1199–1209.

Abhijeet Gupta, Gemma Boleda, Marco Baroni, and Sebastian Padó. 2015. Distributional vectors encode referential attributes. In Proceedings of EMNLP, pages 12–21.

Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2015. Task-oriented learning of word embeddings for semantic relation classification. In Proceedings of CoNLL, pages 268–278.

Shoaib Jameel, Zied Bouraoui, and Steven Schockaert. 2018. Unsupervised learning of distributional relation vectors. In Proceedings of ACL, pages 23–33.

Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. 2019. pair2vec: Compositional word-pair embeddings for cross-sentence inference. In Proceedings of NAACL.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015a. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Omer Levy, Yoav Goldberg, and Israel Ramat-Gan. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL, pages 171–180.

Omer Levy, Steffen Remus, Chris Biemann, Ido Dagan, and Israel Ramat-Gan. 2015b. Do supervised distributional methods really learn lexical inference relations? In Proceedings of NAACL.

Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18.

Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL, pages 1501–1511.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4):547–559.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS, pages 3111–3119.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751.

George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.

Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association of Computational Linguistics, 5(1):309–324.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Kim Anh Nguyen, Maximilian Köper, Sabine Schulte im Walde, and Ngoc Thang Vu. 2017. Hierarchical embeddings for hypernymy detection and directionality. In Proceedings of EMNLP, pages 233–243.

Masataka Ono, Makoto Miwa, and Yutaka Sasaki. 2015. Word embedding-based antonym detection using thesauri and distributional information. In Proceedings of NAACL-HLT, pages 984–989.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Anna Rogers, Aleksandr Drozd, and Bofang Li. 2017. The (too many) problems of analogical reasoning with word vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 135–148.

Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING, pages 1025–1036.

Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. 2015. How well do distributional models capture different types of semantic knowledge? In Proceedings of ACL, pages 726–730.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2389–2398.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of AAAI, pages 4444–4451.

Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of EMNLP, pages 2049–2054.

Peter D. Turney. 2005. Measuring semantic similarity by latent relational analysis. In Proceedings of IJCAI, pages 1136–1141.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Tu Vu and Vered Shwartz. 2018. Integrating multiplicative features into supervised distributional methods for lexical entailment. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 160–166.

Ivan Vulić, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017. HyperLex: A large-scale evaluation of graded lexical entailment. Computational Linguistics, 43(4):781–835.

Ivan Vulić and Nikola Mrkšić. 2018. Specialising word vectors for lexical entailment. In Proceedings of NAACL-HLT, pages 1134–1145.

Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and Timothy Baldwin. 2016. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning. In Proceedings of ACL, pages 1671–1682.

Koki Washio and Tsuneaki Kato. 2018a. Filling missing paths: Modeling co-occurrences of word pairs and dependency paths for recognizing lexical semantic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1123–1133.

Koki Washio and Tsuneaki Kato. 2018b. Neural latent relational analysis to capture lexical semantic relations in a vector space. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 594–600.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of the 25th International Conference on Computational Linguistics, pages 2249–2259.

Jason Weston, Antoine Bordes, Oksana Yakhnenko, and Nicolas Usunier. 2013. Connecting language and knowledge bases with embedding models for relation extraction. In Proceedings of EMNLP, pages 1366–1371.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Proceedings of ICLR.

C. Xu, Y. Bai, J. Bian, B. Gao, G. Wang, X. Liu, and T.-Y. Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM, pages 1219–1228.