Retrofitting Word Vectors to Semantic Lexicons

Manaal Faruqui   Jesse Dodge   Sujay K. Jauhar   Chris Dyer   Eduard Hovy   Noah A. Smith
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA, 15213, USA
{mfaruqui,jessed,sjauhar,cdyer,ehovy,nasmith}@cs.cmu.edu

Abstract

Vector space word representations are learned from distributional information of words in large corpora. Although such statistics are semantically informative, they disregard the valuable information that is contained in semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database. This paper proposes a method for refining vector space representations using relational information from semantic lexicons by encouraging linked words to have similar vector representations, and it makes no assumptions about how the input vectors were constructed. Evaluated on a battery of standard lexical semantic evaluation tasks in several languages, we obtain substantial improvements starting with a variety of word vector models. Our refinement method outperforms prior techniques for incorporating semantic lexicons into word vector training algorithms.

1 Introduction

Data-driven learning of word vectors that capture lexico-semantic information is a technique of central importance in NLP. These word vectors can in turn be used for identifying semantically related word pairs (Turney, 2006; Agirre et al., 2009) or as features in downstream applications (Turian et al., 2010; Guo et al., 2014). A variety of approaches for constructing vector space embeddings of vocabularies are in use, notably including taking low rank approximations of cooccurrence statistics (Deerwester et al., 1990) and using internal representations from neural network models of word sequences (Collobert and Weston, 2008).

Because of their value as lexical semantic representations, there has been much research on improving the quality of vectors. Semantic lexicons, which provide type-level information about the semantics of words, typically by identifying synonymy, hypernymy, hyponymy, and paraphrase relations, should be a valuable resource for improving the quality of word vectors that are trained solely on unlabeled corpora. Examples of such resources include WordNet (Miller, 1995), FrameNet (Baker et al., 1998) and the Paraphrase Database (Ganitkevitch et al., 2013).

Recent work has shown that the quality of word vectors can be improved either by changing the objective of the word vector training algorithm in neural language models (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Fried and Duh, 2014) or by relation-specific augmentation of the cooccurrence matrix in spectral word vector models to incorporate semantic knowledge (Yih et al., 2012; Chang et al., 2013). However, these methods are limited to particular methods for constructing vectors.

The contribution of this paper is a graph-based learning technique for using lexical relational resources to obtain higher quality semantic vectors, which we call "retrofitting." In contrast to previous work, retrofitting is applied as a post-processing step by running belief propagation on a graph constructed from lexicon-derived relational information to update word vectors (§2). This allows retrofitting to be used on pre-trained word vectors obtained using any vector training model. Intuitively, our method encourages the new vectors to be (i) similar to the vectors of related word types and (ii) similar to their purely distributional representations. The retrofitting process is fast, taking about 5 seconds for a graph of 100,000 words and vector length 300, and its runtime is independent of the original word vector training model.


Experimentally, we show that our method works well with different state-of-the-art word vector models, using different kinds of semantic lexicons, and gives substantial improvements on a variety of benchmarks, while beating the current state-of-the-art approaches for incorporating semantic information in vector training, and trivially extends to multiple languages. We show that retrofitting gives consistent improvement in performance on evaluation benchmarks with different word vector lengths and show a qualitative visualization of the effect of retrofitting on word vector quality. The retrofitting tool is available at: https://github.com/mfaruqui/retrofitting.

2 Retrofitting with Semantic Lexicons

Let V = {w_1, ..., w_n} be a vocabulary, i.e., the set of word types, and Ω be an ontology that encodes semantic relations between words in V. We represent Ω as an undirected graph (V, E) with one vertex for each word type and edges (w_i, w_j) ∈ E ⊆ V × V indicating a semantic relationship of interest. These relations differ for different semantic lexicons and are described later (§4).

The matrix Q̂ will be the collection of vector representations q̂_i ∈ R^d, for each w_i ∈ V, learned using a standard data-driven technique, where d is the length of the word vectors. Our objective is to learn the matrix Q = (q_1, ..., q_n) such that the columns are both close (under a distance metric) to their counterparts in Q̂ and to adjacent vertices in Ω. Figure 1 shows a small word graph with such edge connections; white nodes are labeled with the Q vectors to be retrofitted (and correspond to V_Ω); shaded nodes are labeled with the corresponding vectors in Q̂, which are observed. The graph can be interpreted as a Markov random field (Kindermann and Snell, 1980).

Figure 1: Word graph with edges between related words showing the observed (grey) and the inferred (white) word vector representations.

The distance between a pair of vectors is defined to be the Euclidean distance. Since we want the inferred word vector to be close to the observed value q̂_i and close to its neighbors q_j, for all j such that (i, j) ∈ E, the objective to be minimized becomes:

\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \|q_i - \hat{q}_i\|^2 + \sum_{(i,j) \in E} \beta_{ij} \|q_i - q_j\|^2 \Big]

where α and β values control the relative strengths of associations (more details in §6.1).

In this case, we first train the word vectors independent of the information in the semantic lexicons and then retrofit them. Ψ is convex in Q and its solution can be found by solving a system of linear equations. To do so, we use an efficient iterative updating method (Bengio et al., 2006; Subramanya et al., 2010; Das and Petrov, 2011; Das and Smith, 2011). The vectors in Q are initialized to be equal to the vectors in Q̂. We take the first derivative of Ψ with respect to one q_i vector, and by equating it to zero arrive at the following online update:

q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i}   (1)

In practice, running this procedure for 10 iterations converges to changes in Euclidean distance of adjacent vertices of less than 10^{-2}. The retrofitting approach described above is modular; it can be applied to word vector representations obtained from any model, as the updates in Eq. 1 are agnostic to the original vector training model objective.
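
To make the update concrete, here is a minimal Python sketch of Eq. 1 with all α_i set to 1 and β_ij = degree(i)^{-1} (the setting used in our experiments, §4); the function name and data structures are illustrative, and this is a sketch rather than the released retrofitting tool.

```python
import numpy as np

def retrofit(q_hat, lexicon, alpha=1.0, n_iters=10):
    """Sketch of the iterative update in Eq. 1.

    q_hat   : dict word -> np.ndarray, the observed (purely distributional) vectors
    lexicon : dict word -> iterable of lexicon neighbours
    alpha   : weight tying each retrofitted vector to its observed vector
    Uses beta_ij = 1 / degree(i), as in the experiments reported here.
    """
    q = {w: v.copy() for w, v in q_hat.items()}   # initialize Q to Q-hat
    for _ in range(n_iters):                      # ~10 iterations suffice in practice
        for w in q:
            nbrs = [u for u in lexicon.get(w, []) if u in q]
            if not nbrs:
                continue                          # words with no lexicon neighbours keep q_hat
            beta = 1.0 / len(nbrs)
            # Eq. 1: weighted combination of the neighbours and the observed vector
            q[w] = (alpha * q_hat[w] + beta * sum(q[u] for u in nbrs)) / (alpha + beta * len(nbrs))
    return q
```

With these weights, each retrofitted vector is simply the average of its observed vector and the mean of its neighbours' current vectors, which is why the procedure converges quickly.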

Semantic Lexicons during Learning. Our proposed approach is reminiscent of recent work on improving word vectors using lexical resources (Yu and Dredze, 2014; Bian et al., 2014; Xu et al., 2014), which alters the learning objective of the original vector training model with a prior (or a regularizer) that encourages semantically related vectors (in Ω) to be close together, except that our technique is applied as a second stage of learning. We describe the prior approach here since it will serve as a baseline. Here, semantic lexicons play the role of a prior on Q, which we define as follows:

p(Q) \propto \exp\Big( -\gamma \sum_{i=1}^{n} \sum_{j:(i,j) \in E} \beta_{ij} \|q_i - q_j\|^2 \Big)   (2)

Here, γ is a hyperparameter that controls the strength of the prior. As in the retrofitting objective, this prior on the word vector parameters forces words connected in the lexicon to have close vector representations, as did Ψ(Q) (with the role of Q̂ being played by cross entropy of the empirical distribution).

This prior can be incorporated during learning through maximum a posteriori (MAP) estimation. Since there is no closed form solution of the estimate, we consider two iterative procedures. In the first, we use the sum of gradients of the log-likelihood (given by the extant vector learning model) and the log-prior (from Eq. 2), with respect to Q for learning. Since computing the gradient of Eq. 2 has linear runtime in the vocabulary size n, we use lazy updates (Carpenter, 2008) for every k words during training. We call this the lazy method of MAP. The second technique applies stochastic gradient ascent to the log-likelihood, and after every k words applies the update in Eq. 1. We call this the periodic method. We later experimentally compare these methods against retrofitting (§6.2).
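
For comparison with retrofitting, here is a sketch of the log-prior gradient that the lazy method adds to the likelihood gradient, under the same adjacency structure as above; the handling of symmetric edge terms and the interleaving with the language-model update are simplifying assumptions.

```python
import numpy as np

def log_prior_gradient(q, w, neighbours, beta, gamma):
    """Gradient of the log-prior in Eq. 2 with respect to q_w.

    q          : dict word -> current vector
    neighbours : lexicon neighbours of w
    beta       : scalar edge weight (e.g. 1 / degree(w)); symmetric edge terms folded in
    gamma      : prior strength
    """
    grad = np.zeros_like(q[w])
    for u in neighbours:
        grad += beta * (q[w] - q[u])
    return -2.0 * gamma * grad

# Lazy MAP sketch: every k training words, take a gradient-ascent step on the prior
# for the affected vectors, in addition to the usual log-likelihood update:
#   q[w] += learning_rate * log_prior_gradient(q, w, lexicon[w], beta, gamma)
```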

3 Word Vector Representations

We now describe the various publicly available pre-trained English word vectors on which we will test the applicability of the retrofitting model. These vectors have been chosen to have a balanced mix between large and small amounts of unlabeled text as well as between neural and spectral methods of training word vectors.

Glove Vectors. Global vectors for word representations (Pennington et al., 2014) are trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations show interesting linear substructures of the word vector space. These vectors were trained on 6 billion words from Wikipedia and English Gigaword and are of length 300.[1]

Skip-Gram Vectors (SG). The word2vec tool (Mikolov et al., 2013a) is fast and currently in wide use. In this model, each word's Huffman code is used as an input to a log-linear classifier with a continuous projection layer, and words within a given context window are predicted. The available vectors are trained on 100 billion words of the Google News dataset and are of length 300.[2]

Global Context Vectors (GC). These vectors are learned using a recursive neural network that incorporates both local and global (document-level) context features (Huang et al., 2012). These vectors were trained on the first 1 billion words of English Wikipedia and are of length 50.[3]

Multilingual Vectors (Multi). Faruqui and Dyer (2014) learned vectors by first performing SVD on text in different languages, then applying canonical correlation analysis (CCA) on pairs of vectors for words that align in parallel corpora. The monolingual vectors were trained on the WMT-2011 news corpus for English, French, German and Spanish. We use the English word vectors projected in the common English–German space. The monolingual English WMT corpus had 360 million words and the trained vectors are of length 512.[4]

[1] http://www-nlp.stanford.edu/projects/glove/
[2] https://code.google.com/p/word2vec
[3] http://nlp.stanford.edu/~socherr/ACL2012_wordVectorsTextFile.zip
[4] http://cs.cmu.edu/~mfaruqui/soft.html

4 Semantic Lexicons

We use three different semantic lexicons to evaluate their utility in improving the word vectors. We include both manually and automatically created lexicons. Table 1 shows the size of the graphs obtained from these lexicons.

Lexicon      Words    Edges
PPDB         102,902  374,555
WordNet_syn  148,730  304,856
WordNet_all  148,730  934,705
FrameNet      10,822  417,456

Table 1: Approximate size of the graphs obtained from different lexicons.

PPDB. The paraphrase database (Ganitkevitch et al., 2013) is a semantic lexicon containing more than 220 million paraphrase pairs of English.[5] Of these, 8 million are lexical (single word to single word) paraphrases. The key intuition behind the acquisition of its lexical paraphrases is that two words in one language that align, in parallel text, to the same word in a different language, should be synonymous. For example, if the words jailed and imprisoned are translated as the same word in another language, it may be reasonable to assume they have the same meaning. In our experiments, we instantiate an edge in E for each lexical paraphrase in PPDB. The lexical paraphrase dataset comes in different sizes ranging from S to XXXL, in decreasing order of paraphrasing confidence and increasing order of size. We chose XL for our experiments. We want to give higher edge weights (α_i) connecting the retrofitted word vectors (q) to the purely distributional word vectors (q̂) than to edges connecting the retrofitted vectors to each other (β_ij), so all α_i are set to 1 and β_ij to be degree(i)^{-1} (with i being the node the update is being applied to).[6]

[5] http://www.cis.upenn.edu/~ccb/ppdb
[6] In principle, these hyperparameters can be tuned to optimize performance on a particular task, which we leave for future work.

WordNet. WordNet (Miller, 1995) is a large human-constructed semantic lexicon of English words. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between synsets. This database is structured in a graph particularly suitable for our task because it explicitly relates concepts with semantically aligned relations such as hypernyms and hyponyms. For example, the word dog is a synonym of canine, a hypernym of puppy and a hyponym of animal. We perform two different experiments with WordNet: (1) connecting a word only to synonyms, and (2) connecting a word to synonyms, hypernyms and hyponyms. We refer to these two graphs as WNsyn and WNall, respectively. In both settings, all α_i are set to 1 and β_ij to be degree(i)^{-1}.

FrameNet. FrameNet (Baker et al., 1998; Fillmore et al., 2003) is a rich linguistic resource containing information about lexical and predicate-argument semantics in English. Frames can be realized on the surface by many different word types, which suggests that the word types evoking the same frame should be semantically related. For example, the frame Cause change of position on a scale is associated with push, raise, and growth (among many others). In our use of FrameNet, two words that group together with any frame are given an edge in E. We refer to this graph as FN. All α_i are set to 1 and β_ij to be degree(i)^{-1}.
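
As an illustration of how such a graph can be assembled, the sketch below builds the adjacency structure from a plain file of related word pairs restricted to the vector vocabulary; the file format and function name are assumptions, not a description of any particular lexicon release.

```python
from collections import defaultdict

def build_lexicon_graph(path, vocabulary):
    """Build an undirected lexicon graph restricted to words we have vectors for.

    path       : text file with one related pair per line: "word1 word2"
    vocabulary : set of words present in the pre-trained vectors
    Returns a dict word -> set of neighbours; with all alpha_i = 1, the edge
    weights beta_ij = 1 / degree(i) follow directly from the neighbour counts.
    """
    edges = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            w1, w2 = parts
            if w1 in vocabulary and w2 in vocabulary and w1 != w2:
                edges[w1].add(w2)   # undirected: add the edge in both directions
                edges[w2].add(w1)
    return edges
```
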
5 Evaluation Benchmarks

We evaluate the quality of our word vector representations on tasks that test how well they capture both semantic and syntactic aspects of the representations, along with an extrinsic task.

Word Similarity. We evaluate our word representations on a variety of different benchmarks that have been widely used to measure word similarity. The first one is the WS-353 dataset (Finkelstein et al., 2001), containing 353 pairs of English words that have been assigned similarity ratings by humans. The second benchmark is the RG-65 dataset (Rubenstein and Goodenough, 1965), which contains 65 pairs of nouns. Since the commonly used word similarity datasets contain a small number of word pairs, we also use the MEN dataset (Bruni et al., 2012) of 3,000 word pairs sampled from words that occur at least 700 times in a large web corpus. We calculate cosine similarity between the vectors of two words forming a test item, and report Spearman's rank correlation coefficient (Myers and Well, 1995) between the rankings produced by our model and the human rankings.

Syntactic Relations (SYN-REL). Mikolov et al. (2013b) present a syntactic relation dataset composed of analogous word pairs. It contains pairs of tuples of word relations that follow a common syntactic relation. For example, given walking and walked, the words are differently inflected forms of the same verb. There are nine different kinds of relations and overall there are 10,675 syntactic pairs of word tuples. The task is to find a word d that best fits the following relationship: "a is to b as c is to d," given a, b, and c. We use the vector offset method (Mikolov et al., 2013a; Levy and Goldberg, 2014), computing q = q_a − q_b + q_c and returning the vector from Q which has the highest cosine similarity to q.
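
Both intrinsic evaluations above reduce to a few lines; this sketch assumes SciPy and a dict of vectors, and excluding the three query words from the analogy search is a common convention rather than something specified here.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_similarity_eval(q, pairs_with_gold):
    """Spearman's rank correlation between cosine similarities and human ratings.

    pairs_with_gold : list of (word1, word2, human_score) with both words in q
    """
    system = [cosine(q[w1], q[w2]) for w1, w2, _ in pairs_with_gold]
    gold = [score for _, _, score in pairs_with_gold]
    return spearmanr(system, gold).correlation

def solve_analogy(q, a, b, c):
    """Vector offset method for "a is to b as c is to d": q = q_a - q_b + q_c."""
    target = q[a] - q[b] + q[c]
    candidates = (w for w in q if w not in {a, b, c})   # excluding query words is conventional
    return max(candidates, key=lambda w: cosine(q[w], target))
```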

Synonym Selection (TOEFL). The TOEFL synonym selection task is to select the semantically closest word to a target from a list of four candidates (Landauer and Dumais, 1997). The dataset contains 80 such questions. An example is "rug → {sofa, ottoman, carpet, hallway}", with carpet being the most synonym-like candidate to the target.

Sentiment Analysis (SA). Socher et al. (2013) created a treebank containing sentences annotated with fine-grained sentiment labels on phrases and sentences from movie review excerpts. The coarse-grained treebank of positive and negative classes has been split into training, development, and test datasets containing 6,920, 872, and 1,821 sentences, respectively. We train an ℓ2-regularized logistic regression classifier on the average of the word vectors of a given sentence to predict the coarse-grained sentiment tag at the sentence level, and report the test-set accuracy of the classifier.
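
The sentiment classifier over averaged word vectors can be sketched with scikit-learn as follows; the whitespace tokenization and the handling of out-of-vocabulary words are simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_features(q, sentences, dim):
    """Represent each sentence as the average of its word vectors (zeros if none are known)."""
    feats = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        vecs = [q[w] for w in sent.lower().split() if w in q]
        if vecs:
            feats[i] = np.mean(vecs, axis=0)
    return feats

def train_sentiment(q, train_sents, train_labels, dim):
    # l2-regularized logistic regression on averaged vectors, labels are 0/1
    clf = LogisticRegression(penalty="l2", max_iter=1000)
    clf.fit(sentence_features(q, train_sents, dim), train_labels)
    return clf
```
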
Table 2 shows the absolute changes in X∈ performance on different tasks (as columns) with We optimize the above likelihood combined with the different semantic lexicons (as rows). All of the lexi- prior defined in Eq. 2 using the lazy and periodic cons offer high improvements on the word similarity techniques described in 2. Since it is costly to com- § tasks (the first three columns). On the TOEFL task, pute the partition function over the whole vocab- we observe large improvements of the order of 10 ulary, we use noise constrastive estimation (NCE) absolute points in accuracy for all lexicons except to estimate the parameters of the model (Mnih and for FrameNet. FrameNet’s performance is weaker, Teh, 2012) using AdaGrad (Duchi et al., 2010) with in some cases leading to worse performance (e.g., a learning rate of 0.05.

Lexicon  MEN-3k  RG-65  WS-353  TOEFL  SYN-REL  SA
Glove    73.7    76.7   60.5    89.7   67.0     79.6
+PPDB    1.4     2.9    –1.2    5.1    –0.4     1.6
+WNsyn   0.0     2.7    0.5     5.1    –12.4    0.7
+WNall   2.2     7.5    0.7     2.6    –8.4     0.5
+FN      –3.6    –1.0   –5.3    2.6    –7.0     0.0
SG       67.8    72.8   65.6    85.3   73.9     81.2
+PPDB    5.4     3.5    4.4     10.7   –2.3     0.9
+WNsyn   0.7     3.9    0.0     9.3    –13.6    0.7
+WNall   2.5     5.0    1.9     9.3    –10.7    –0.3
+FN      –3.2    2.6    –4.9    1.3    –7.3     0.5
GC       31.3    62.8   62.3    60.8   10.9     67.8
+PPDB    7.0     6.1    2.0     13.1   5.3      1.1
+WNsyn   3.6     6.4    0.6     7.3    –1.7     0.0
+WNall   6.7     10.2   2.3     4.4    –0.6     0.2
+FN      1.8     4.0    0.0     4.4    –0.6     0.2
Multi    75.8    75.5   68.1    84.0   45.5     81.0
+PPDB    3.8     4.0    6.0     12.0   4.3      0.6
+WNsyn   1.2     0.2    2.2     6.6    –12.3    1.4
+WNall   2.9     8.5    4.3     6.6    –10.6    1.4
+FN      1.8     4.0    0.0     4.4    –0.6     0.2

Table 2: Absolute performance changes with retrofitting. Spearman’s correlation (3 left columns) and accuracy (3 right columns) on different tasks. Higher scores are always better. Bold indicates greatest improvement for a vector type.

Method              k, γ           MEN-3k  RG-65  WS-353  TOEFL  SYN-REL  SA
LBL (Baseline)      k = ∞, γ = 0   58.0    42.7   53.6    66.7   31.5     72.5
LBL + Lazy          γ = 1          –0.4    4.2    0.6     –0.1   0.6      1.2
                    γ = 0.1        0.7     8.1    0.4     –1.4   0.7      0.8
                    γ = 0.01       0.7     9.5    1.7     2.6    1.9      0.4
LBL + Periodic      k = 100M       3.8     18.4   3.6     12.0   4.8      1.3
                    k = 50M        3.4     19.5   4.4     18.6   0.6      1.9
                    k = 25M        0.5     18.1   2.7     21.3   –3.7     0.8
LBL + Retrofitting  –              5.7     15.6   5.5     18.6   14.7     0.9

Table 3: Absolute performance changes for including PPDB information while training LBL vectors. Spearman’s correlation (3 left columns) and accuracy (3 right columns) on different tasks. Bold indicates greatest improvement.

We train vectors of length 100 on the WMT-2011 news corpus, which contains 360 million words, and use PPDB as the semantic lexicon as it performed reasonably well in the retrofitting experiments (§6.1). For the lazy method we update with respect to the prior every k = 100,000 words[7] and test for different values of prior strength γ ∈ {1, 0.1, 0.01}. For the periodic method, we update the word vectors using Eq. 1 every k ∈ {25, 50, 100} million words.

Results. See Table 3. For lazy, γ = 0.01 performs best, but the method is in most cases not highly sensitive to γ's value. For periodic, which overall leads to greater improvements over the baseline than lazy, k = 50M performs best, although all other values of k also outperform the baseline. Retrofitting, which can be applied to any word vectors, regardless of how they are trained, is competitive and sometimes better.

[7] k = 10,000 or 50,000 yielded similar results.

Corpus     Vector Training        MEN-3k  RG-65  WS-353  TOEFL  SYN-REL  SA
WMT-11     CBOW                   55.2    44.8   54.7    73.3   40.8     74.1
           Yu and Dredze (2014)   50.1    47.1   53.7    61.3   29.9     71.5
           CBOW + Retrofitting    60.5    57.7   58.4    81.3   52.5     75.7
Wikipedia  SG                     76.1    66.7   68.6    72.0   40.3     73.1
           Xu et al. (2014)       –       –      68.3    –      44.4     –
           SG + Retrofitting      65.7    73.9   67.5    86.0   49.9     74.6

Table 4: Comparison of retrofitting for semantic enrichment against Yu and Dredze (2014), Xu et al. (2014). Spear- man’s correlation (3 left columns) and accuracy (3 right columns) on different tasks.

6.3 Comparisons to Prior Work

Two previous models (Yu and Dredze, 2014; Xu et al., 2014) have shown that the quality of word vectors obtained using the word2vec tool can be improved by using semantic knowledge from lexicons. Both these models use constraints among words as a regularization term on the training objective during training, and their methods can only be applied for improving the quality of SG and CBOW vectors produced by the word2vec tool. We compared the quality of our vectors against each of these.

Yu and Dredze (2014). We train word vectors using their joint model training code[8] while using exactly the same training settings as specified in their best model: CBOW, vector length 100, and PPDB for enrichment. The results are shown in the top half of Table 4, where our model consistently outperforms the baseline and their model.

Xu et al. (2014). This model extracts categorical and relational knowledge among words from Freebase[9] and uses it as a constraint while training. Unfortunately, neither their word embeddings nor model training code is publicly available, so we train the SG model by using exactly the same settings as described in their system (vector length 300) and on the same corpus: monolingual English Wikipedia text.[10] We compare the performance of our retrofitted vectors on the SYN-REL and WS-353 tasks against the best model[11] reported in their paper. As shown in the lower half of Table 4, our model outperforms their model by an absolute 5.5 points on the SYN-REL task, but gets a slightly inferior score on the WS-353 task.

6.4 Multilingual Evaluation

We tested our method on three additional languages: German, French, and Spanish. We used the Universal WordNet (de Melo and Weikum, 2009), an automatically constructed multilingual lexical knowledge base based on WordNet.[12] It contains words connected via different lexical relations to other words both within and across languages. We construct separate graphs for different languages (i.e., only linking words to other words in the same language) and apply retrofitting to each. Since not many word similarity evaluation benchmarks are available for languages other than English, we tested our baseline and improved vectors on one benchmark per language.

We used RG-65 (Gurevych, 2005), RG-65 (Joubarne and Inkpen, 2011) and MC-30 (Hassan and Mihalcea, 2009) for German, French and Spanish, respectively.[13] We trained SG vectors for each language of length 300 on a corpus of 1 billion tokens, each extracted from Wikipedia, and evaluate them on word similarity on the benchmarks before and after retrofitting. Table 5 shows that we obtain high improvements, which strongly indicates that our method generalizes across these languages.

[8] https://github.com/Gorov/JointRCM
[9] https://www.freebase.com
[10] http://mattmahoney.net/dc/enwik9.zip
[11] Their best model is named "RC-NET" in their paper.
[12] http://www.mpi-inf.mpg.de/yago-naga/uwn
[13] These benchmarks were created by translating the corresponding English benchmarks word by word manually.

Figure 3: Two-dimensional PCA projections of 100-dimensional SG vector pairs holding the "adjective to adverb" relation, before (left) and after (right) retrofitting.

Language  Task   SG    Retrofitted SG
German    RG-65  53.4  60.3
French    RG-65  46.7  60.6
Spanish   MC-30  54.0  59.1

Table 5: Spearman's correlation for word similarity evaluation using the original and retrofitted SG vectors.

7 Further Analysis

Retrofitting vs. vector length. With more dimensions, word vectors might be able to capture higher orders of semantic information and retrofitting might be less helpful. We train SG vectors on 1 billion English tokens for vector lengths ranging from 50 to 1,000 and evaluate on the MEN word similarity task. We retrofit these vectors to PPDB (§4) and evaluate those on the same task. Figure 2 shows consistent improvement in vector quality across different vector lengths.

Figure 2: Spearman's correlation on the MEN word similarity task, before and after retrofitting.

Visualization. We randomly select eight word pairs that have the "adjective to adverb" relation from the SYN-REL task (§5). We then take a two-dimensional PCA projection of the 100-dimensional SG word vectors and plot them in R². In Figure 3 we plot these projections before (left) and after (right) retrofitting. It can be seen that in the first case the direction of the analogy vectors is not consistent, but after retrofitting all the analogy vectors are aligned in the same direction.
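
A projection of this kind can be produced with a few lines; this sketch assumes scikit-learn and matplotlib, and the example word pairs are placeholders rather than the eight pairs actually sampled.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_analogy_pairs(q, pairs, title):
    """Project the vectors of related word pairs to 2-D and draw one arrow per pair."""
    words = [w for pair in pairs for w in pair]
    points = PCA(n_components=2).fit_transform(np.vstack([q[w] for w in words]))
    coords = dict(zip(words, points))
    plt.scatter(points[:, 0], points[:, 1], s=5)      # sets sensible axis limits
    for a, b in pairs:
        (x1, y1), (x2, y2) = coords[a], coords[b]
        plt.annotate("", xy=(x2, y2), xytext=(x1, y1), arrowprops=dict(arrowstyle="->"))
        plt.text(x1, y1, a)
        plt.text(x2, y2, b)
    plt.title(title)
    plt.show()

# e.g. plot_analogy_pairs(retrofitted_vectors, [("slow", "slowly"), ("quick", "quickly")], "after retrofitting")
```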

8 Related Work

The use of lexical semantic information in training word vectors has been limited. Recently, word similarity knowledge (Yu and Dredze, 2014; Fried and Duh, 2014) and word relational knowledge (Xu et al., 2014; Bian et al., 2014) have been used to improve the word2vec embeddings in a joint training model similar to our regularization approach. In latent semantic analysis, the word cooccurrence matrix can be constructed to incorporate relational information like antonym-specific polarity induction (Yih et al., 2012) and multi-relational latent semantic analysis (Chang et al., 2013).

The approach we propose is conceptually similar to previous work that uses graph structures to propagate information among semantic concepts (Zhu, 2005; Culp and Michailidis, 2008). Graph-based belief propagation has also been used to induce POS tags (Subramanya et al., 2010; Das and Petrov, 2011) and semantic frame associations (Das and Smith, 2011). In those efforts, labels for unknown words were inferred using a method similar to ours. Broadly, graph-based semi-supervised learning (Zhu, 2005; Talukdar and Pereira, 2010) has been applied to machine translation (Alexandrescu and Kirchhoff, 2009), unsupervised semantic role induction (Lang and Lapata, 2011), semantic document modeling (Schuhmacher and Ponzetto, 2014), language generation (Krahmer et al., 2003) and sentiment analysis (Goldberg and Zhu, 2006).

9 Conclusion

We have proposed a simple and effective method named retrofitting to improve word vectors using word relation knowledge found in semantic lexicons. Retrofitting is used as a post-processing step to improve vector quality and is more modular than other approaches that use semantic information while training. It can be applied to vectors obtained from any word vector training method. Our experiments explored the method's performance across tasks, semantic lexicons, and languages and showed that it outperforms existing alternatives. The retrofitting tool is available at: https://github.com/mfaruqui/retrofitting.

Acknowledgements

This research was supported in part by the National Science Foundation under grants IIS-1143703, IIS-1147810, and IIS-1251131; by IARPA via Department of Interior National Business Center (DoI/NBC) contract number D12PC00337; and by DARPA under grant FA87501220342. Part of the computational work was carried out on resources provided by the Pittsburgh Supercomputing Center. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, DARPA, or the U.S. Government.
References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL.
Andrei Alexandrescu and Katrin Kirchhoff. 2009. Graph-based learning for statistical machine translation. In Proceedings of NAACL.
Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of ACL.
Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. 2006. Label propagation and quadratic criterion. In Semi-Supervised Learning.
Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Machine Learning and Knowledge Discovery in Databases.
Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of ACL.
Bob Carpenter. 2008. Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. Technical report, Alias-i, Inc.
Kai-Wei Chang, Wen-tau Yih, and Christopher Meek. 2013. Multi-relational latent semantic analysis. In Proceedings of EMNLP.
Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of ICML.
Mark Culp and George Michailidis. 2008. Graph-based semisupervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL.
Dipanjan Das and Noah A. Smith. 2011. Semi-supervised frame-semantic parsing for unknown predicates. In Proceedings of ACL.
Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of CIKM.
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science.
John Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. Technical Report UCB/EECS-2010-24, March.
Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.
Charles Fillmore, Christopher Johnson, and Miriam Petruck. 2003. Background to FrameNet. International Journal of Lexicography.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: the concept revisited. In Proceedings of WWW.
Daniel Fried and Kevin Duh. 2014. Incorporating both distributional and relational semantics in word representations. arXiv preprint arXiv:1412.4369.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL.
Andrew B. Goldberg and Xiaojin Zhu. 2006. Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization. In TextGraphs-1.
Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of EMNLP.
Iryna Gurevych. 2005. Using the structure of a conceptual network in computing semantic relatedness. In Proceedings of IJCNLP.
Samer Hassan and Rada Mihalcea. 2009. Cross-lingual semantic relatedness using encyclopedic knowledge. In Proceedings of EMNLP.
Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL.
Colette Joubarne and Diana Inkpen. 2011. Comparison of semantic similarity for different languages using the Google n-gram corpus and second-order co-occurrence measures. In Proceedings of CAAI.
Ross Kindermann and J. L. Snell. 1980. Markov Random Fields and Their Applications. AMS.
Emiel Krahmer, Sebastian van Erk, and André Verleg. 2003. Graph-based generation of referring expressions. Computational Linguistics.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review.
Joel Lang and Mirella Lapata. 2011. Unsupervised semantic role induction with graph partitioning. In Proceedings of EMNLP.
Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of NAACL.
George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM.
Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of ICML.
Jerome L. Myers and Arnold D. Well. 1995. Research Design & Statistical Analysis. Routledge.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP.
Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, October.
Michael Schuhmacher and Simone Paolo Ponzetto. 2014. Knowledge-based graph document modeling. In Proceedings of WSDM.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
Amarnag Subramanya, Slav Petrov, and Fernando Pereira. 2010. Efficient graph-based semi-supervised learning of structured tagging models. In Proceedings of EMNLP.
Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of ACL.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL.
Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416, September.
Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM.
Wen-tau Yih, Geoffrey Zweig, and John C. Platt. 2012. Polarity inducing latent semantic analysis. In Proceedings of EMNLP.
Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL.
Xiaojin Zhu. 2005. Semi-supervised Learning with Graphs. Ph.D. thesis, Pittsburgh, PA, USA. AAI3179046.
