
Multi-Relational Latent Semantic Analysis

Kai-Wei Chang∗
University of Illinois, Urbana, IL 61801, USA
[email protected]

Wen-tau Yih, Christopher Meek
Microsoft Research, Redmond, WA 98052, USA
{scottyih,meek}@microsoft.com

∗ Work conducted while interning at Microsoft Research.

Abstract

We present Multi-Relational Latent Semantic Analysis (MRLSA), which generalizes Latent Semantic Analysis (LSA). MRLSA provides an elegant approach to combining multiple relations between words by constructing a 3-way tensor. Similar to LSA, a low-rank approximation of the tensor is derived using a tensor decomposition. Each word in the vocabulary is thus represented by a vector in the latent semantic space, and each relation is captured by a latent square matrix. The degree of two words having a specific relation can then be measured through simple linear algebraic operations. We demonstrate that by integrating multiple relations from both homogeneous and heterogeneous information sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.

1 Introduction

Continuous semantic space representations have proven successful in a wide variety of NLP and IR applications, such as document clustering (Xu et al., 2003) and cross-lingual document retrieval (Dumais et al., 1997; Platt et al., 2010) at the document level, and sentential similarity (Guo and Diab, 2012; Guo and Diab, 2013) and syntactic parsing (Socher et al., 2013) at the sentence level. Such representations also play an important role in applications for lexical semantics, such as word sense disambiguation (Boyd-Graber et al., 2007), measuring word similarity (Deerwester et al., 1990) and relational similarity (Turney, 2006; Zhila et al., 2013; Mikolov et al., 2013). In many of these applications, Latent Semantic Analysis (LSA) (Deerwester et al., 1990) has been widely used, serving as a fundamental component or as a strong baseline.

LSA operates by mapping text objects, typically documents and words, to a latent semantic space. The proximity of the vectors in this space implies that the original text objects are semantically related. However, one well-known limitation of LSA is that it is unable to differentiate fine-grained relations. For instance, when applied to lexical semantics, synonyms and antonyms may both be assigned high similarity scores (Landauer and Laham, 1998; Landauer, 2002). Asymmetric relations like hyponyms and hypernyms also cannot be differentiated. There is some recent work, such as PILSA, which tries to overcome this weakness of LSA by introducing the notion of polarity (Yih et al., 2012). This extension, however, can only handle two opposing relations (e.g., synonyms and antonyms), leaving open the challenge of encoding multiple relations.

In this paper, we propose Multi-Relational Latent Semantic Analysis (MRLSA), which strictly generalizes LSA to incorporate information from multiple relations concurrently. Similar to LSA or PILSA when applied to lexical semantics, each word is still mapped to a vector in the latent space. However, when measuring whether two words have a specific relation (e.g., antonymy or is-a), the word vectors will be mapped to a new space according to the relation, where the degree of having this relation will be judged by cosine similarity.


The raw data construction in MRLSA is straightforward and similar to the document-term matrix in LSA. However, instead of using one matrix to capture all relations, we extend the representation to a 3-way tensor. Each slice corresponds to the document-term matrix in the original LSA design, but for a specific relation. Analogous to LSA, the whole linear transformation mapping is derived through tensor decomposition, which provides a low-rank approximation of the original tensor. As a result, previously unseen relations between two words can be discovered, and the information encoded in other relations can influence the construction of the latent representations, which potentially improves the overall quality. In addition, the information in different slices can come from heterogeneous sources (conceptually similar to (Riedel et al., 2013)), which not only improves the model, but also extends the word coverage in a reliable way.

We provide empirical evidence that MRLSA is effective using two different word relations: antonymy and is-a. We use the benchmark GRE test of closest-opposites (Mohammad et al., 2008) to show that MRLSA performs comparably to PILSA, the previous state-of-the-art approach on this problem, when given the same amount of information. In addition, when other words and relations are available, potentially from additional resources, MRLSA is able to outperform previous methods significantly. We use the is-a relation to demonstrate that MRLSA is capable of handling asymmetric relations. We take the list of word pairs from the Class-Inclusion (i.e., is-a) relations in SemEval-2012 Task 2 (Jurgens et al., 2012), and use our model to measure the degree to which two words have this relation. The measures derived from our model correlate with human judgement better than those of the best system that participated in the task.

The rest of this paper is organized as follows. We first survey some related work in Section 2, followed by a more detailed description of LSA and PILSA in Section 3. Our proposed model, MRLSA, is presented in Section 4. Section 5 presents our experimental results. Finally, Section 6 concludes the paper.

2 Related Work

MRLSA can be viewed as a model that derives general continuous space representations for capturing lexical semantics, with the help of tensor decomposition techniques. We highlight some recent work related to our approach.

The most commonly used continuous space representation of text is arguably the vector space model (VSM) (Turney and Pantel, 2010). In this representation, each text object can be represented by a high-dimensional sparse vector, such as a term-vector or a document-vector that denotes the statistics of term occurrences (Salton et al., 1975) in a large corpus. The text can also be represented by a low-dimensional dense vector derived by linear projection models like latent semantic analysis (LSA) (Deerwester et al., 1990), by discriminative learning methods like Siamese neural networks (Yih et al., 2011), recurrent neural networks (Mikolov et al., 2013) and recursive neural networks (Socher et al., 2011), or by graphical models such as probabilistic latent semantic analysis (PLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al., 2003). As a generalization of LSA, MRLSA is also a linear projection model. However, while words are still represented by vectors, multiple relations between words are captured separately by matrices.

In the context of lexical semantics, VSMs provide a natural way of measuring semantic word relatedness by computing the distance between the corresponding vectors, which has been a standard approach (Agirre et al., 2009; Reisinger and Mooney, 2010; Yih and Qazvinian, 2012). These approaches do not apply directly to the problem of modeling other types of relations. Existing methods that do handle multiple relations often use a model combination scheme to integrate signals from various types of information sources. For instance, morphological variations discovered from the Google n-gram corpus have been combined with information from thesauri and vector-based word relatedness models for detecting antonyms (Mohammad et al., 2008).

An alternative, uniform approach proposed by Turney (2008), which handles synonyms, antonyms and associations, first reduces the problem to determining whether two pairs of words are analogous, and then predicts this using a supervised model with features based on the frequencies of patterns in the corpus. Similarly, to measure whether two word pairs have the same relation, Zhila et al. (2013) proposed to combine heterogeneous models, which achieved state-of-the-art performance. In comparison, MRLSA models multiple lexical relations holistically. The degree to which two words have a particular relation is estimated using the same linear function of the corresponding vectors and matrix.

Tensor decomposition generalizes matrix factorization and has been applied to several NLP applications recently. For example, Cohen et al. (2013) proposed an approximation algorithm for PCFG parsing that relies on Kruskal decomposition. Van de Cruys et al. (2013) modeled the composition of subject-verb-object triples using Tucker decomposition, which results in a better similarity measure for transitive phrases. Similar to this construction, but in the community-based question answering (CQA) scenario, Qiu et al. (2013) represented triples of question title, question content and answer as a tensor and applied 3-mode SVD to derive latent semantic representations for question matching. The construction of MRLSA bears some resemblance to this work that uses tensors to capture triples. However, our goal of modeling different relations for lexical semantics is very different from the intended usage of tensor decomposition in the existing work.

3 Latent Semantic Analysis

Latent Semantic Analysis (LSA) (Deerwester et al., 1990) is a widely used continuous vector space model that maps words and documents into a low-dimensional space. LSA consists of two main steps. First, taking a collection of d documents that contains words from a vocabulary list of size n, it constructs a d × n document-term matrix W to encode the occurrence information of a word in a document. For instance, in its simplest form, the element W_{i,j} can be the term frequency of the j-th word in the i-th document. In practice, a weighting scheme that better captures the importance of a word in the document, such as TF×IDF (Salton et al., 1975), is often used instead. Notice that "document" here simply means a group of words; the scheme has been applied to various texts including news articles, sentences and bags of words. Once the matrix is constructed, the second step is to apply singular value decomposition (SVD) to W in order to derive a low-rank approximation. For a rank-k approximation, the reconstruction matrix X of W is defined as

    W ≈ X = U Σ V^T,    (1)

where the dimensions of U and V are d × k and n × k, respectively, and Σ is a k × k diagonal matrix. In addition, the columns of U and V are orthonormal, and the elements of Σ are the singular values, conventionally sorted in decreasing order. Figure 1 illustrates this decomposition.

Figure 1: SVD applied to a d × n document-term matrix W. The rank-k approximation X is the multiplication of U, Σ and V^T, where U and V are d × k and n × k orthonormal matrices and Σ is a k × k diagonal matrix. The column vectors of V^T multiplied by the singular values Σ represent words in the latent semantic space.

LSA can be used to compute the similarity between two documents or two words in the latent space. For instance, to compare the u-th and v-th words in the vocabulary, one can compute the cosine similarity of the u-th and v-th column vectors of X, the reconstruction matrix of W. In contrast to a direct lexical matching via the columns of W, the similarity score computed after the SVD may be nonzero even if the two words do not co-occur in any document. This is because those words can share some latent components.
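To make the two steps concrete, the following is a minimal numpy sketch of LSA on a toy 0/1 document-term matrix. The corpus, the raw-count weighting (instead of TF×IDF) and the rank k are illustrative choices, not the setup used in the paper; word vectors are taken as the columns of ΣV^T, a view elaborated just below.

```python
import numpy as np

# Toy d x n document-term matrix W; a real system would fill it with
# TF-IDF weights computed over a large corpus.
vocab = ["joy", "gladden", "sorrow", "sadden", "anger"]
W = np.array([
    [1, 1, 0, 0, 0],   # "document" 1: joy-related words
    [1, 1, 1, 0, 0],   # "document" 2
    [0, 0, 1, 1, 0],   # "document" 3: sorrow-related words
    [0, 0, 0, 0, 1],   # "document" 4: anger-related words
], dtype=float)

# Step 2: rank-k approximation X = U Sigma V^T (Eq. 1); k is illustrative.
k = 2
U, s, Vt = np.linalg.svd(W, full_matrices=False)
word_vecs = np.diag(s[:k]) @ Vt[:k, :]     # columns of Sigma V^T, one per word

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Cosine over the latent vectors (Eq. 3); may be nonzero even for words
# that never co-occur in the same document, e.g. "joy" and "sadden".
i, j = vocab.index("joy"), vocab.index("sadden")
print(cosine(word_vecs[:, i], word_vecs[:, j]))
```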

An alternative view of LSA is to treat the column vectors of ΣV^T as representations of the words in a new k-dimensional latent space. This comes from the observation that the inner product of any two column vectors of X equals the inner product of the corresponding column vectors of ΣV^T, which can be derived from the equations below:

    X^T X = (U Σ V^T)^T (U Σ V^T)
          = V Σ U^T U Σ V^T        (Σ is diagonal)
          = V Σ² V^T               (columns of U are orthonormal)
          = (Σ V^T)^T (Σ V^T).     (2)

Thus, the semantic relatedness between the i-th and j-th words can be computed by cosine similarity:¹

    cos(X_{:,i}, X_{:,j}).    (3)

¹ Cosine similarity is equivalent to the inner product of the normalized vectors.

When used to compare words, one well-known limitation of LSA is that the score captures only the general notion of semantic relatedness and is unable to distinguish fine-grained word relations, such as antonyms (Landauer and Laham, 1998; Landauer, 2002). This is due to the fact that the raw matrix representation only records the occurrences of words in documents, without knowing the specific relation between the word and the document. To address this issue, Yih et al. (2012) recently proposed a polarity inducing latent semantic analysis model, which we introduce next.

3.1 Polarity Inducing Latent Semantic Analysis

In order to distinguish antonyms from synonyms, the polarity inducing LSA (PILSA) model (Yih et al., 2012) takes a thesaurus as input. Synonyms and antonyms of the same target word are grouped together as a "document", and a document-term matrix is constructed accordingly as done in LSA. Because each word in a group belongs to one of the two opposite relations, synonymy or antonymy, the polarity information is induced by flipping the signs of antonyms. While the absolute value of each element in the matrix is still the same TF×IDF score, the elements that correspond to antonyms become negative.

This design has an intriguing effect. When comparing two words using the cosine similarity (or simply the inner product) of their corresponding column vectors in the matrix, the score of a synonym pair remains positive, but the score of an antonym pair becomes negative. Figure 2 illustrates this design using a simplified matrix as an example.

Figure 2: The matrix construction of PILSA. The vocabulary is {joy, gladden, sorrow, sadden, anger, emotion, feeling} and the target words are {joyfulness, gladden, sad, anger}. For ease of presentation, we show the numbers with 0/1 values instead of TF×IDF scores. The polarity (i.e., sign) indicates whether the term in the vocabulary is a synonym or antonym of the target word.

                  joy  gladden  sorrow  sadden  anger  emotion  feeling
    joyfulness     1      1       -1       0      0       0        0
    gladden        1      1        0      -1      0       0        0
    sad           -1      0        1       1      0       0        0
    anger          0      0        0       0      1       0        0

Once the matrix is constructed, PILSA applies SVD as done in LSA, which generalizes the model to go beyond lexical matching. The sign of the cosine score of the column vectors of any two words indicates whether they are closer to synonyms or to antonyms, and the absolute value reflects the degree of the relation. When all the column vectors are normalized to unit vectors, this can also be viewed as synonyms being clustered together while antonyms lie on opposite sides of a unit sphere. Although PILSA successfully extends LSA to handle more than a single occurrence relation, the extension is limited to encoding two opposing relations.
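As a rough illustration of this construction, the sketch below builds the sign-flipped matrix of Figure 2 from a toy thesaurus and applies SVD. The 0/1 weights, the rank, and the tiny vocabulary are illustrative stand-ins for the TF×IDF-weighted Encarta data used in the paper.

```python
import numpy as np

# Toy thesaurus: each target word forms one "document" whose synonyms get
# positive weights and whose antonyms get negative weights (sign flipping).
vocab = ["joy", "gladden", "sorrow", "sadden", "anger", "emotion", "feeling"]
thesaurus = {
    "joyfulness": {"syn": ["joy", "gladden"], "ant": ["sorrow"]},
    "gladden":    {"syn": ["joy", "gladden"], "ant": ["sadden"]},
    "sad":        {"syn": ["sorrow", "sadden"], "ant": ["joy"]},
    "anger":      {"syn": ["anger"], "ant": []},
}

idx = {w: i for i, w in enumerate(vocab)}
W = np.zeros((len(thesaurus), len(vocab)))
for r, entry in enumerate(thesaurus.values()):
    for w in entry["syn"]:
        W[r, idx[w]] = 1.0      # a real system would use a TFxIDF weight here
    for w in entry["ant"]:
        W[r, idx[w]] = -1.0     # polarity-inducing sign flip

# SVD as in LSA; word vectors are the columns of Sigma V^T.
k = 3
U, s, Vt = np.linalg.svd(W, full_matrices=False)
vecs = np.diag(s[:k]) @ Vt[:k, :]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Positive score suggests synonymy, negative score suggests antonymy.
print(cos(vecs[:, idx["joy"]], vecs[:, idx["gladden"]]))   # expected > 0
print(cos(vecs[:, idx["joy"]], vecs[:, idx["sorrow"]]))    # expected < 0
```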

4 Multi-Relational Latent Semantic Analysis

The fundamental difficulty in handling multiple relations comes from the 2-dimensional matrix representation. In order to overcome this, we encode the raw data in a 3-way tensor, where each slice captures a particular relation and is in the format of the document-term matrix in LSA. Just as in LSA, where the low-rank approximation by SVD helps generalize the representation and discover unseen relations, we apply a tensor decomposition method, the Tucker decomposition, to the tensor.

4.1 Representing Multi-Relational Data in Tensors

A tensor is simply a multi-dimensional array. In this work, we use a 3-way tensor W to encode multiple word relations. An element of W is denoted by W_{i,j,k} using its indices, and W_{:,:,k} represents the k-th slice of W (a slice of a 3-way tensor is a matrix, obtained by fixing the third index). Following (Kolda and Bader, 2009), a fiber of a tensor, such as W_{:,j,k}, is a vector, which is a higher-order analog of a matrix row or column.

When constructing the raw tensor W in MRLSA, each slice is analogous to the document-term matrix in LSA, but is created based on the data of a particular relation, such as synonyms. With a slight abuse of notation, we sometimes use the value rather than the index when there is no confusion. For instance, W_{:,"word",k} represents the fiber corresponding to "word" in slice k, and W_{:,:,syn} refers to the slice that encodes the synonymy relation. Below we use an example to compare this construction to the raw matrix in PILSA, and discuss how it extends LSA.

Suppose we are interested in representing two relations, synonymy and antonymy. The raw tensor in MRLSA would then consist of two slices, W_{:,:,syn} and W_{:,:,ant}, to encode synonyms and antonyms of target words from a knowledge source (e.g., a thesaurus). Each row in W_{:,:,syn} represents the synonyms of a target word, and the corresponding row in W_{:,:,ant} encodes its antonyms. Figures 3(a) and 3(b) illustrate an example, where "joy" and "gladden" are synonyms of the target word "joyfulness" and "sorrow" is its antonym; therefore, the values of the corresponding entries are 1. Notice that the difference W_{:,:,syn} − W_{:,:,ant} is identical to the PILSA raw matrix. We can extend the construction above to enable MRLSA to utilize other semantic relations (e.g., hypernymy) by adding a slice corresponding to each relation of interest. Fig. 3(c) demonstrates how to add another slice, W_{:,:,hyper}, to the tensor for encoding hypernyms.

Figure 3: The three slices of the MRLSA raw tensor W for an example with vocabulary {joy, gladden, sorrow, sadden, anger, emotion, feeling} and target words {joyfulness, gladden, sad, anger}. Figures 3(a), 3(b), 3(c) show the matrices W_{:,:,syn}, W_{:,:,ant}, W_{:,:,hyper}, respectively. Rows represent documents (see definition in text), and columns represent words. For ease of presentation, we show numbers with 0/1 values instead of TF×IDF scores.

    (a) Synonym layer     joy  gladden  sorrow  sadden  anger  emotion  feeling
        joyfulness         1      1       0       0       0       0        0
        gladden            1      1       0       0       0       0        0
        sad                0      0       1       1       0       0        0
        anger              0      0       0       0       1       0        0

    (b) Antonym layer
        joyfulness         0      0       1       0       0       0        0
        gladden            0      0       0       1       0       0        0
        sad                1      0       0       0       0       0        0
        anger              0      0       0       0       0       0        0

    (c) Hypernym layer
        joyfulness         0      0       0       0       0       1        1
        gladden            0      0       0       0       0       0        0
        sad                0      0       0       0       0       1        1
        anger              0      0       0       0       0       1        1

4.2 Tensor Decomposition

The MRLSA raw tensor encodes relations from one or more data resources, such as thesauri. However, the knowledge from a thesaurus is usually noisy and incomplete. In this section, we derive a low-rank approximation of the tensor to generalize the knowledge. This step is analogous to the rank-k approximation in LSA.
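Before turning to the decomposition itself, here is a small numpy sketch of the raw-tensor construction of Section 4.1, using the toy relation lists from Figure 3. Real slices would be filled with TF×IDF weights from Encarta and WordNet rather than 0/1 entries.

```python
import numpy as np

vocab = ["joy", "gladden", "sorrow", "sadden", "anger", "emotion", "feeling"]
targets = ["joyfulness", "gladden", "sad", "anger"]
relations = ["syn", "ant", "hyper"]          # one slice per relation

# Toy knowledge source mirroring Figure 3.
data = {
    "syn":   {"joyfulness": ["joy", "gladden"], "gladden": ["joy", "gladden"],
              "sad": ["sorrow", "sadden"], "anger": ["anger"]},
    "ant":   {"joyfulness": ["sorrow"], "gladden": ["sadden"], "sad": ["joy"]},
    "hyper": {"joyfulness": ["emotion", "feeling"], "sad": ["emotion", "feeling"],
              "anger": ["emotion", "feeling"]},
}

w_idx = {w: i for i, w in enumerate(vocab)}
t_idx = {t: i for i, t in enumerate(targets)}
k_idx = {r: i for i, r in enumerate(relations)}

# Raw tensor W: d targets x n words x m relations; W[:, :, k] is one slice.
W = np.zeros((len(targets), len(vocab), len(relations)))
for rel, entries in data.items():
    for target, words in entries.items():
        for w in words:
            W[t_idx[target], w_idx[w], k_idx[rel]] = 1.0

# The difference of the synonym and antonym slices recovers the
# PILSA-style polarity matrix described above.
pilsa_like = W[:, :, k_idx["syn"]] - W[:, :, k_idx["ant"]]
print(pilsa_like)
```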

Various tensor decomposition methods have been proposed in the literature. Among them, Tucker decomposition (Tucker, 1966) is recognized as a multi-dimensional extension of SVD and has been widely used in many applications. An illustration of this method is in Fig. 4(a). In Tucker decomposition, a d × n × m tensor W is decomposed into four components G, U, V, T. A low-rank approximation X of W is defined by

    W_{i,j,k} ≈ X_{i,j,k} = Σ_{r1=1}^{R1} Σ_{r2=1}^{R2} Σ_{r3=1}^{R3} G_{r1,r2,r3} U_{i,r1} V_{j,r2} T_{k,r3},

where G is a core tensor with dimensions R1 × R2 × R3, and U, V, T are orthogonal matrices with dimensions d × R1, n × R2, m × R3, respectively. The rank parameters R1 ≤ d, R2 ≤ n, R3 ≤ m are given as input to the algorithm. In MRLSA, m (the number of relations) is usually small, while d and n are typically large (often on the scale of hundreds of thousands). Therefore, we choose R1 = R2 = τ with τ ≪ d, n, and R3 = m, where τ is typically less than 1000.

To make the analogy to SVD clear, we rewrite the result of Tucker decomposition by performing an n-mode matrix product over the core tensor G with the matrix T. This produces a tensor S where each slice is a linear combination of the slices of G, with coefficients given by T (see (Kolda and Bader, 2009) for details). That is, we have

    S_{:,:,k} = Σ_{t=1}^{m} T_{t,k} G_{:,:,t},  ∀k.

An illustration is shown in Fig. 4(b). Then, a straightforward calculation shows that the k-th slice of tensor W is approximated by

    W_{:,:,k} ≈ X_{:,:,k} = U S_{:,:,k} V^T.    (4)

Figure 4: Fig. 4(a) illustrates the Tucker tensor decomposition method, which factors a 3-way tensor W into three orthogonal matrices, U, V, T, and a core tensor G. We further apply an n-mode matrix product on the core tensor G with T. Consequently, each slice of the resulting core tensor S (a square matrix) captures a semantic relation type, and each column of V^T is a vector representing a word.

Comparing Eq. (4) to Eq. (1), one can observe that the matrices U and V play similar roles here, and each slice of the core tensor S is analogous to Σ. However, the square matrix S_{:,:,k} is not necessarily diagonal. As in SVD, the column vectors of S_{:,:,k}V^T (which capture both word and relation information) behave similarly to the column vectors of the original tensor slice W_{:,:,k}.

4.3 Measuring the Degrees of Word Relations

In principle, the raw information in the input tensor W can be used for computing lexical similarity using the cosine score between the column vectors of two words from the same slice of the tensor. To measure the degree of other relations, however, our approach requires one to specify a pivot slice. The key role of the pivot slice is to expand the lexical coverage of the relation of interest to additional lexical entries and, for this reason, the pivot slice should be chosen to capture the equivalence of the lexical entries. In this paper, we use the synonymy relation as our pivot slice. First we consider measuring the degree of a relation rel holding between the i-th and j-th words using the raw tensor W, which can be computed as

    cos(W_{:,i,syn}, W_{:,j,rel}).    (5)

This measurement can be motivated by the logical rule syn(word_i, target) ∧ rel(target, word_j) → rel(word_i, word_j), where the pivot relation syn expands the coverage of the relation of interest rel.

Turning to the use of the tensor decomposition, we use a derivation similar to Eq. (3) and measure the degree of relation rel between two words by

    cos(S_{:,:,syn} V_{i,:}^T, S_{:,:,rel} V_{j,:}^T).    (6)
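The following numpy sketch shows one way the decomposition and the scoring of Eqs. (4)–(6) fit together. It uses a simple truncated HOSVD as a stand-in for the memory-efficient Tucker routine of the Tensor Toolbox used in the paper, a random tensor as a placeholder for the raw data, and an assumed slice ordering (0 = synonym, 1 = antonym).

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def mode_dot(X, M, mode):
    """n-mode product X x_mode M, where M has shape (new_dim, X.shape[mode])."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mode)), 0, mode)

def tucker_hosvd(W, ranks):
    """Truncated HOSVD: a simple, non-iterative Tucker decomposition."""
    factors = []
    for mode, r in enumerate(ranks):
        U_full, _, _ = np.linalg.svd(unfold(W, mode), full_matrices=False)
        factors.append(U_full[:, :r])
    G = W
    for mode, F in enumerate(factors):
        G = mode_dot(G, F.T, mode)          # core tensor G
    return G, factors                       # W ~= G x1 U x2 V x3 T

d, n, m, tau = 4, 7, 3, 3                   # placeholder sizes; tau is the rank parameter
W = np.random.rand(d, n, m)                 # stands in for the raw tensor built above
G, (U, V, T) = tucker_hosvd(W, (tau, tau, m))

S = mode_dot(G, T, 2)                       # S = G x3 T, so X[:, :, k] = U S[:, :, k] V^T (Eq. 4)

def relation_score(i, j, rel, pivot=0):
    """Degree of relation `rel` between words i and j (Eq. 6); `pivot` is the synonym slice."""
    a = S[:, :, pivot] @ V[i, :]            # word i mapped through the pivot slice
    b = S[:, :, rel] @ V[j, :]              # word j mapped through the relation slice
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def symmetric_score(i, j, rel):
    """Average of the two directions, as used for symmetric relations such as antonymy."""
    return 0.5 * (relation_score(i, j, rel) + relation_score(j, i, rel))

print(symmetric_score(0, 2, rel=1))         # assumed slice order: 0 = syn, 1 = ant, 2 = hyper
```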

For instance, the degree of antonymy between "joy" and "sorrow" is measured by the cosine similarity between the respective fibers, cos(X_{:,"joy",syn}, X_{:,"sorrow",ant}). We can encode both symmetric relations (e.g., antonymy and synonymy) and asymmetric relations (e.g., hypernymy and hyponymy) in the same tensor representation. For a symmetric relation, we use both cos(X_{:,i,syn}, X_{:,j,rel}) and cos(X_{:,j,syn}, X_{:,i,rel}), and measure the degree of the relation by the average of these two cosine similarity scores. For asymmetric relations, however, we use only cos(X_{:,i,syn}, X_{:,j,rel}).

5 Experiments

We evaluate MRLSA on two tasks: answering the closest-opposite GRE questions and measuring degrees of various class-inclusion (i.e., is-a) relations. In both tasks, we design the experiments to empirically validate the following claims. When encoding two opposite relations from the same source, MRLSA performs comparably to PILSA. However, MRLSA generalizes LSA to model multiple relations, which can be obtained from both homogeneous and heterogeneous data sources. As a result, the performance on a target task can be further improved.

5.1 Experimental Setup

We construct the raw tensors to encode a particular relation in each slice based on two data sources.

Encarta The Encarta thesaurus is developed by Bloomsbury Publishing Plc.² For each target word, it provides a list of synonyms and antonyms. We use the same version of the thesaurus as in (Yih et al., 2012), which contains about 47k words and a vocabulary list of approximately 50k words.

² http://www.bloomsbury.com

WordNet We use four types of relations from WordNet: synonymy, antonymy, hypernymy and hyponymy. The number of target words and the size of the vocabulary in our version are 117,791 and 149,400, respectively. WordNet has better vocabulary coverage but fewer antonym pairs. For instance, the WordNet antonym slice contains only 46,945 nonzero entries, while the Encarta antonym slice has 129,733.

We apply a memory-efficient Tucker decomposition algorithm (Kolda and Sun, 2008) implemented in Tensor Toolbox v2.5 (Bader et al., 2012)³ to factor the tensor. The largest tensor considered in this paper can be decomposed in about 3 hours using less than 4GB of memory on a commodity PC.

³ http://www.sandia.gov/~tgkolda/TensorToolbox. The Tucker decomposition involves performing SVD on a large matrix. We modify the MATLAB code of the tensor toolbox to use the built-in svd function instead of svds. This modification reduces both the running time and memory usage.

5.2 Answering GRE Antonym Questions

The first task is to answer the closest-opposite questions from the GRE test provided by Mohammad et al. (2008).⁴ Each question in this test consists of a target word and five candidate words, where the goal is to pick the candidate word that has the most opposite meaning to the target word. In order to have a fair comparison, we use the same data split as in (Mohammad et al., 2008), with 162 questions used for the development set and 950 for test. Following (Mohammad et al., 2008; Yih et al., 2012), we report the results in precision (accuracy on the questions that the system attempts to answer), recall (percentage of questions answered correctly over all questions) and F1 (the harmonic mean of precision and recall).

⁴ http://www.saifmohammad.com

We tune two sets of parameters using the development set: (1) the rank parameter τ in the tensor decomposition and (2) the scaling factors of the different slices of the tensor. The rank parameter specifies the number of dimensions of the latent space. In the experiments, we pick the best value of τ from {100, 200, 300, 500, 750, 1000}. The scaling factors adjust the values of each slice of the tensor: the elements of each slice are multiplied by the scaling factor before factorization. This is important because Tucker decomposition minimizes the reconstruction error (the Frobenius norm of the residual tensor); as a result, a slice with a larger range of values becomes more influential on U and V. In this work, we fix the scaling factor of W_{:,:,ant}, and search for the scaling factor of W_{:,:,syn} in {0.25, 0.5, 1, 2, 4} and the factors of W_{:,:,hyper} and W_{:,:,hypo} in {0.0625, 0.125, 0.25}.
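As a sketch of how the closest-opposite questions could be answered and scored with the precision/recall/F1 definitions above, consider the following. Here `antonymy_score(t, c)` is a stand-in for the MRLSA antonymy measure and is assumed to return None when either word is out of the vocabulary; the question format is an assumption, not the dataset's actual format.

```python
def answer_question(target, candidates, antonymy_score):
    """Return the candidate judged most opposite to `target`, or None if none is covered."""
    scored = []
    for cand in candidates:
        score = antonymy_score(target, cand)
        if score is not None:
            scored.append((score, cand))
    return max(scored)[1] if scored else None

def evaluate(questions, antonymy_score):
    """`questions` is a list of (target, candidates, gold_answer) triples."""
    attempted, correct = 0, 0
    for target, candidates, gold in questions:
        guess = answer_question(target, candidates, antonymy_score)
        if guess is None:
            continue                       # unanswered questions lower recall only
        attempted += 1
        correct += int(guess == gold)
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(questions) if questions else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```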

                                 Dev. Set              Test Set
                             Prec.  Rec.   F1      Prec.  Rec.   F1
    WordNet Lookup           0.40   0.40  0.40     0.42   0.41  0.42
    WordNet RawTensor        0.42   0.41  0.42     0.42   0.41  0.42
    WordNet PILSA            0.63   0.62  0.62     0.60   0.60  0.60
    WordNet MRLSA:Syn+Ant    0.63   0.62  0.62     0.59   0.58  0.59
    WordNet MRLSA:4-layers   0.66   0.65  0.65     0.61   0.59  0.60
    Encarta Lookup           0.65   0.61  0.63     0.61   0.56  0.59
    Encarta RawTensor        0.67   0.64  0.65     0.62   0.57  0.59
    Encarta PILSA            0.86   0.81  0.84     0.81   0.74  0.77
    Encarta MRLSA:Syn+Ant    0.87   0.82  0.84     0.82   0.74  0.78
    MRLSA:WordNet+Encarta    0.88   0.85  0.87     0.81   0.77  0.79

Table 1: GRE antonym test results of models based on Encarta and WordNet data in precision, recall and F1. RawTensor evaluates the performance of the tensor with 2 slices encoding synonyms and antonyms be- fore decomposition (see Eq. (5)), which is comparable to checking the original data directly (Lookup). MRLSA:Syn+Ant applies Tucker decomposition to the raw tensor and measures the degree of antonymy using Eq. (6). The result is similar to that of PILSA (see Sec. 3.1). MRLSA:4-layers adds hypernyms and hyponyms from WordNet; MRLSA:WordNet+Encarta consists of synonyms/antonyms from Encarta and hy- pernyms/hyponyms from WordNet, where the target words are aligned using the synonymy relations. Both models demonstrate the advantage of encoding more relations, from either the same or different resources.

Table 1 summarizes the results of training MRLSA using two different corpora, Encarta and WordNet. The performance of the MRLSA raw tensor is close to that of looking up the thesaurus. This indicates that the tensor representation is able to capture the word relations explicitly described in the thesaurus. After conducting tensor decomposition, MRLSA:Syn+Ant achieves results similar to PILSA. This confirms our claim that, when given the same amount of information, MRLSA performs at least comparably to PILSA. However, the true power of MRLSA is its ability to incorporate other semantic relations to boost the performance of the target task. For example, when we add the hypernymy and hyponymy relations to the tensor, these class-inclusion relations provide a weak signal that helps resolve antonymy. We suspect that this is because antonyms typically share the same properties but have opposite meanings along one particular semantic dimension. For instance, the antonyms "sadness" and "happiness" are both forms of emotion. When two words are hyponyms of a common target word, the likelihood that they are antonyms should thus be increased. We show that the target relations and these auxiliary semantic relations can be collected from the same data source (e.g., WordNet MRLSA:4-layers) or from multiple, heterogeneous sources (e.g., MRLSA:WordNet+Encarta). In both cases, the performance of the model improves as more relations are incorporated. Moreover, our experiments show that adding the hypernym and hyponym layers from WordNet improves modeling antonym relations based on the Encarta thesaurus. This suggests that the weak signal from a resource with a large vocabulary (e.g., WordNet) can help predict relations between out-of-vocabulary words and thus improve recall.

To better understand the model, we examine the top antonyms for three question words from the GRE test. The lists below show antonyms and their MRLSA scores for each of the GRE question words, as determined by the MRLSA:WordNet+Encarta model. Antonyms that can be found directly in the Encarta thesaurus are in italics.

    inanimate: alive (0.91), living (0.90), bodily (0.90), in-the-flesh (0.89), incarnate (0.89)
    alleviate: exacerbate (0.68), make-worse (0.67), inflame (0.66), amplify (0.65), stir-up (0.64)
    relish: detest (0.33), abhor (0.33), abominate (0.33), despise (0.33), loathe (0.31)

                                        Dev.                        Test
                                   1a (Taxonomic)  1b (Functional)  1c (Singular)  1d (Plural)  Avg.
    WordNet Lookup                      52.9           34.5             41.4           34.3       36.7
    WordNet RawTensor                   51.0           38.3             50.0           42.1       43.5
    WordNet MRLSA:Syn+Hypo              55.8        41.7 (43.2)      51.0 (51.4)    37.5 (44.4)  43.4 (46.3)
    WordNet MRLSA:4-layers              52.9        51.5 (53.9)      51.9 (60.0)    43.5 (50.5)  49.0 (54.8)
    MRLSA:WordNet+Encarta               62.1        55.3 (58.7)      57.1 (65.7)    48.6 (53.7)  55.8 (60.1)
    UTDNB (Rink and Harabagiu, 2012)      -            38.3             36.7           28.2       34.4

Table 2: Results of measuring the class-inclusion (is-a) relations in MaxDiff accuracy (see text for details). RawTensor has synonym and hyponym slices and measures the degree of the is-a relation using Eq. (5). MRLSA:Syn+Hypo factors the raw tensor and judges the relation by Eq. (6). The constructions of MRLSA:4-layers and MRLSA:WordNet+Encarta are the same as in Sec. 5.2 (see the caption of Table 1 for details). For MRLSA models, numbers shown in parentheses are the results when parameters are tuned on the test sets. UTDNB is the result of the best performing system in SemEval-2012 Task 2.

We can see from these examples that MRLSA not only preserves the antonyms in the thesaurus, but also discovers additional ones, such as exacerbate and inflame for "alleviate". Another interesting finding is that while the scores are useful in ranking the candidate words, they might not be comparable across different question words. This could be an issue for applications that need to make a binary decision on whether two words are antonyms.

5.3 Measuring degrees of Is-A relations

We evaluate MRLSA using the class-inclusion portion of the SemEval-2012 Task 2 data (Jurgens et al., 2012). Here the goal is to measure the degree to which two words have the is-a relation. Five annotated datasets are provided for different subcategories of this relation: 1a-taxonomic, 1b-functional, 1c-singular, 1d-plural, 1e-class individual. We omit 1e because it focuses on real-world entities (e.g., queen:Elizabeth, river:Nile), which are not included in WordNet.

Each dataset contains about 100 questions based on approximately 40 word pairs. Each question consists of 4 randomly chosen word pairs and asks for the best and worst pairs that exemplify the specific is-a relation. The performance is measured by the average prediction accuracy, also called the MaxDiff accuracy (Louviere and Woodworth, 1991).
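A rough sketch of this evaluation is given below. It assumes each question provides four pairs plus annotated best/worst choices and counts how often the model's highest- and lowest-scored pairs agree with them; this mirrors, but may not exactly reproduce, the official SemEval-2012 scoring script. `isa_score` is a stand-in for the MRLSA is-a measure of Eq. (6).

```python
def maxdiff_accuracy(questions, isa_score):
    """
    questions: list of dicts with keys
      'pairs' : four (hyponym, hypernym) word pairs,
      'best'  : the pair annotators chose as the best example of is-a,
      'worst' : the pair annotators chose as the worst example.
    isa_score(x, y): the model's degree of "x is-a y".
    """
    hits, total = 0, 0
    for q in questions:
        ranked = sorted(q["pairs"], key=lambda p: isa_score(*p))
        predicted_worst, predicted_best = ranked[0], ranked[-1]
        hits += int(predicted_best == q["best"]) + int(predicted_worst == q["worst"])
        total += 2                     # one best and one worst judgement per question
    return hits / total if total else 0.0
```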
Because the questions are generated from the same set of word pairs, they are not mutually independent, and it is therefore not appropriate to split the data of each subcategory into development and test sets. Instead, we follow the setting of SemEval-2012 Task 2 and use the first subcategory (1a-taxonomic) to tune the model, and evaluate its performance based on the results on the other datasets. Since the models are tuned and tested on different types of subcategories, the tuned parameters might not be optimal when evaluated on the test sets. Therefore, we show results using the best parameters tuned on the development set as well as those tuned on the test sets, where the latter suggests a performance upper bound. Besides the rank parameter, we tune the scaling factors of the synonym, hypernym and hyponym slices from {4, 16, 64}. The scaling factor of the antonym slice is fixed to 1.

Table 2 shows the performance in MaxDiff accuracy. Results show that even the raw tensor representation (RawTensor) performs better than WordNet lookup. We suspect that this is because the tensor representation can capture the fact that the hyponyms of a word are usually synonymous to each other. By performing Tucker decomposition on the raw tensor, MRLSA achieves better performance. MRLSA:4-layers further leverages the information from antonyms and hypernyms and thus improves the model. As we noticed in the GRE antonym test, models based on the Encarta thesaurus perform better in predicting antonyms. Therefore, it is interesting to check whether combining synonyms and antonyms from Encarta helps. Indeed, MRLSA:WordNet+Encarta improves over MRLSA:4-layers significantly. This demonstrates again that MRLSA can leverage knowledge stored in heterogeneous resources.

Notably, MRLSA outperforms the best system that participated in the SemEval-2012 task by a large margin, with a difference of 21.4 in MaxDiff accuracy.

Next we examine the top words that have the is-a relation relative to three question words from the task. The lists below show the hyponyms and their respective MRLSA scores for each of the question words, as determined by MRLSA:4-layers.

    bird: ostrich (0.75), gamecock (0.75), nighthawk (0.75), amazon (0.74), parrot (0.74)
    automobile: minivan (0.48), wagon (0.48), taxi (0.46), minicab (0.45), gypsy cab (0.45)
    vegetable: buttercrunch (0.61), yellow turnip (0.61), romaine (0.61), chipotle (0.61), chilli (0.61)

Although the model in general does a good job of finding hyponyms, we observe that some suggested words, such as buttercrunch (a mild lettuce) for "vegetable", do not seem intuitive (e.g., compared to carrot). Having one additional slice to capture the general term co-occurrence relation may help improve the model in this respect.

6 Conclusions

In this paper, we propose Multi-Relational Latent Semantic Analysis (MRLSA), which generalizes Latent Semantic Analysis (LSA) for lexical semantics. MRLSA models multiple word relations by leveraging a 3-way tensor, where each slice captures one particular relation. A low-rank approximation of the tensor is then derived using a tensor decomposition. Consequently, words in the vocabulary are represented by vectors in the latent semantic space, and each relation is captured by a latent square matrix. Given two words, MRLSA not only can measure their degree of having a specific relation, but can also discover unknown relations between them. These advantages have been demonstrated in our experiments. By encoding relations from both homogeneous and heterogeneous data sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.

For future work, we plan to explore directions that aim at improving both the quality and the word coverage of the model. For instance, the knowledge encoded by MRLSA can be enriched by adding more relations from a variety of linguistic resources, including co-occurrence relations from large corpora. On model refinement, we notice that MRLSA can be viewed as a 3-layer neural network without the sigmoid function. Following the strategy of using Siamese neural networks to enhance PILSA (Yih et al., 2012), training MRLSA in a multi-task discriminative learning setting can be a promising approach as well.

Acknowledgments

We thank Geoff Zweig for valuable discussions and the anonymous reviewers for their comments.

References

E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In NAACL '09, pages 19–27.

Brett W. Bader, Tamara G. Kolda, et al. 2012. MATLAB Tensor Toolbox version 2.5. Available online, January.

David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jordan L. Boyd-Graber, David M. Blei, and Xiaojin Zhu. 2007. A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024–1033.

Shay B. Cohen, Giorgio Satta, and Michael Collins. 2013. Approximate PCFG parsing using tensor decomposition. In NAACL-HLT 2013, pages 487–496.

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(96).

S. Dumais, T. Letsche, M. Littman, and T. Landauer. 1997. Automatic cross-language retrieval using latent semantic indexing. In AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval.

Weiwei Guo and Mona Diab. 2012. Modeling sentences in the latent space. In ACL 2012, pages 864–872.

Weiwei Guo and Mona Diab. 2013. Improving lexical semantics for sentential semantics: Modeling selectional preference and similar words in a latent variable model. In NAACL-HLT 2013, pages 739–745.

Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence, pages 289–296.

D. Jurgens, S. Mohammad, P. Turney, and K. Holyoak. 2012. SemEval-2012 Task 2: Measuring degrees of relational similarity. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 356–364.

Tamara G. Kolda and Brett W. Bader. 2009. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September.

Tamara G. Kolda and Jimeng Sun. 2008. Scalable tensor decompositions for multi-aspect data mining. In ICDM 2008, pages 363–372.

T. Landauer and D. Laham. 1998. Learning human-like knowledge by singular value decomposition: A progress report. In NIPS 1998.

T. Landauer. 2002. On the computational basis of learning and cognition: Arguments from LSA. Psychology of Learning and Motivation, 41:43–84.

Jordan J. Louviere and G. G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. Technical report, University of Alberta.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In NAACL-HLT 2013.

Saif Mohammad, Bonnie Dorr, and Graeme Hirst. 2008. Computing word pair antonymy. In Empirical Methods in Natural Language Processing (EMNLP).

John Platt, Kristina Toutanova, and Wen-tau Yih. 2010. Translingual document representations from discriminative projections. In Proceedings of EMNLP, pages 251–261.

Xipeng Qiu, Le Tian, and Xuanjing Huang. 2013. Latent semantic tensor indexing for community-based question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 434–439, Sofia, Bulgaria, August. Association for Computational Linguistics.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Proceedings of HLT-NAACL, pages 109–117.

Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL-HLT 2013, pages 74–84.

Bryan Rink and Sanda Harabagiu. 2012. UTD: Determining relational similarity using lexical patterns. In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 413–418, Montréal, Canada, 7-8 June. Association for Computational Linguistics.

G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18(11).

Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. 2011. Parsing natural scenes and natural language with recursive neural networks. In ICML '11.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Annual Meeting of the Association for Computational Linguistics (ACL).

Ledyard R. Tucker. 1966. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

P. D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416.

Peter Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. In International Conference on Computational Linguistics (COLING).

Tim Van de Cruys, Thierry Poibeau, and Anna Korhonen. 2013. A tensor-based factorization model of semantic compositionality. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1142–1151, Atlanta, Georgia, June. Association for Computational Linguistics.

Wei Xu, Xin Liu, and Yihong Gong. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273, New York, NY, USA. ACM.

Wen-tau Yih and Vahed Qazvinian. 2012. Measuring word relatedness using heterogeneous vector space models. In Proceedings of NAACL-HLT, pages 616–620, Montréal, Canada, June.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 247–256, Portland, Oregon, USA, June. Association for Computational Linguistics.

Wen-tau Yih, Geoffrey Zweig, and John Platt. 2012. Polarity inducing latent semantic analysis. In Proceedings of NAACL-HLT, pages 1212–1222, Jeju Island, Korea, July.

Alisa Zhila, Wen-tau Yih, Christopher Meek, Geoffrey Zweig, and Tomas Mikolov. 2013. Combining heterogeneous models for measuring relational similarity. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1000–1009, Atlanta, Georgia, June. Association for Computational Linguistics.
