Term Representation with Generalized Latent Semantic Analysis

Irina Matveeva and Gina-Anne Levow
Department of Computer Science, University of Chicago, Chicago, IL 60637
{matveeva,levow}@cs.uchicago.edu

Ayman Farahat and Christiaan Royer
Palo Alto Research Center, Palo Alto, CA 94304
{farahat,royer}@parc.com

Abstract

Document indexing and representation of term-document relations are very important issues for document clustering and retrieval. In this paper, we present Generalized Latent Semantic Analysis as a framework for computing semantically motivated term and document vectors. Our focus on term vectors is motivated by the recent success of co-occurrence based measures of semantic similarity obtained from very large corpora. Our experiments demonstrate that GLSA term vectors efficiently capture semantic relations between terms and outperform related approaches on the synonymy test.

1 Introduction

Document indexing and representation of term-document relations are crucial for document classification, clustering and retrieval (Salton & McGill 83; Ponte & Croft 98; Deerwester et al. 90). Since many classification and categorization algorithms require a vector space representation for the data, it is often important to have a document representation within the vector space model approach (Salton & McGill 83). In the traditional bag-of-words representation of document vectors (Salton & McGill 83), words represent orthogonal dimensions, which makes an unrealistic assumption about the independence of terms within documents.

Modifications of the representation space, such as representing dimensions with distributional term clusters (Bekkerman et al. 03) and expanding the document and query vectors with synonyms and related terms as discussed in (Levow et al. 05), improve the performance on average. However, they also introduce some instability and thus increased variance (Levow et al. 05). The language modelling approach (Salton & McGill 83; Ponte & Croft 98; Berger & Lafferty 99) used in information retrieval uses bag-of-words document vectors to model document and collection based term distributions.

Since the document vectors are constructed in a very high dimensional vocabulary space, there has also been considerable interest in low-dimensional document representations. Latent Semantic Analysis (LSA) (Deerwester et al. 90) is one of the best known dimensionality reduction algorithms used in information retrieval. Its most appealing features are the ability to interpret the dimensions of the resulting vector space as semantic concepts and the fact that the analysis of the semantic relatedness between terms is performed implicitly, in the course of a matrix decomposition. However, LSA often does not perform well on large heterogeneous collections (Ando 00). Different related dimensionality reduction techniques have proved successful for document clustering and retrieval (Belkin & Niyogi 03; He et al. 04; Callan et al. 03).

In this paper, we introduce Generalized Latent Semantic Analysis (GLSA) as a framework for computing semantically motivated term and document vectors. As opposed to LSA and other dimensionality reduction algorithms which are applied to documents, we focus on computing term vectors; document vectors are computed as linear combinations of term vectors. Thus, unlike LSA (Deerwester et al. 90), Iterative Residual Rescaling (Ando 00), and Locality Preserving Indexing (He et al. 04), GLSA is not based on bag-of-words document vectors. Instead, we begin with semantically motivated pair-wise term similarities and compute a representation for terms. This shift from the dual document-term representation to a term representation has the following motivation.

Terms offer much greater flexibility in exploring similarity relations than documents. The availability of large document collections such as the Web offers a great resource for statistical approaches. Recently, co-occurrence based measures of semantic similarity between terms have been shown to improve performance on such tasks as the synonymy test, taxonomy induction, and document clustering (Turney 01; Terra & Clarke 03; Chklovski & Pantel 04; Widdows 03). On the other hand, many semi-supervised and transductive methods based on document vectors cannot yet handle such large document collections and take full advantage of this information.

In addition, content bearing words, i.e. words which convey the most semantic information, are often combined into semantic classes that correspond to particular activities or relations and contain synonyms and semantically related words. Therefore, it seems very natural to represent terms as low dimensional vectors in the space of semantic concepts.

In this paper, we use a large document collection to extract point-wise mutual information, use the singular value decomposition as a dimensionality reduction method, and compute term vectors. Our experiments show that the GLSA term representation outperforms related approaches on term-based tasks such as the synonymy test.

The rest of the paper is organized as follows. Section 2 contains the outline of the GLSA algorithm and discusses the method of dimensionality reduction as well as the term association measures used in this paper. Section 4 presents our experiments, followed by the conclusion in Section 5.

2 Generalized Latent Semantic Analysis

2.1 GLSA Framework

The GLSA algorithm has the following setup. We assume that we have a document collection C with vocabulary V. We also have a large Web based corpus W.

1. Construct the weighted term-document matrix D based on C.

2. For the vocabulary words in V, obtain a matrix of pair-wise similarities, S, using the large corpus W.

3. Obtain the matrix U^T of a low dimensional vector space representation of terms that preserves the similarities in S, U^T ∈ R^{k×|V|}.

4. Compute document vectors by taking linear combinations of term vectors, D̂ = U^T D.

The columns of D̂ are documents in the k-dimensional space.
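To make these steps concrete, the following minimal Python/numpy sketch realizes the pipeline, assuming the weighted term-document matrix D and the pair-wise similarity matrix S are given (the tf-idf and PMI choices named in the comments are illustrative assumptions, and the eigendecomposition used for step 3 anticipates Section 2.2.1; this is a sketch, not the exact experimental implementation):

```python
import numpy as np

def glsa(D, S, k):
    """Schematic GLSA pipeline (steps 1-4).

    D: |V| x n weighted term-document matrix built from collection C
       (step 1; e.g. tf-idf weights -- an assumed choice).
    S: |V| x |V| symmetric matrix of pair-wise term similarities
       estimated from the large corpus W (step 2; e.g. PMI scores).
    k: dimensionality of the target semantic space.
    """
    # Step 3: keep the eigenvectors of S with the k largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    Ut = eigvecs[:, top].T                  # U^T, a k x |V| matrix of term vectors
    # Step 4: documents as linear combinations of term vectors.
    D_hat = Ut @ D                          # k x n matrix of document vectors
    return Ut, D_hat
```

Note that an unseen document only needs to be multiplied by U^T to obtain its k-dimensional representation; Section 2.1.1 returns to this point.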
The motivation for the condition on the low dimensional representation in step 3 can be explained in the following way. Traditionally, cosine similarity between term and document vectors is used as a measure of semantic association. Therefore, we would like to obtain term vectors so that their pair-wise cosine similarities correspond to the semantic similarity between the corresponding vocabulary terms. The extent to which these latter similarities can be preserved depends on the dimensionality reduction method. Some techniques aim at preserving all pair-wise similarities, for example the singular value decomposition used in this paper. Some graph-based approaches, on the other hand, preserve the similarities only locally, between the pairs of most related terms, e.g. Laplacian Eigenmaps Embedding (Belkin & Niyogi 03) and Locality Preserving Indexing (He et al. 04).

The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction. The traditional term-document matrix is used in the last step to provide the weights in the linear combination of term vectors.

In step 2, it is possible to compute the matrix S for the vocabulary of the large corpus W and use the term vectors to represent the documents in C. In addition to being computationally demanding, however, this approach would suffer from noise introduced by typos and by infrequent and non-informative words. Finding methods for efficiently filtering the core vocabulary so that only content bearing words are kept would be another way of addressing this issue; this is a subject of future work.

2.1.1 Document Vectors

One of the advantages of the term-based GLSA document representation is that it does not have the out-of-sample problem for new documents. It does have this problem for new terms, but new terms appear at a much lower rate than documents. In addition, new rare terms will not contribute much to document classification or retrieval. Since the computation of the term vectors is done off-line, the GLSA approach requires only occasional updates of the term representation.

GLSA provides a representation for documents that reflects their general semantics. Since GLSA does not transform the document vectors in the course of the computation, the GLSA document representation can easily be extended to contain more specific information such as the presence of proper names, dates, or numerical information.

2.2 Low-dimensional Representation

2.2.1 Singular Value Decomposition

In this section we outline some of the basic properties of the singular value decomposition (SVD), which we use as a method of dimensionality reduction. SVD is applied to the matrix S that contains the pair-wise similarities between the vocabulary terms.

First, consider the eigenvalue decomposition of S. Since S is a real symmetric matrix, it is diagonalizable, i.e. it can be represented as

S = U Σ U^T.

The columns of U are the orthogonal eigenvectors of S, and Σ is a diagonal matrix containing the corresponding eigenvalues of S. If, in addition, S is positive semi-definite, it can be represented as a product of two matrices, S = Û Û^T, where Û = U Σ^{1/2}.
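The following toy example, with assumed similarity values for a four-term vocabulary, checks both identities numerically:

```python
import numpy as np

# Toy pair-wise similarity matrix for a 4-term vocabulary (assumed values).
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.7],
              [0.0, 0.1, 0.7, 1.0]])

# Eigenvalue decomposition S = U Sigma U^T of the real symmetric matrix S.
sigma, U = np.linalg.eigh(S)
assert np.allclose(U @ np.diag(sigma) @ U.T, S)

# If S is positive semi-definite (no negative eigenvalues), then
# S = U_hat U_hat^T with U_hat = U Sigma^(1/2): the entries of S are
# inner products of the rows of U_hat, i.e. of term vectors.
if np.all(sigma >= 0):
    U_hat = U @ np.diag(np.sqrt(sigma))
    assert np.allclose(U_hat @ U_hat.T, S)
```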
LSA is one special case within the GLSA framework. Although it begins with the document-term matrix, it can be shown that LSA uses SVD to compute the rank k approximation to a particular matrix of pair-wise term similarities. In the LSA case, these similarities are computed as the inner products between the term vectors in the space of documents; see (Bartell et al. 92) for details. If the GLSA matrix S is positive semi-definite, its entries represent inner products between term vectors in a feature space. Thus, GLSA with the eigenvalue decomposition can be interpreted as kernelized LSA, similar to kernel PCA (Schölkopf et al. 98). Since S contains co-occurrence based similarities which have been shown to reflect semantic relations between terms, GLSA uses semantic kernels.

2.2.2 PMI as Measure of Semantic Association
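Point-wise mutual information compares the probability of observing two terms together with the probability of observing them independently: PMI(t1, t2) = log( P(t1, t2) / (P(t1) P(t2)) ). As a minimal sketch, PMI scores can be estimated from document-level co-occurrence counts as follows (a realistic setup would count co-occurrence within context windows over the large corpus W and apply smoothing and frequency cutoffs; the toy documents below are assumptions for illustration):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs, vocab):
    """Estimate PMI(t1, t2) = log(P(t1, t2) / (P(t1) * P(t2)))
    from document-level co-occurrence counts (a schematic estimator)."""
    n = len(docs)
    term_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        terms = sorted(set(doc) & vocab)   # vocabulary terms in this document
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    scores = {}
    for (t1, t2), joint in pair_counts.items():
        p_joint = joint / n
        p1, p2 = term_counts[t1] / n, term_counts[t2] / n
        scores[(t1, t2)] = math.log(p_joint / (p1 * p2))
    return scores

# Tiny illustration with made-up documents.
docs = [["stock", "market", "fund"], ["stock", "fund"], ["movie", "actor"]]
print(pmi_scores(docs, {"stock", "market", "fund", "movie", "actor"}))
```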
