
Latent Semantic Indexing: An Overview


INFOSYS 240 Spring 2000 Final Paper

Barbara Rosario

1) Introduction

Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user's query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of the conceptual topic or meaning of a document. Latent Semantic Indexing (LSI) [Deerwester et al] tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition (SVD) is used to estimate this structure in word usage across documents. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD. Performance data show that these statistically derived vectors are more robust indicators of meaning than individual terms. Section 2 reviews the basic concepts needed to understand LSI. Section 3 describes some of the advantages and disadvantages of LSI. The effectiveness of LSI has been demonstrated empirically in several text collections as increased average retrieval precision, but a theoretical (and quantitative) understanding beyond this empirical evidence is desirable; Section 4 describes some of the attempts that have been made in this direction. Finally, Section 5 surveys some applications of LSI.


2) Basic concepts

Latent Semantic Indexing is a technique that projects queries and documents into a space with "latent" semantic dimensions. In the latent semantic space, a query and a document can have high similarity even if they do not share any terms, as long as their terms are semantically similar in a sense to be described later. We can look at LSI as a similarity metric that is an alternative to word-overlap measures like tf.idf. The latent semantic space that we project into has fewer dimensions than the original space (which has as many dimensions as terms). LSI is thus a method for dimensionality reduction. A dimensionality reduction technique takes a set of objects that exist in a high-dimensional space and represents them in a low-dimensional space, often a two-dimensional or three-dimensional space for the purposes of visualization. Latent Semantic Indexing is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a word-by-document matrix. SVD (and hence LSI) is a least-squares method: the projection into the latent semantic space is chosen such that the representations in the original space are changed as little as possible when measured by the sum of the squares of the differences. SVD takes a matrix A and represents it as Â in a lower-dimensional space such that the "distance" between the two matrices as measured by the 2-norm is minimized:

\Delta = \| A - \hat{A} \|_2

The 2-norm for matrices is the equivalent of Euclidean distance for vectors. SVD projects an n-dimensional space onto a k-dimensional space, where n >> k. In our application (word-document matrices), n is the number of word types in the collection. Values of k that are frequently chosen are 100 and 150. The projection transforms a document's vector in n-dimensional word space into a vector in the k-dimensional reduced space. There are many different mappings from high-dimensional to low-dimensional spaces. Latent Semantic Indexing chooses the mapping that is optimal in the sense that it minimizes the distance Δ. This setup has the consequence that the dimensions of the reduced space correspond to the axes of greatest variation.¹

¹ This is closely related to Principal Component Analysis (PCA), another technique for dimensionality reduction. One difference between the two techniques is that PCA can only be applied to a square matrix, whereas LSI can be applied to any matrix.


The SVD projection is computed by decomposing the term-by-document matrix A_{t×d} into the product of three matrices, T_{t×n}, S_{n×n} and D_{d×n}:

A_{t \times d} = T_{t \times n} S_{n \times n} (D_{d \times n})^T

where t is the number of terms, d is the number of documents, n = min(t, d), T and D have orthonormal columns, i.e. T^T T = D^T D = I, rank(A) = r, and S = diag(σ_1, σ_2, ..., σ_n) with σ_i > 0 for 1 ≤ i ≤ r and σ_j = 0 for j ≥ r + 1. We can view SVD as a method for rotating the axes of the n-dimensional space such that the first axis runs along the direction of largest variation among the documents, the second dimension runs along the direction with the second largest variation, and so forth. The matrices T and D represent terms and documents in this new space. The diagonal matrix S contains the singular values of A in descending order; the i-th singular value indicates the amount of variation along the i-th axis. By restricting the matrices T, S and D to their first k < n columns (and, for S, rows) one obtains the matrices T_{t×k}, S_{k×k} and (D_{d×k})^T. Their product Â,

\hat{A} = T_{t \times k} S_{k \times k} (D_{d \times k})^T

is the best least-squares approximation of A by a matrix of rank k in the sense defined by the equation Δ = ||A − Â||_2 above. Choosing the number of dimensions (k) for Â is an interesting problem. While a reduction in k can remove much of the noise, keeping too few dimensions or factors may lose important information. As discussed in [Deerwester et al] using a test database of medical abstracts, LSI performance can improve considerably after 10 or 20 dimensions, peaks between 70 and 100 dimensions, and then begins to diminish slowly. This pattern of performance (initial large increase and slow decrease to word-based performance) is observed with other datasets as well. Eventually performance must approach the level of performance attained by standard vector methods, since with k = n factors Â will exactly reconstruct the original term-by-document matrix A. That LSI works well with a number of dimensions or factors k that is small relative to the number of unique terms shows that these dimensions are, in fact, capturing a major portion of the meaningful structure. [Berry et al.]
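To make the decomposition concrete, here is a minimal Python/NumPy sketch of computing a rank-k LSI space from a toy term-by-document matrix. The vocabulary, the counts and the choice k = 2 are purely illustrative and not taken from the paper.

```python
# Minimal sketch: building a rank-k LSI space from a toy term-document matrix.
import numpy as np

# Term-by-document matrix A (t x d): rows = terms, columns = documents.
A = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 0, 0],   # "automobile"
    [0, 1, 1, 0],   # "engine"
    [0, 0, 1, 1],   # "flower"
    [0, 0, 0, 1],   # "petal"
], dtype=float)

k = 2  # number of latent dimensions to keep (illustrative)

# Full SVD: A = T S D^T, singular values returned in descending order.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Truncate to the first k singular triplets.
T_k = T[:, :k]            # t x k term matrix
S_k = np.diag(s[:k])      # k x k diagonal matrix of singular values
D_k = Dt[:k, :].T         # d x k document matrix

# Rank-k approximation A_hat = T_k S_k D_k^T, the closest rank-k matrix
# to A in the 2-norm / least-squares sense.
A_hat = T_k @ S_k @ D_k.T
print("approximation error:", np.linalg.norm(A - A_hat, 2))
```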


One can also prove that the SVD is unique, that is, there is only one possible decomposition of a given matrix. That SVD finds the optimal projection to a low-dimensional space is the key property for exploiting word co-occurrence patterns. It is important for the LSI method that the derived Â matrix does not reconstruct the original term-document matrix A exactly. The truncated SVD, in one sense, captures most of the important underlying structure in the association of terms and documents, yet at the same time removes the noise or variability in word usage that plagues word-based retrieval methods. Intuitively, since the number of dimensions, k, is much smaller than the number of unique terms, t, minor differences in terminology will be ignored. Terms which occur in similar documents, for example, will be near each other in the k-dimensional factor space even if they never co-occur in the same document. This means that some documents which do not share any words with a user's query may nonetheless be near it in k-space. This derived representation, which captures term-term associations, is used for retrieval.

2.1) Queries

For purposes of retrieval, a user's query must be represented as a vector in k-dimensional space and compared to documents. A query (like a document) is a set of words. For example, a user's query can be represented by

\hat{q} = q^T T_{t \times k} S_{k \times k}^{-1}

where q is simply the vector of words in the user's query, multiplied by the appropriate term weights. The sum of these k-dimensional term vectors is reflected by the term q^T T_{t×k} in the above equation, and the right multiplication by S_{k×k}^{-1} differentially weights the separate dimensions. Thus, the query vector is located at the weighted sum of its constituent term vectors. The query vector can then be compared to all existing document vectors, and the documents ranked by their similarity (nearness) to the query. One common measure of similarity is the cosine between the query vector and a document vector. Typically, the z closest documents, or all documents exceeding some cosine threshold, are returned to the user.
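Continuing the toy sketch above (same illustrative T_k, S_k, D_k), the query mapping and cosine ranking can be written as follows. Using raw term counts as query weights, and scaling the document vectors by the singular values, are simplifying assumptions; other weighting conventions are also used.

```python
# Continuing the sketch above: fold a query into the k-dimensional space as
# q_hat = q^T T_k S_k^{-1} and rank documents by cosine similarity.
q = np.array([1, 1, 0, 0, 0], dtype=float)   # query: "car", "automobile"

q_hat = q @ T_k @ np.linalg.inv(S_k)          # 1 x k query vector

# Document vectors: rows of D_k scaled by the singular values.
# (Comparing against the rows of D_k directly is another common convention.)
doc_vecs = D_k @ S_k                          # d x k document vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

scores = [cosine(q_hat, dvec) for dvec in doc_vecs]
ranking = np.argsort(scores)[::-1]
print("documents ranked by similarity to the query:", ranking)
```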


2.2) Updating

One remaining problem for a practical application is how to fold queries and new documents into the reduced space. The SVD computation only gives us reduced representations for the document vectors in matrix A. We do not want to do a completely new SVD every time a new query is issued or new documents and terms are added to the collection. There are currently two alternatives for incorporating new documents and terms: recomputing the SVD of a new term-document matrix, or folding-in the new terms and documents. Let's define some terms that are used when discussing updating.

• Updating refers to the general process of adding new terms and/or documents to an existing LSI-generated database. Updating can mean either folding-in or SVD-updating.
• SVD-updating is the new method of updating developed in [O'Brien].
• Folding-in terms or documents is a much simpler alternative that uses an existing SVD to represent new information.
• Recomputing the SVD is not an updating method, but a way of creating an LSI-generated database with new terms and/or documents from scratch, which can be compared to either updating method.

Recomputing the SVD of a larger term-document matrix requires more computation time and, for large problems, may be impossible due to memory constraints. In contrast, folding-in is based on the existing latent semantic structure, the current Â, and hence new terms and documents have no effect on the representation of the pre-existing terms and documents. Folding-in requires less time and memory but can have deteriorating effects on the representation of the new terms and documents. In addition, in order to handle large corpora efficiently we may want to do the SVD for only a sample of the documents (for example a third or a fourth). The remaining documents would then be folded in. Folding-in documents is essentially the process described in the previous section for query representation. Each new document is represented as a weighted sum of its component term vectors. Once a new document vector has been computed it is appended to the set of existing document vectors. Similarly, new terms can be represented as a weighted sum of the vectors for documents in which they appear. The equation for folding documents into the space can again be derived from the basic SVD equation:


A = T S D^T

T^T A = T^T T S D^T

T^T A = S D^T

So we just multiply the query or document vector with the transpose of the term matrix T (after it has been truncated to the desired dimensionality).
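Building on the same illustrative variables from the sketches above, folding a new document into the existing space might look like this; the new document's term counts are invented for the example.

```python
# Continuing the sketch: folding a new document into the existing LSI space
# without recomputing the SVD. From T^T A = S D^T, a new document column
# a_new maps to the document-matrix row  d_new = a_new^T T_k S_k^{-1}.
a_new = np.array([0, 0, 1, 1, 1], dtype=float)   # counts for "engine", "flower", "petal"

d_new = a_new @ T_k @ np.linalg.inv(S_k)          # 1 x k reduced representation

# Append to the existing document matrix; the old documents are unaffected.
D_k = np.vstack([D_k, d_new])
print("document matrix now holds", D_k.shape[0], "documents")
```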

3) Advantages and disadvantages

3.1) Advantages

1. True (latent) dimensions
The assumption in LSI (and similarly for other forms of dimensionality reduction like principal component analysis) is that the new dimensions are a better representation of documents and queries. The metaphor underlying the term "latent" is that these new dimensions are the true representation. This true representation was then obscured by a generation process that expressed a particular dimension with one set of words in some documents and a different set of words in another document. LSI analysis recovers the original semantic structure of the space and its original dimensions.

[Deerwester et al] describe the three major advantages of using the LSI representation with the following labels: synonymy, polysemy, and term dependence.

2. Synonymy
Synonymy refers to the fact that the same underlying concept can be described using different terms. Traditional retrieval strategies have trouble discovering documents on the same topic that use a different vocabulary. In LSI, the concept in question as well as all documents that are related to it are all likely to be represented by a similar weighted combination of indexing variables.


3. Polysemy
Polysemy describes words that have more than one meaning, which is a common property of language. Large numbers of polysemous words in the query can reduce the precision of a search significantly. By using a reduced representation in LSI, one hopes to remove some "noise" from the data, which could be described as rare and less important usages of certain terms. (Note however that this works only when the real meaning is close to the average meaning. Since the LSI term vector is just a weighted average of the different meanings of the term, when the real meaning differs from the average meaning LSI may actually reduce the quality of the search.)

4. Term Dependence
The traditional model assumes term independence and terms serve as the orthogonal basis vectors of the vector space. Since there are strong associations between terms in language, this assumption is never satisfied. While term independence represents the most reasonable first-order approximation, it should be possible to obtain improved performance by using term associations in the retrieval process. Adding common phrases as search items is a simple application of this approach. The LSI factors, on the other hand, are orthogonal by definition, and terms are positioned in the reduced space in a way that reflects the correlations in their use across documents. It is very difficult to take advantage of term associations without dramatically increasing the computational requirements of the retrieval problem. While the LSI solution is difficult to compute for large collections, it need only be constructed once for the entire collection, and performance at retrieval time is not affected.

3.2) Disadvantages

1. Storage
One could argue that the SVD representation is more compact: many documents have more than 150 unique terms, so the sparse vector representation would take up more storage space than the compact SVD representation if we reduce to 150 dimensions. In reality, the opposite is actually true [Hull]. For example, the document-by-term matrix for the Cranfield collection used in Hull's experiments had 90,441 non-zero entries (after stemming and stop word removal). Retaining only 100 of the possible 1399 LSI vectors requires storing 139,900 values for the documents alone. The term vectors require the storage of roughly 400,000 additional values. In addition, the LSI values are real numbers while the original term frequencies are integers, adding to the storage costs. Using LSI vectors, we can no longer take advantage of the fact that each term occurs in a limited number of documents, which accounts for the sparse nature of the term-by-document matrix. With recent advances in electronic storage media, the storage requirements of LSI are not a critical problem, but the loss of sparseness has other, more serious implications.

2. Efficiency
One of the most important speed-ups in vector space search comes from using an inverted index. As a consequence, only documents that have some terms in common with the query must be examined during the search. With LSI, however, the query must be compared to every document in the collection. There are, however, several factors that can reduce or eliminate this drawback. If the query has more terms than its representation in the LSI vector space, then inner product similarity scores will take more time to compute in term space. For example, if relevance feedback is conducted using the full text of the relevant documents, the number of terms in the query is likely to grow to be many times the number of LSI vectors, leading to a corresponding increase in search time. In addition, using a data structure such as the k-d tree in conjunction with LSI would greatly speed the search for nearest neighbors, provided only a partial ordering of the documents is required. Most of the additional costs come in the pre-processing stage when the SVD and the k-d tree are computed, and actual search time should not be significantly degraded. Other query expansion techniques suffer even more heavily from the difficulties described above, and LSI performs relatively well for long documents due to the small number of context vectors used to describe each document. However, implementation of LSI does require an additional investment of storage and computing time. [Hull]
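As an illustration of the nearest-neighbor point above, the following self-contained sketch builds a k-d tree (SciPy's cKDTree) over length-normalized LSI document vectors, so that Euclidean nearest neighbours coincide with cosine nearest neighbours. The random vectors, the dimensions, and the use of SciPy are assumptions for the example, and in spaces of 100+ dimensions the gain of a k-d tree over brute-force comparison may be modest.

```python
# Self-contained sketch: a k-d tree over length-normalized LSI document
# vectors (random stand-ins for real LSI output). For unit vectors, squared
# Euclidean distance equals 2 - 2*cosine, so the neighbour ordering matches
# cosine similarity.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
lsi_docs = rng.normal(size=(10000, 100))                      # d x k document vectors
lsi_docs /= np.linalg.norm(lsi_docs, axis=1, keepdims=True)   # unit length

tree = cKDTree(lsi_docs)        # built once, at pre-processing time

query_vec = rng.normal(size=100)
query_vec /= np.linalg.norm(query_vec)

dist, idx = tree.query(query_vec, k=10)   # ten nearest documents
print(idx)
```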

3. LSI and normally-distributed data
Another objection to SVD is that, along with all other least-squares methods, it is really designed for normally-distributed data, but such a distribution is inappropriate for count data, and count data is what a term-by-document matrix consists of. The link between least squares and the normal distribution can be easily seen by looking at the definition of the normal distribution:

n(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]

where μ is the mean and σ the standard deviation. The smaller the squared deviation from the mean, (x − μ)², the higher the probability n(x; μ, σ). So the least-squares solution is the maximum-likelihood solution (the short derivation below makes this explicit). But this is only true if the underlying data distribution is normal. Other distributions like the Poisson or negative binomial are more appropriate for term counts. One problematic feature of SVD is that, since the reconstruction Â of the term-by-document matrix A is based on a normal distribution, it can have negative entries, clearly an inappropriate approximation for counts. A dimensionality reduction based on the Poisson distribution would not predict such impossible negative counts. In defense of LSI (and the vector space model in general, which can also be argued to assume a normal distribution), one can say that the matrix entries are not counts, but weights. Although this is not an issue that has been investigated systematically, the normal distribution could be appropriate for the weighted vectors even if it is not for count vectors. [Manning and Schutze]
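To make the link explicit, here is the standard derivation (general, not specific to this paper): for independent observations x_1, ..., x_N and a fixed σ,

\log \prod_{i=1}^{N} n(x_i; \mu, \sigma) = \sum_{i=1}^{N} \left[ -\log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2 \right] = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2

so maximizing the Gaussian likelihood over μ is exactly minimizing the sum of squared deviations; this is the sense in which the SVD's least-squares criterion corresponds to an implicit normality assumption.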

Ultimately, to decide whether the advantages outweigh the disadvantages, we need to look at retrieval performance. While [Deerwester et al] have obtained some promising results, they do not show conclusively that retrieval using LSI is superior to the basic vector space model. [Hull] addresses this issue in the context of the routing problem and provides evidence that LSI slightly improves performance for the routing task. In other words, a winner has yet to be proclaimed.

4) Toward a theoretical foundation

Although improved performance has been observed empirically (if modestly), there is very little in the literature in the way of a mathematical theory that predicts this improvement. In this section I briefly describe one paper, [Papadimitriou et al.], that attempts to use mathematical techniques to rigorously explain the empirically observed improved performance of LSI. Papadimitriou et al. start by citing an interesting mathematical fact due to Eckart and Young, often cited as an explanation of the improved performance of LSI, which states, informally, that LSI retains as much as possible the relative positions (and distances) of the document vectors while projecting them to a lower-dimensional space. This may only provide an explanation of why LSI does not deteriorate too much in performance relative to conventional vector-space methods; it fails to justify the observed improvement in precision and recall.


The theorem says that Â is a "good approximation" of A in the following sense:

Theorem (Eckart and Young). Among all t × d matrices C of rank at most k, Â is the one that minimizes

\| A - C \|^2 = \sum_{i,j} (A_{i,j} - C_{i,j})^2

This is what we saw in Section 2. It remains to be seen in what way this improves retrieval capabilities. Since LSI seems to exploit and reveal the statistical properties of a corpus, [Papadimitriou et al.] start with a rigorous probabilistic model of the corpus (i.e., a mathematical model of how corpora are generated). They model topics as probability distributions on terms; a document is a probability distribution that is a combination of a small number of topics, and a corpus is a collection of documents obtained by repeatedly drawing sample documents. Once they have a corpus model, they would like to determine under what conditions LSI results in enhanced retrieval. They would like to prove a theorem stating essentially that if the corpus is a reasonably focused collection of meaningfully correlated documents, then LSI performs well. The problem is to define these terms so that (1) there is a reasonably close correspondence with what they mean intuitively and in practice, and (2) the theorem can be proved. [Papadimitriou et al.] prove results that, although not quite as general as the statement above, definitely point in this direction. In particular, they show that:

In the special case in which

(a) there is no style modifier;
(b) each document is on a single topic; and
(c) the terms are partitioned among the topics so that each topic distribution has high probability on its own terms and low probability on all others;

then LSI, projecting to a subspace of dimension equal to the number of topics, will discover these topics exactly with high probability, assuming that the length of each document in the corpus is large enough.

[Papadimitriou et al.] also show that, when the term-document matrix is projected onto a completely random low-dimensional subspace, with high probability one obtains a distance-preservation property akin to that enjoyed by LSI. This suggests that random projection may yield an interesting improvement on LSI: we can perform the LSI precomputation not on the original term-document matrix, but on a low-dimensional projection of it, at great computational savings and no great loss of accuracy (a brief code sketch of this idea is given at the end of this section). Random projection can be seen as an alternative to (and a justification of) sampling in LSI. Reports on LSI experiments in the literature seem to suggest that LSI is often done not on the entire corpus, but on a randomly selected sub-corpus (both terms and documents may be sampled, although it appears that most often documents are). There is very little non-empirical evidence of the accuracy of such sampling. [Papadimitriou et al.] suggest a different and more elaborate (and computationally intensive) approach, projection onto a random low-dimensional subspace, which can be rigorously proved to be accurate.

Another paper, [Ding], establishes a statistical framework for LSI and justifies dimensionality reduction by the statistical significance of latent semantic vectors as measured by the likelihood of the model. The model proposed provides a mechanism to clarify and quantify the argument that LSI reduces the noise while preserving the true dimensions, and it does so by checking the statistical significance of the semantic dimensions: if a few semantic dimensions can effectively characterize the data statistically, as indicated by the likelihood of the model, we can believe that they also effectively represent the semantic meaning/relationships as defined by the cosine similarity. The likelihood is the key to the verification of the optimal semantic subspace that LSI advocates. [Ding] gives theoretical results to support the existence of such a semantic subspace.
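The following self-contained sketch illustrates the random-projection idea mentioned above in Python/NumPy. The matrix sizes, the Gaussian projection, and the Poisson stand-in data are assumptions made for illustration; this is not the exact construction analyzed by [Papadimitriou et al.].

```python
# Sketch: project the t x d term-document matrix onto a random m-dimensional
# subspace first, then run the (much cheaper) truncated SVD on the result.
import numpy as np

rng = np.random.default_rng(0)
t, d, m, k = 5000, 2000, 300, 100      # terms, documents, projection dim, LSI dim

A = rng.poisson(0.01, size=(t, d)).astype(float)   # stand-in term-document counts

# Random Gaussian projection from term space (t dims) down to m dims.
R = rng.normal(size=(m, t)) / np.sqrt(m)
B = R @ A                                          # m x d, much smaller than A

# Truncated SVD of the projected matrix instead of the full one.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T          # d x k document representations
print(doc_vecs.shape)                              # (2000, 100)
```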

5) Applications of LSI

This section surveys several promising applications of LSI.

5.1) Information retrieval

The application of Singular Value Decomposition to information retrieval was originally proposed by a group of researchers at Bellcore [Deerwester et al] and called Latent Semantic Indexing in this context. At this point it should be clear how to use LSI for IR. Regarding performance, [Berry et al.] report that for several information science test collections, the average precision using LSI ranged from comparable to 30% better than that obtained using standard keyword vector methods. The LSI method performs best relative to standard vector methods when the queries and relevant documents do not share many words, and at high levels of recall.


5.2) Relevance Feedback

Most of the tests of relevance feedback using LSI have involved a method in which the initial query is replaced with the vector sum of the documents the user has selected as relevant. The use of negative information has not yet been exploited in LSI, for example by moving the query away from documents which the user has indicated are irrelevant. Replacing the user's query with the first relevant document improves performance by an average of 33%, and replacing it with the average of the first three relevant documents improves performance by an average of 67%. Relevance feedback provides sizable and consistent retrieval advantages. One way of thinking about the success of these methods is that the many words from relevant documents augment an initial query that is usually quite impoverished. LSI does some of this kind of query expansion or enhancement even without relevance information, but can be augmented with relevance information. [Berry et al.]
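A minimal, self-contained sketch of this feedback step follows; the synthetic vectors stand in for rows of the LSI document matrix, and the choice of the top three retrieved documents as "relevant" is arbitrary.

```python
# Self-contained sketch: relevance feedback in an LSI space, replacing the
# original query by the mean of the documents judged relevant.
import numpy as np

rng = np.random.default_rng(1)
doc_vecs = rng.normal(size=(1000, 100))       # d x k LSI document vectors (stand-ins)
query = rng.normal(size=100)                  # original query in LSI space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Initial ranking; pretend the user marks the top 3 documents as relevant.
initial = np.argsort([cosine(query, dv) for dv in doc_vecs])[::-1]
relevant = doc_vecs[initial[:3]]

# Feedback query: mean of the relevant documents' LSI vectors.
feedback_query = relevant.mean(axis=0)
reranked = np.argsort([cosine(feedback_query, dv) for dv in doc_vecs])[::-1]
print(reranked[:10])
```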

5.3) Information Filtering

Applying LSI to information filtering applications is straightforward. An initial sample of documents is analyzed using standard LSI/SVD tools. A user's interest is represented as one (or more) vectors in this reduced-dimension LSI space. Each new document is matched against the interest vector, and if it is similar enough it is recommended to the user. Learning methods like relevance feedback can be used to improve the representation of interest vectors over time. Performance studies are encouraging.

5.4) TREC

Recently, LSI has been used for both information filtering and information retrieval in TREC. The queries are very long and detailed descriptions, averaging more than 50 words in length. The fact that the TREC queries are quite rich means that smaller advantages would be expected for LSI or any other method that attempts to enhance users' queries. The big challenge in this collection was to extend the LSI tools to handle collections of this size. The results were quite encouraging. At the time of the TREC conferences it was not reasonable to compute Â for the complete collection. Instead, a sample of about 70,000 documents and 90,000 terms was used. Such term-by-document matrices (A) are quite sparse, containing only .001 - .002% non-zero entries. Computing A_200, i.e. the 200 largest singular values and corresponding singular vectors, required about 18 hours of CPU time on a SUN SPARCstation 10 workstation. Documents not in the original LSI analysis were folded in as previously described in Section 2.2. Although it is very difficult to compare across systems in any detail because of large pre-processing, representation and matching differences, LSI performance was quite good [Dumais 94]. For filtering tasks, using information about known relevant documents to create a vector for each query was beneficial. The retrieval advantage of 31% was somewhat smaller than that observed for other filtering tests and is attributable to the good initial queries in TREC. For retrieval tasks, LSI showed a 16% improvement when compared with the keyword vector methods. Again the detailed original queries account for the somewhat smaller advantages than previously observed. [Berry et al.]

5.5) Cross-Language Retrieval

It is important to note that the LSI analysis makes no use of English syntax or semantics, which means that LSI is applicable to any language. In addition, it can be used for cross-language retrieval: documents are in several languages, and user queries (again in several languages) can match documents in any language. What is required for cross-language applications is a common space in which words from many languages are represented. [Landauer and Littman] describe one method for creating such an LSI space. The original term-document matrix is formed using a collection of abstracts that have versions in more than one language (French and English, in their experiments). Each abstract is treated as the combination of its French and English versions. The truncated SVD is computed for this term-by-combined-abstract matrix A. The resulting space consists of combined-language abstracts, English words and French words. English words and French words that occur in similar combined abstracts will be near each other in the reduced-dimension LSI space. After this analysis, monolingual abstracts can be folded in: a French abstract will simply be located at the vector sum of its constituent words that are already in the LSI space. Queries in either French or English can be matched to French or English abstracts. There is no difficult translation involved in retrieval from the multilingual LSI space. Experiments showed that the completely automatic multilingual space was more effective than single-language spaces. The retrieval of French documents in response to English queries (and vice versa) was as effective as first translating the queries into French and searching a French-only database. The method has shown almost as good results for retrieving English abstracts and Japanese Kanji ideographs, and for multilingual translations (English and Greek) of the Bible [Young]. [Berry et al.]
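A tiny self-contained sketch of this setup: the training matrix has one column per combined French+English abstract and one row per term of either language, and monolingual documents are folded in afterwards. The vocabulary and counts are invented for the example.

```python
# Self-contained sketch of a cross-language LSI space built from combined
# (English+French) training abstracts; monolingual texts are folded in later.
import numpy as np

terms = ["car", "engine", "voiture", "moteur", "flower", "fleur"]
# Two combined training abstracts: one about cars, one about flowers.
A = np.array([
    [1, 0],   # car
    [1, 0],   # engine
    [1, 0],   # voiture
    [1, 0],   # moteur
    [0, 1],   # flower
    [0, 1],   # fleur
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, S_k = T[:, :k], np.diag(s[:k])

def fold_in(doc_counts):
    """Map a (possibly monolingual) document's term counts into the LSI space."""
    return doc_counts @ T_k @ np.linalg.inv(S_k)

french_doc    = np.array([0, 0, 1, 1, 0, 0], dtype=float)   # "voiture moteur"
english_query = np.array([1, 1, 0, 0, 0, 0], dtype=float)   # "car engine"

f_vec, q_vec = fold_in(french_doc), fold_in(english_query)
sim = float(f_vec @ q_vec / (np.linalg.norm(f_vec) * np.linalg.norm(q_vec)))
print("cross-language similarity:", sim)   # high, despite zero shared terms
```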

5.6) Matching People Instead of Documents

In a couple of applications, LSI has been used to return the best matching people instead of documents. In these applications, people were represented by articles they had written. In one application [Furnas et al], known as the Bellcore Advisor, a system was developed to find local experts relevant to users' queries. A query was matched to the nearest documents and project descriptions, and the authors' organization was returned as the most relevant internal group. In another application, LSI was used to automate the assignment of reviewers to submitted conference papers. Several hundred reviewers were described by means of texts they had written, and this formed the basis of the LSI analysis. Hundreds of submitted papers were represented by their abstracts, and matched to the closest reviewers. These LSI similarities were used to assign papers to reviewers for a major human-computer interaction conference. Subsequent analyses suggested that these completely automatic assignments (which took less than 1 hour) were as good as those of human experts. [Berry et al.]

5.7) Noisy Input

Because LSI does not depend on literal keyword matching, it is especially useful when the text input is noisy, as with OCR (Optical Character Recognition), open input, or spelling errors. If there are scanning errors and a word (Dumais) is misspelled (as Duniais), many of the other words in the document will be spelled correctly. If these correctly spelled context words also occur in documents that contain a correctly spelled version of Dumais, then Dumais will probably be near Duniais in the k-dimensional space determined by Â. [Berry et al.]

5.8) Others

[Schutze] and [Gallant] have used SVD and related dimension reduction ideas for word sense disambiguation and information retrieval work. [Hull] and [Yang and Chute] have used LSI/SVD as the first step in conjunction with statistical classification (e.g. discriminant analysis). Using the LSI-derived dimensions effectively reduces the number of predictor variables for classification. [Wu et al.] also used LSI/SVD to reduce the training set dimension for a neural network protein classification system used in human genome research. [Berry et al.]

5.9) Open Computational/Statistical Issues

There are a number of computational/statistical improvements that would make LSI even more useful, especially for large collections:
• computing the truncated SVD of extremely large sparse matrices efficiently,
• performing SVD-updating in real time for databases that change frequently, and
• efficiently comparing queries to documents (i.e., finding near neighbors in high-dimensional spaces).

6) References

[Berry et al.] M. W. Berry, S. T. Dumais, and G. W. O'Brien, Using Linear Algebra for Intelligent Information Retrieval, 1994.

[Deerwester et al] Deerwester, Dumais, Furnas, Landauer, and Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41 (1990), pp. 391-407.

[Ding] Chris H. Q. Ding, A Similarity-Based Probability Model for Latent Semantic Indexing, Proc. of SIGIR’99, Berkeley, August 1999

[Dumais 94] S. T. Dumais, Latent Semantic Indexing (LSI) and TREC-2, in The Second Text REtrieval Conference (TREC-2), D. Harman, ed., National Institute of Standards and Technology Special Publication, March 1994, pp. 105-116.


[Furnas et al] G.W. Furnas et al., Information retrieval using a singular value decomposition model of latent semantic structure, in Proceedings of SIGIR, 1988.

[Gallant] I. Gallant, A practical approach for representing contexts and for performing word sense disambiguation using neural networks, Neural Computation, 3 (1991).

[Hull] D. Hull, Improving text retrieval for the routing problem using Latent Semantic Indexing, in Proceedings of the Seventeenth Annual International ACM-SIGIR Conference, 1994.

[Landauer and Littman] T. K. Landauer and M. L. Littman, Fully automatic cross-language document retrieval using latent semantic indexing, in Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, UW Centre for the New OED and Text Research, Waterloo, Ontario, 1990.

[Manning and Schutze] C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

[O'Brien] G. W. O'Brien, Information Management Tools for Updating an SVD-Encoded Indexing Scheme, Master's thesis, 1994.

[Papadimitriou et al.] C. H. Papadimitriou, P. Raghavan, H. Tamaki, S. Vempala, Latent semantic indexing: A probabilistic analysis, Proc. of Symposium on Principles of Database Systems (PODS), Seattle, Washington, June 1998. ACM Press

[Schutze] H. Schutze, Dimensions of meaning, in Proc. of Supercomputing ’92.

[Wu et al.] Wu et al. Neural Networks for full-scale protein sequence classification: Sequence encoding with SVD, Machine Learning 1994.

[Yang and Chute] Y. Yang and C. G. Chute, An application of least square fit mapping to text information retrieval, in Proceedings of ACM-SIGIR Conference, 1993.

[Young] P. G. Young, Cross-Language Information Retrieval Using Latent Semantic Indexing, Master's thesis, The University of Tennessee, Knoxville, TN, 1994.
