
A Generalized Model for Text Retrieval Based on Semantic Relatedness

George Tsatsaronis and Vicky Panagiotopoulou
Department of Informatics, Athens University of Economics and Business, 76 Patision Str., Athens, Greece
[email protected], [email protected]

Abstract

Generalized Vector Space Models (GVSM) extend the standard Vector Space Model (VSM) by embedding additional types of information, besides terms, in the representation of documents. An interesting type of information that can be used in such models is semantic information from word thesauri like WordNet. Previous attempts to construct GVSM reported contradicting results. The most challenging problem is to incorporate the semantic information in a theoretically sound and rigorous manner and to modify the standard interpretation of the VSM. In this paper we present a new GVSM model that exploits WordNet's semantic information. The model is based on a new measure of semantic relatedness between terms. Experimental study conducted in three TREC collections reveals that semantic information can boost text retrieval performance with the use of the proposed GVSM.

1 Introduction

The use of semantic information in text retrieval or text classification has been controversial. For example, in Mavroeidis et al. (2005) it was shown that a GVSM using WordNet (Fellbaum, 1998) senses and their hypernyms improves text classification performance, especially for small training sets. In contrast, Sanderson (1994) reported that even 90% accurate WSD cannot guarantee retrieval improvement, though his experimental methodology was based only on randomly generated pseudowords of varying sizes. Similarly, Voorhees (1993) reported a drop in retrieval performance when the retrieval model was based on WSD information. On the contrary, the construction of a sense-based retrieval model by Stokoe et al. (2003) improved performance, while several years before, Krovetz and Croft (1992) had already pointed out that resolving word senses can improve searches requiring high levels of recall.

In this work, we argue that the incorporation of semantic information into a GVSM retrieval model can improve performance by considering the semantic relatedness between the query and document terms. The proposed model extends the traditional VSM with term-to-term relatedness measured with the use of WordNet. The success of the method lies in three important factors, which also constitute the points of our contribution: 1) a new measure for computing semantic relatedness between terms which takes into account relation weights and senses' depth; 2) a new GVSM retrieval model, which incorporates the aforementioned semantic relatedness measure; 3) exploitation of all the semantic information a thesaurus can offer, including semantic relations crossing parts of speech (POS). Experimental evaluation in three TREC collections shows that the proposed model can improve in certain cases the performance of the standard TF-IDF VSM. The rest of the paper is organized as follows: Section 2 presents preliminary concepts regarding the VSM and GVSM. Section 3 presents the term semantic relatedness measure and the proposed GVSM. Section 4 analyzes the experimental results, and Section 5 concludes and gives pointers to future work.


2 Background

2.1 Vector Space Model

The VSM has been a standard model for representing documents in information retrieval for almost three decades (Salton and McGill, 1983; Baeza-Yates and Ribeiro-Neto, 1999). Let D be a document collection and Q the set of queries representing users' information needs. Let also t_i symbolize term i used to index the documents in the collection, with i = 1,..,n. The VSM assumes that for each term t_i there exists a vector \vec{t_i} in the vector space that represents it. It then considers the set of all term vectors {\vec{t_i}} to be the generating set of the vector space, thus the space basis. If each d_k (for k = 1,..,p) denotes a document of the collection, then there exists a linear combination of the term vectors {\vec{t_i}} which represents each d_k in the vector space. Similarly, any query q can be modelled as a vector \vec{q} that is a linear combination of the term vectors.

In the standard VSM, the term vectors are considered pairwise orthogonal, meaning that they are linearly independent. But this assumption is unrealistic, since it enforces lack of relatedness between any pair of terms, whereas the terms in a language often relate to each other. Provided that the orthogonality assumption holds, the similarity between a document vector \vec{d_k} and a query vector \vec{q} in the VSM can be expressed by the cosine measure given in equation 1:

\cos(\vec{d_k}, \vec{q}) = \frac{\sum_{j=1}^{n} a_{kj} q_j}{\sqrt{\sum_{i=1}^{n} a_{ki}^2} \sqrt{\sum_{j=1}^{n} q_j^2}}    (1)

where a_{kj}, q_j are real numbers standing for the weights of term j in the document d_k and the query q, respectively. A standard baseline retrieval strategy is to rank the documents according to their cosine similarity to the query.

2.2 Generalized Vector Space Model

Wong et al. (1987) presented an analysis of the problems that the pairwise orthogonality assumption of the VSM creates. They were the first to address these problems by expanding the VSM. They introduced term-to-term correlations, which deprecated the pairwise orthogonality assumption, but they kept the assumption that the term vectors are linearly independent¹, creating the first GVSM model. More specifically, they considered a new space, where each term vector \vec{t_i} was expressed as a linear combination of 2^n vectors \vec{m_r}, r = 1..2^n. The similarity measure between a document and a query then became as shown in equation 2, where \vec{t_i} and \vec{t_j} are now term vectors in a 2^n-dimensional vector space, \vec{d_k}, \vec{q} are the document and the query vectors, respectively, as before, a'_{ki}, q'_j are the new weights, and n' the new space dimensions:

\cos(\vec{d_k}, \vec{q}) = \frac{\sum_{j=1}^{n'} \sum_{i=1}^{n'} a'_{ki} q'_j \, \vec{t_i} \cdot \vec{t_j}}{\sqrt{\sum_{i=1}^{n'} a'^{2}_{ki}} \sqrt{\sum_{j=1}^{n'} q'^{2}_{j}}}    (2)

From equation 2 it follows that the term vectors \vec{t_i} and \vec{t_j} need not be known, as long as the correlations between terms t_i and t_j are known. If one assumes pairwise orthogonality, the similarity measure is reduced to that of equation 1.

¹It is known from Linear Algebra that if every pair of vectors in a set of vectors is orthogonal, then this set of vectors is linearly independent, but not the inverse.
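To make the role of the term correlations concrete, here is a minimal sketch of the similarity of equation 2, assuming the inner products of the term vectors have already been collected in a matrix R; the matrix, the toy weights, and all names are assumptions of this example, not part of the original model:

```python
import numpy as np

def gvsm_cosine(d, q, R):
    """Equation 2: d and q hold the term weights, and R[i, j] holds the
    correlation (inner product) of the term vectors t_i and t_j. With R
    equal to the identity matrix (pairwise orthogonality), this reduces
    to the plain cosine of equation 1."""
    num = d @ R @ q                        # sum_i sum_j a_ki * q_j * (t_i . t_j)
    den = np.sqrt(d @ d) * np.sqrt(q @ q)  # normalization of equation 2
    return float(num / den) if den > 0 else 0.0

# Toy example with three index terms; terms 0 and 1 are highly correlated.
R = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
d = np.array([0.0, 0.5, 0.2])  # document term weights (e.g., TF-IDF)
q = np.array([0.7, 0.0, 0.0])  # query shares no exact term with d
print(gvsm_cosine(d, q, np.eye(3)))  # 0.0 under the standard VSM
print(gvsm_cosine(d, q, R))          # > 0 once term correlations are used
```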

2.3 Semantic Information and GVSM

Since the introduction of the first GVSM model, there have been at least two basic directions for embedding term-to-term relatedness, other than exact keyword matching, into a retrieval model: (a) compute semantic correlations between terms, or (b) compute frequency co-occurrence statistics from large corpora. In this paper we focus on the first direction. In the past, the effect of WSD information on text retrieval was studied (Krovetz and Croft, 1992; Sanderson, 1994), with the results revealing that, under circumstances, sense information may improve IR. More specifically, Krovetz and Croft (1992) performed a series of three experiments in two document collections, CACM and TIMES. The results of their experiments showed that word senses provide a clear distinction between relevant and nonrelevant documents, rejecting the null hypothesis that the meaning of a word is not related to judgments of relevance. Also, they reached the conclusion that the words worth disambiguating are either the words with a uniform distribution of senses, or the words that in the query have a different sense from the most popular one. Sanderson (1994) studied the influence of disambiguation in IR with the use of pseudowords and concluded that sense ambiguity is problematic for IR only in the cases of retrieving from short queries. Furthermore, his findings regarding the WSD used were that such a WSD system would help IR if it could perform with very high accuracy, although his experiments were conducted in the Reuters collection, where standard queries with corresponding relevant documents (qrels) are not provided.

Since then, several recent approaches have incorporated semantic information in the VSM. Mavroeidis et al. (2005) created a GVSM kernel based on the use of noun senses and their hypernyms from WordNet. They experimentally showed that this can improve text categorization. Stokoe et al. (2003) reported an improvement in retrieval performance using a fully sense-based system. Our approach differs from the aforementioned ones in that it expands the VSM model using the semantic information of a word thesaurus to interpret the orthogonality of terms and to measure semantic relatedness, instead of directly replacing terms with senses, or adding senses to the model.

3 A GVSM Model Based on Semantic Relatedness of Terms

Synonymy (many words per sense) and polysemy (many senses per word) are two fundamental problems in text retrieval. Synonymy is related to recall, while polysemy to precision. One standard method to tackle synonymy is the expansion of the query terms with their synonyms. This increases recall, but it can reduce precision dramatically. Both polysemy and synonymy can be captured in the GVSM model in the computation of the inner product between \vec{t_i} and \vec{t_j} in equation 2, as will be explained below.

3.1 Semantic Relatedness

In our model, we measure semantic relatedness using WordNet. The measure considers the path length, captured by semantic compactness (SCM), and the path depth, captured by semantic path elaboration (SPE), which are defined in the following. The two measures are combined to form the semantic relatedness (SR) between two terms. SR, presented in definition 3, is the basic module of the proposed GVSM model. The adopted method of building semantic networks and measuring semantic relatedness from a word thesaurus is explained in the next subsection.

Definition 1 Given a word thesaurus O, a weighting scheme for the edges that assigns a weight e ∈ (0, 1) to each edge, a pair of senses S = (s_1, s_2), and a path of length l connecting the two senses, the semantic compactness of S (SCM(S,O)) is defined as \prod_{i=1}^{l} e_i, where e_1, e_2, ..., e_l are the path's edges. If s_1 = s_2, SCM(S,O) = 1. If there is no path between s_1 and s_2, SCM(S,O) = 0.

Note that compactness considers the path length and takes values in [0, 1]. Higher compactness between senses declares higher semantic relatedness, and larger weights are assigned to stronger edge types. The intuition behind the weighting of the edges is the fact that some edges provide stronger semantic connections than others. In the next subsection we propose a candidate method of computing the weights. The compactness of two senses s_1 and s_2 can take different values for all the different paths that connect the two senses. All these paths are examined, as explained later, and the path with the maximum weight is eventually selected (definition 3). Another parameter that affects term relatedness is the depth of the sense nodes comprising the path. A standard means of measuring depth in a word thesaurus is the hypernym/hyponym hierarchical relation for the noun and adjective POS and hypernym/troponym for the verb POS. A path with shallow sense nodes is more general compared to a path with deep nodes. This parameter of semantic relatedness between terms is captured by the measure of semantic path elaboration, introduced in the following definition.

Definition 2 Given a word thesaurus O and a pair of senses S = (s_1, s_2), where s_1, s_2 ∈ O and s_1 ≠ s_2, and a path between the two senses of length l, the semantic path elaboration of the path (SPE(S,O)) is defined as \prod_{i=1}^{l} \frac{2 d_i d_{i+1}}{d_i + d_{i+1}} \cdot \frac{1}{d_{max}}, where d_i is the depth of sense s_i according to O, and d_{max} the maximum depth of O. If s_1 = s_2 and d = d_1 = d_2, then SPE(S,O) = \frac{d}{d_{max}}. If there is no path from s_1 to s_2, SPE(S,O) = 0.

Essentially, SPE is the harmonic mean of the two depths normalized to the maximum thesaurus depth. The harmonic mean offers a lower upper bound than the average of depths, and we think it is a more realistic estimation of the path's depth. SCM and SPE capture the two most important parameters of measuring semantic relatedness between terms (Budanitsky and Hirst, 2006), namely path length and sense depth in the used thesaurus. We combine these two measures naturally towards defining the semantic relatedness between two terms.

Definition 3 Given a word thesaurus O, a pair of terms T = (t_1, t_2), and all pairs of senses S = (s_{1i}, s_{2j}), where s_{1i}, s_{2j} are senses of t_1, t_2 respectively, the semantic relatedness of T (SR(T,S,O)) is defined as max{SCM(S,O) · SPE(S,O)}. SR between two terms t_i, t_j, where t_i ≡ t_j ≡ t and t ∉ O, is defined as 1. If t_i ∈ O but t_j ∉ O, or t_i ∉ O but t_j ∈ O, SR is defined as 0.
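As an illustration of definitions 1-3, the following sketch computes the contribution of a single path under a literal reading of the formulas; the toy weights, depths, and every name in the code are assumptions of this example, not values taken from WordNet:

```python
from math import prod

D_MAX = 14  # maximum depth of WordNet 2.0, as used in the paper's example

def scm(edge_weights):
    """Definition 1: the product of the edge weights along the path."""
    return prod(edge_weights)

def spe(depths, d_max=D_MAX):
    """Definition 2: product, over consecutive senses on the path, of the
    harmonic mean of their depths, each normalized by the maximum depth."""
    return prod(2 * a * b / ((a + b) * d_max)
                for a, b in zip(depths, depths[1:]))

def path_relatedness(edge_weights, depths):
    """SCM * SPE for one path; definition 3 takes the maximum of this
    quantity over all paths and over all sense pairs of the two terms."""
    return scm(edge_weights) * spe(depths)

# A toy path of length 3 (four senses): edge weights e1..e3, depths d1..d4.
print(path_relatedness([0.5, 0.5, 0.5], [3, 4, 4, 4]))
```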

[Figure 1: Computation of semantic relatedness. Four panels show word nodes t_i, t_j, their sense nodes (S.i.1, S.i.2, ..., S.j.1, ...) and semantic links (synonym, hypernym, hyponym, meronym, holonym, antonym, domain): the initial phase and network expansion examples 1, 2 and 3.]

3.2 Semantic Networks from Word Thesauri

In order to construct a semantic network for a pair of terms t_1 and t_2 and a combination of their respective senses, i.e., s_1 and s_2, we adopted the network construction method that we introduced in (Tsatsaronis et al., 2007). This method was preferred over other related methods, like the one introduced in (Mihalcea et al., 2004), since it embeds all the available semantic information existing in WordNet, even edges that cross POS, thus offering a richer semantic representation. According to the adopted semantic network construction model, each semantic edge type is given a different weight. The intuition behind the weighting of edge types is that certain types provide stronger semantic connections than others. The frequency of occurrence of the different edge types in WordNet 2.0 is used to define the edge types' weights (e.g., 0.57 for hypernym/hyponym edges, 0.14 for nominalization edges, etc.).

Figure 1 shows the construction of a semantic network for two terms t_i and t_j. Let the highlighted senses S.i.2 and S.j.1 be a pair of senses of t_i and t_j respectively. All the semantic links of the highlighted senses, as found in WordNet, are added, as shown in example 1 of figure 1. The process is repeated recursively until at least one path between S.i.2 and S.j.1 is found. It might be the case that there is no path from S.i.2 to S.j.1. In that case, SR((t_i,t_j), (S.i.2,S.j.1), O) = 0. Suppose that a path is that of example 2, where e_1, e_2, e_3 are the respective edge weights, d_1 is the depth of S.i.2, d_2 the depth of S.i.2.1, d_3 the depth of S.i.2.2, d_4 the depth of S.j.1, and d_max the maximum thesaurus depth. For reasons of simplicity, let e_1 = e_2 = e_3 = 0.5, and d_1 = 3. Naturally, d_2 = 4, and let d_3 = d_4 = d_2 = 4. Finally, let d_max = 14, which is the case for WordNet 2.0. Then, SR((t_i,t_j), (S.i.2,S.j.1), O) = 0.5^3 · 0.4615 · 0.5^2 = 0.01442. Example 3 of figure 1 illustrates another possibility, where S.i.7 and S.j.5 is another examined pair of senses for t_i and t_j respectively. In this case, the two senses coincide, and SR((t_i,t_j), (S.i.7,S.j.5), O) = 1 · \frac{d}{14}, where d is the depth of the sense. When two senses coincide, SCM = 1, as mentioned in definition 1, so a secondary criterion must be levied to distinguish the relatedness of senses that match. This criterion in SR is SPE, which assumes that a sense is more specific as we traverse the WordNet graph downwards. In the specified example, SCM = 1, but SPE = \frac{d}{14}. This will give a final value to SR that is less than 1. This constitutes an intrinsic property of SR, which is expressed by SPE. The rationale behind the computation of SPE stems from the fact that word senses in WordNet are organized into synonym sets, named synsets. Moreover, synsets belong to hierarchies (i.e., noun hierarchies developed by the hypernym/hyponym relations). Thus, in case two words map into the same synset (i.e., their senses belong to the same synset), the computation of their semantic relatedness must additionally take into account the depth of that synset in WordNet.

3.3 Computing Maximum Semantic Relatedness

In the expansion of the VSM model we need to weigh the inner product between any two term vectors with their semantic relatedness. It is obvious that, given a word thesaurus, there can be more than one semantic path linking two senses. In these cases, we decide to use the path that maximizes the semantic relatedness (the product of SCM and SPE). This computation can be done according to the following algorithm, which is a modification of Dijkstra's algorithm for finding the shortest path between two nodes in a weighted directed graph. The proof of the algorithm's correctness follows with theorem 1.

Theorem 1 Given a word thesaurus O, a weighting function w : E → (0, 1), where a higher value declares a stronger edge, and a pair of senses S(s_s, s_f) declaring source (s_s) and destination (s_f) vertices, SCM(S,O) · SPE(S,O) is maximized for the path returned by Algorithm 1, by using the weighting scheme e_{ij} = w_{ij} · \frac{2 d_i d_j}{d_{max} (d_i + d_j)}, where e_{ij} is the new weight of the edge connecting senses s_i and s_j, and w_{ij} its initial weight.

Algorithm 1 MaxSR(G,u,v,w)
Require: A directed weighted graph G, two nodes u, v, and a weighting scheme w : E → (0..1).
Ensure: The path from u to v with the maximum product of the edges' weights.

Initialize-Single-Source(G,u)
1: for all vertices v ∈ V[G] do
2:   d[v] = −∞
3:   π[v] = NULL
4: end for
5: d[u] = 1

Relax(u,v,w)
6: if d[v] < d[u] · w(u, v) then
7:   d[v] = d[u] · w(u, v)
8:   π[v] = u
9: end if

The main body of MaxSR is that of Dijkstra's algorithm: starting with an empty set S of visited vertices, it repeatedly extracts the vertex of V − S with the maximum d value, inserts it into S, and relaxes all of its outgoing edges, until the destination is reached.

Proof The proof follows the correctness proof of Dijkstra's algorithm (Cormen et al., 1990). Let δ(s_s, s_v) denote the maximum product of edge weights over all paths from s_s to a vertex s_v. For the purpose of contradiction, let s_f be the first vertex inserted into S for which d[s_f] ≠ δ(s_s, s_f), and let s_y be the first vertex in V − S along a maximum-product path from s_s to s_f, so that d[s_y] = δ(s_s, s_y). Because s_y occurs before s_f on the path from s_s to s_f, and all edge weights are in (0, 1), we have δ(s_s, s_y) ≥ δ(s_s, s_f), and thus d[s_y] = δ(s_s, s_y) ≥ δ(s_s, s_f) ≥ d[s_f]. But both s_y and s_f were in V − S when s_f was chosen, so we have d[s_f] ≥ d[s_y]. Thus, d[s_y] = δ(s_s, s_y) = δ(s_s, s_f) = d[s_f]. Consequently, d[s_f] = δ(s_s, s_f), which contradicts our choice of s_f. We conclude that at the time each vertex s_f is inserted into S, d[s_f] = δ(s_s, s_f). Finally, since every edge carries the weight e_{ij} = w_{ij} · \frac{2 d_i d_j}{d_{max}(d_i + d_j)}, the maximum product returned for the destination equals \prod w_{ij} · \prod \frac{2 d_i d_j}{d_{max}(d_i + d_j)} = SCM(S,O) · SPE(S,O) over the maximizing path.
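For illustration, a compact executable version of this maximum-product search follows; the graph encoding, the lazy heap-based formulation, and all names are choices of this sketch, not of the paper, and it assumes the combined edge weights e_ij of theorem 1 are already computed:

```python
import heapq

def max_sr_path(graph, source, target):
    """Max-product variant of Dijkstra. graph[u] is a list of (v, e_uv)
    pairs with combined weights e_uv in (0, 1), as in theorem 1. Returns
    the maximum product of edge weights over all source-target paths,
    or 0.0 if the target is unreachable."""
    best = {source: 1.0}
    # Max-heap via negated products; since every weight is below 1, a
    # vertex's product is optimal the first time it is popped.
    heap = [(-1.0, source)]
    visited = set()
    while heap:
        neg_p, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == target:
            return -neg_p
        for v, e_uv in graph.get(u, []):
            p = -neg_p * e_uv  # relax step: is d[u] * w(u, v) > d[v]?
            if p > best.get(v, 0.0):
                best[v] = p
                heapq.heappush(heap, (-p, v))
    return 0.0

# Toy sense graph with precomputed combined weights e_ij:
g = {"s1": [("a", 0.12), ("b", 0.14)], "a": [("s2", 0.14)], "b": [("s2", 0.05)]}
print(max_sr_path(g, "s1", "s2"))  # 0.0168, via the path s1 -> a -> s2
```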

Since SR((t_i,t_j),O) = SR((t_j,t_i),O), the new space is of \frac{n(n-1)}{2} dimensions. In this space, each dimension stands for a distinct pair of terms. Given a document vector \vec{d_k} in the VSM TF-IDF space, we define the value in the (i, j) dimension of the new document vector space as d_k(t_i,t_j) = (TF-IDF(t_i,d_k) + TF-IDF(t_j,d_k)) · \vec{t_i} \cdot \vec{t_j}. We add the TF-IDF values because any product-based value results in zero unless both terms are present in the document. The dimensions q(t_i,t_j) of the query are computed similarly. A GVSM model aims at being able to retrieve documents that do not necessarily contain exact matches of the query terms, and this is its great advantage. This new space leads to a new GVSM model, which is a natural extension of the standard VSM. The cosine similarity between a document d_k and a query q now becomes:

\cos(\vec{d_k}, \vec{q}) = \frac{\sum_{i=1}^{n} \sum_{j=i}^{n} d_k(t_i,t_j) \cdot q(t_i,t_j)}{\sqrt{\sum_{i=1}^{n} \sum_{j=i}^{n} d_k(t_i,t_j)^2} \cdot \sqrt{\sum_{i=1}^{n} \sum_{j=i}^{n} q(t_i,t_j)^2}}    (4)

where n is the dimension of the VSM TF-IDF space.
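To make the new representation concrete, here is a self-contained sketch of equation 4; the sr matrix stands in for the precomputed SR values and, like every name here, is an assumption of this example:

```python
import numpy as np

def pair_vector(tfidf, sr):
    """Map an n-dimensional TF-IDF vector to the pairwise dimensions
    (i <= j, as in the sums of equation 4):
    value(i, j) = (tfidf[i] + tfidf[j]) * SR(t_i, t_j)."""
    n = len(tfidf)
    return np.array([(tfidf[i] + tfidf[j]) * sr[i][j]
                     for i in range(n) for j in range(i, n)])

def gvsm_similarity(doc_tfidf, query_tfidf, sr):
    """Cosine of equation 4 between the pairwise representations."""
    d, q = pair_vector(doc_tfidf, sr), pair_vector(query_tfidf, sr)
    den = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / den) if den > 0 else 0.0

# Toy case: the query uses term 0; the document contains only the
# semantically related term 1, so the plain VSM cosine would be zero.
sr = np.array([[1.0, 0.6, 0.0],
               [0.6, 1.0, 0.0],
               [0.0, 0.0, 1.0]])  # stand-in for precomputed SR values
doc = np.array([0.0, 0.8, 0.3])
query = np.array([0.9, 0.0, 0.0])
print(gvsm_similarity(doc, query, sr))  # > 0 despite no exact term overlap
```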
4 Experimental Evaluation

The experimental evaluation in this work is twofold. First, we test the performance of the semantic relatedness measure (SR) for a pair of words on three benchmark data sets, namely the Rubenstein and Goodenough 65 word pairs (Rubenstein and Goodenough, 1965) (R&G), the Miller and Charles 30 word pairs (Miller and Charles, 1991) (M&C), and the 353 similarity data set (Finkelstein et al., 2002). Second, we evaluate the performance of the proposed GVSM in three TREC collections (TREC 1, 4 and 6).

4.1 Evaluation of the Semantic Relatedness Measure

For the evaluation of the proposed semantic relatedness measure between two terms, we experimented on three widely used data sets in which human subjects have provided scores of relatedness for each pair. A kind of "gold standard" ranking of related word pairs (i.e., from the most related words to the most irrelevant) has thus been created, against which computer programs can test their ability at measuring semantic relatedness between words. We compared our measure against ten known measures of semantic relatedness: (HS) Hirst and St-Onge (1998), (JC) Jiang and Conrath (1997), (LC) Leacock et al. (1998), (L) Lin (1998), (R) Resnik (1995), (JS) Jarmasz and Szpakowicz (2003), (GM) Gabrilovich and Markovitch (2007), (F) Finkelstein et al. (2002), (HR), and (SP) Strube and Ponzetto (2006). In Table 1 the results of SR and the ten compared measures are shown. The reported numbers are the Spearman correlation of the measures' rankings with the gold standard (human judgements).

The correlations for the three data sets show that SR performs better than any other measure of semantic relatedness, besides the case of (HR) in the M&C data set. It surpasses HR, though, in the R&G and the 353-C data sets. The latter contains the word pairs of the M&C data set. To visualize the performance of our measure in a more comprehensible manner, Figure 2 presents, for all pairs in the R&G data set, in increasing order of relatedness values based on human judgements, the respective values that SR produces for these pairs. A closer look at Figure 2 reveals that the values produced by SR (right chart) follow a pattern similar to that of the human ratings (left chart). Note that the x-axis in both charts begins from the least related pair of terms, according to humans, and goes up to the most related pair of terms. The y-axis in the left chart is the respective humans' rating for each pair of terms; the right chart shows SR for each pair. The reader can consult Budanitsky and Hirst (2006) to confirm that none of the other measures of semantic relatedness we compare to follows the pattern of the human ratings as closely as our measure does (low y values for small x values and high y values for high x). The same pattern applies in the M&C and 353-C data sets.

4.2 Evaluation of the GVSM

For the evaluation of the proposed GVSM model, we have experimented with three TREC collections³, namely TREC 1 (TIPSTER disks 1 and 2), TREC 4 (TIPSTER disks 2 and 3) and TREC 6 (TIPSTER disks 4 and 5). We selected those TREC collections in order to cover as many different thematic subjects as possible. For example, TREC 1 contains documents from the Wall Street Journal, Associated Press, the Federal Register, and abstracts of the U.S. Department of Energy. TREC 6 differs from TREC 1, since it has documents from the Financial Times, the Los Angeles Times and the Foreign Broadcast Information Service.

³http://trec.nist.gov/

         HS     JC     LC     L      R      JS     GM     F      HR     SP     SR
R&G      0.745  0.709  0.785  0.77   0.748  0.842  0.816  N/A    0.817  0.56   0.861
M&C      0.653  0.805  0.748  0.767  0.737  0.832  0.723  N/A    0.904  0.49   0.855
353-C    N/A    N/A    0.34   N/A    0.35   0.55   0.75   0.56   0.552  0.48   0.61

Table 1: Correlations of semantic relatedness measures with human judgements.

[Figure 2, two panels over the R&G pairs in increasing order of human ranking: "Human ratings against human rankings" (y: human rating, x: pair number) and "Semantic relatedness against human rankings" (y: semantic relatedness, x: pair number).]

Figure 2: Correlation between human ratings and SR in the R&G data set.

For each TREC, we executed the standard baseline TF-IDF VSM model for the first 20 topics of each collection. Limited resources prohibited us from executing experiments on the top 1000 documents. To minimize the execution time, we have indexed all the pairwise semantic relatedness values according to the SR measure in a database, whose size reached 300GB. Thus, the execution of SR itself is really fast, as all pairwise SR values between WordNet synsets are indexed. For TREC 1 we used topics 51−70, for TREC 4 topics 201−220, and for TREC 6 topics 301−320. From the results of the VSM model, we kept the top-50 retrieved documents. In order to evaluate whether the proposed GVSM can aid the VSM performance, we executed the GVSM on the same retrieved documents. The interpolated precision-recall values at the 11 standard recall points for these executions are shown in figure 3 (left graphs), for both VSM and GVSM. In the right graphs of figure 3, the differences in interpolated precision for the same recall levels are depicted. For reasons of simplicity, we have excluded from the right graphs the recall values above which both systems had zero precision. Thus, for TREC 1 the y-axis depicts the difference in the interpolated precision values (%) of the GVSM from the VSM for the first 4 recall points. For TRECs 4 and 6 we have done the same for the first 9 and 8 recall points respectively.

As shown in figure 3, the proposed GVSM may improve the performance of the TF-IDF VSM by up to 1.93% in TREC 4, 0.99% in TREC 6 and 0.42% in TREC 1. This small boost in performance shows that the proposed GVSM model is promising. There are many aspects of the GVSM, though, that we think require further investigation, like the fact that we have not conducted WSD so as to map each document and query term occurrence to its correct sense, or the fact that the weighting scheme of the edges used in SR derives from the distribution of each edge type in WordNet, while there might be other, more sophisticated ways to compute edge weights. We believe that if these, but also more aspects discussed in the next section, are tackled, the proposed GVSM may improve retrieval performance further.
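For reference, interpolated precision at the 11 standard recall points is the maximum precision observed at any recall level at or above each point; a small sketch of this standard computation (the function name and the toy run are illustrative, not from the paper):

```python
def interpolated_precision(points, recall_levels=None):
    """points: list of (recall, precision) pairs from a ranked run.
    Returns interpolated precision at the 11 standard recall points:
    P_interp(r) = max{ precision at recall >= r }."""
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in recall_levels]

# Toy ranked-run measurements (recall, precision):
run = [(0.1, 0.75), (0.2, 0.60), (0.4, 0.45), (0.6, 0.30), (0.8, 0.15)]
print(interpolated_precision(run))
```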

[Figure 3, six panels: for each of TREC 1, TREC 4 and TREC 6, precision-recall curves of the TF-IDF VSM and the GVSM (y: precision values (%), x: recall values (%)) on the left, and the GVSM precision difference (%) from the VSM at the same recall levels on the right.]

Figure 3: Differences (%) from the baseline in interpolated precision.

5 Future Work

From the experimental evaluation we infer that SR performs very well, and in fact better than all the tested related measures. With regard to the GVSM model, the experimental evaluation in three TREC collections has shown that the model is promising and may boost retrieval performance further if several details are investigated and enhancements are made. Primarily, the computation of the maximum semantic relatedness between two terms includes the selection of the semantic path between two senses that maximizes SR. This can be partially unrealistic, since we are not sure whether these senses are the correct senses of the terms. To tackle this issue, WSD techniques may be used. In addition, the role of phrase detection is yet to be explored and added into the model. Since we are using a large knowledge base (WordNet), we can add a simple method to look up term occurrences in a specified window and check whether they form a phrase. This would also decrease the ambiguity of the respective text fragment, since in WordNet a phrase is usually monosemous.

Moreover, there are additional aspects that deserve further research. In previously proposed GVSM, like the ones proposed by Voorhees (1993) or by Mavroeidis et al. (2005), it is suggested that semantic information can create an individual space, leading to a dual representation of each document: a vector with the document's terms and another with semantic information. Rationally, the proposed GVSM could act complementarily to the standard VSM representation. Thus, the similarity between a query and a document may be computed by weighting the similarity in the term space and in the sense space. Finally, we should also examine the perspective of applying the proposed measure of semantic relatedness in a query expansion technique, similarly to the work of Fang (2008).

6 Conclusions

In this paper we presented a new measure of semantic relatedness and expanded the standard VSM to embed the semantic relatedness between pairs of terms into a new GVSM model. The semantic relatedness measure takes into account all of the semantic links offered by WordNet. It considers WordNet as a graph, weighs edges depending on their type and depth, and computes the maximum relatedness between any two nodes connected via one or more paths. The comparison to well-known measures gives promising results. The application of our measure in the suggested GVSM demonstrates slightly improved performance in information retrieval tasks. In our next plans, we will study the influence of WSD performance on the proposed model. Furthermore, a comparative analysis between the proposed GVSM and other semantic-network-based models will also shed light on the conditions under which embedding semantic information improves text retrieval.

References

R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison Wesley.

S. Brody, R. Navigli, and M. Lapata. 2006. Ensemble methods for unsupervised WSD. In Proc. of COLING/ACL 2006, pages 97–104.

A. Budanitsky and G. Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

T.H. Cormen, C.E. Leiserson, and R.L. Rivest. 1990. Introduction to Algorithms. The MIT Press.

H. Fang. 2008. A re-examination of query expansion using lexical resources. In Proc. of ACL 2008, pages 139–147.

C. Fellbaum. 1998. WordNet – An Electronic Lexical Database. MIT Press.

L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. 2002. Placing search in context: The concept revisited. ACM TOIS, 20(1):116–131.

E. Gabrilovich and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of the 20th IJCAI, pages 1606–1611, Hyderabad, India.

G. Hirst and D. St-Onge. 1998. Lexical chains as representations of context for the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database, chapter 13, pages 305–332, Cambridge. The MIT Press.

M. Jarmasz and S. Szpakowicz. 2003. Roget's thesaurus and semantic similarity. In Proc. of the Conference on Recent Advances in Natural Language Processing, pages 212–219.

J.J. Jiang and D.W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of ROCLING X, pages 19–33.

R. Krovetz and W.B. Croft. 1992. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2):115–141.

C. Leacock, G. Miller, and M. Chodorow. 1998. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147–165, March.

D. Lin. 1998. An information-theoretic definition of similarity. In Proc. of the 15th International Conference on Machine Learning, pages 296–304.

D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, M. Theobald, and G. Weikum. 2005. Word sense disambiguation for exploiting hierarchical thesauri in text classification. In Proc. of the 9th PKDD, pages 181–192.

D. McCarthy, R. Koeling, J. Weeds, and J. Carroll. 2004. Finding predominant word senses in untagged text. In Proc. of the 42nd ACL, pages 280–287, Spain.

R. Mihalcea and A. Csomai. 2005. SenseLearner: Word sense disambiguation for all words in unrestricted text. In Proc. of the 43rd ACL, pages 53–56.

R. Mihalcea, P. Tarau, and E. Figa. 2004. PageRank on semantic networks with application to word sense disambiguation. In Proc. of the 20th COLING.

G.A. Miller and W.G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.

A. Montoyo, A. Suarez, G. Rigau, and M. Palomar. 2005. Combining knowledge- and corpus-based word-sense-disambiguation methods. Journal of Artificial Intelligence Research, 23:299–330, March.

P. Resnik. 1995. Using information content to evaluate semantic similarity. In Proc. of the 14th IJCAI, pages 448–453, Canada.

H. Rubenstein and J.B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633.

G. Salton and M.J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.

M. Sanderson. 1994. Word sense disambiguation and information retrieval. In Proc. of the 17th SIGIR, pages 142–151, Ireland. ACM.

R. Sinha and R. Mihalcea. 2007. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proc. of the IEEE International Conference on Semantic Computing.

C. Stokoe, M.P. Oakes, and J. Tait. 2003. Word sense disambiguation in information retrieval revisited. In Proc. of the 26th SIGIR, pages 159–166.

M. Strube and S.P. Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proc. of the 21st AAAI.

G. Tsatsaronis, M. Vazirgiannis, and I. Androutsopoulos. 2007. Word sense disambiguation with spreading activation networks generated from thesauri. In Proc. of the 20th IJCAI, pages 1725–1730.

E. Voorhees. 1993. Using WordNet to disambiguate word sense for text retrieval. In Proc. of the 16th SIGIR, pages 171–180. ACM.

S.K.M. Wong, W. Ziarko, V.V. Raghavan, and P.C.N. Wong. 1987. On modeling of information retrieval concepts in vector spaces. ACM Transactions on Database Systems, 12(2):299–321.
