Joining Automatic Query Expansion Based on Thesaurus and Word Sense Disambiguation Using Wordnet
Int. J. Computer Applications in Technology, Vol. 33, No. 4, 2008

Francisco João Pinto* and Antonio Fariña Martínez
Department of Computer Science, University of A Coruña, Campus de Elviña s/n, A Coruña, 15071, Spain
*Corresponding author

Carme Fernández Pérez-Sanjulián
Department of Galician-Portuguese, French, and Linguistics, University of A Coruña, Campus da Zapateira s/n, A Coruña, 15071, Spain

Abstract: The selection of the most appropriate sense of an ambiguous word in a certain context is one of the main problems in Information Retrieval (IR). This task usually requires a semantic source, that is, a linguistic resource such as a dictionary or a thesaurus. Using a methodology based on simulation under a vector space model, we show that combining automatic query expansion with disambiguation of word senses improves retrieval effectiveness. As shown in our experiments, query expansion is not able by itself to improve retrieval. However, when it is combined with Word Sense Disambiguation (WSD), that is, when the correct meaning of a word is chosen from among all its possible variations, it leads to effectiveness improvements.

Keywords: automatic query expansion; thesaurus; disambiguation; WordNet.

Reference to this paper should be made as follows: Pinto, F.J., Martínez, A.F. and Pérez-Sanjulián, C.F. (2008) ‘Joining automatic query expansion based on thesaurus and word sense disambiguation using WordNet’, Int. J. Computer Applications in Technology, Vol. 33, No. 4, pp.271–279.

Biographical notes: Francisco João Pinto received his MSc Degree in Mathematics/Computer Science from the Instituto Superior José Varona (Cuba) in 1992.
With the support of the Government of Angola he moved to Spain, and in 2008 he obtained his PhD in Computer Science from the University of A Coruña. His research interests focus on information retrieval, and more precisely on expanding natural language queries by means of word disambiguation, which is of utmost importance for integrating typical web search engines into automated processes.

Antonio Fariña Martínez earned his PhD in Computer Science in 2005 from the University of A Coruña, where he is currently an Associate Professor in the Computer Science Department. His research is mainly focused on text retrieval on natural language text collections. His main contributions to this field are related to text compression, as well as to compressed indexing structures. Other areas of interest are algorithms, searches in metric spaces, and geographic information systems. He has been a member of the Program Committee of several conferences (MDMM, ICEIS, JIDEE). He has also been an external reviewer for various journals (Computer Journal, IJCSSE, JRPIT, …) and international conferences (ESA, CIKM, PSI, VODCA, MICAI).

Carme Fernández Pérez-Sanjulián obtained her PhD in Hispanic Philology (Galician-Portuguese Section) from the University of A Coruña in 2001. She is an Associate Professor at the University of A Coruña. Apart from her interest in topics related to Galician (and Portuguese) Literature, she has been an active researcher in different works related to the use of new technologies for creating and spreading cultural resources, especially the creation of Digital Libraries. In this area she has participated in several research and development projects, such as the creation of the Galician Virtual Library.

Copyright © 2008 Inderscience Enterprises Ltd.
1 Introduction

In the last decades, the amount of data available in digital form has been growing exponentially. In this scenario, managing such a large amount of data requires not only dealing with the problems that arise from the large demand for storage media, but also developing structures that organise the data in such a way that the information contained in an information system can be retrieved effectively. Information Retrieval (IR) covers the representation, storage and organisation of, and access to, the information in such systems. Basically, the representation and organisation of the data must provide easy access to the pieces of information in which a user may be interested (Baeza-Yates and Ribero, 1999). On the one hand, a user who aims at recovering some information from an information system has to be able to translate their information needs into a query that can be processed by a search engine. This query is usually composed of several keywords that cover the user's needs. On the other hand, the main objective of an IR system is to recover the information that may be of interest or relevant given a user query.

It is important to notice the differences between the two concepts of data retrieval and information retrieval. Data retrieval consists basically of identifying, from among a collection of documents, those that contain the keywords requested by a user. However, recovering those documents is not enough to satisfy the information needs of the user. In fact, a user who introduces some keywords is usually interested in retrieving data regarding some subjects, not just in retrieving documents that include a set of keywords. Therefore, given a query, an information retrieval system should try to recover documents containing not only the few terms given by the user, but also an expanded set of terms related to them. Even though there are information retrieval systems that keep well-structured data with well-defined semantics (for example, a relational database), an information retrieval system usually has to deal with unstructured and ambiguous data. This is the case, for example, of an information retrieval system that deals with natural language text. The former systems can usually answer queries very accurately. In the latter case, retrieval can become less precise, and some of the retrieved documents might be useless.

An information retrieval system must, somehow, be able to identify the information stored in any document. This is a key property for estimating the relevance of any document with respect to users' queries and, consequently, for ranking the retrieved documents according to that value. Therefore, an information retrieval system not only has to organise information in such a way that retrieval can be performed efficiently, but it also has to be able to recover all the relevant documents for a query. At the same time, the retrieval system should retrieve as few non-relevant documents as possible, so that it obtains high precision values.

In Section 2, we introduce both query expansion and Word Sense Disambiguation (WSD), and discuss some previous works that make use of linguistic information to expand queries. In Section 3, the vector model used is presented. We show how the similarity between a query and the documents in a collection can be measured. This measurement is used to determine the ordering of the relevant documents. WordNet is described briefly in Section 4. In order to perform experiments, an application that permits the formulation of queries was created. It is shown in Section 5, where we also describe how those experiments are run. Section 6 presents the experimental results obtained. Finally, the conclusions of our work are discussed in Section 7.

2 Using WordNet in query expansion and word sense disambiguation

Users of a retrieval system that takes the coincidence of words as the basis for recovering a document face the challenge of expressing their queries with words that are included in the vocabularies of the documents that they are interested in. This is a problematic issue in large text databases because they usually contain many different expressions that refer to the same concept. Nevertheless, the ability to recover documents from these large databases is crucial in many different scenarios: documentation stored in legal databases that are used in a trial, corporate databases, news clipping and the recovery of articles from a digital newspaper, finding relevant passages for a particular problem within a complete set of manuals of a complex system, etc. One way to help a user who requests some information through the formulation of a query is for the retrieval system to expand the query automatically. That is, some terms related to the words provided by the user are also searched for during the retrieval process. These new terms can be statistically related to the original words of the query (for example, by focusing on terms that usually occur in the same documents), or can be chosen with the aid of linguistic resources.

The use of statistical information based on the co-occurrence of terms to expand queries is attractive because the relationships between terms can be obtained from the documents in the collection. Therefore, the need for linguistic resources or other tools that can be difficult to maintain can be avoided. Unfortunately, the success of co-occurrence-based methods as a way to improve retrieval effectiveness has been rather poor when no extra relevance data is available. Other techniques, such as Latent Semantic Indexing (Deerwester et al., 1990), that are based on the use of statistical relations between terms but do not actually expand queries have been shown to obtain good results.

The use of linguistic information as the source of related terms has been applied successfully in some small experiments. Salton and Lesk (1971) found that synonym-based query expansion improves the results obtained. However, when dealing with large collections, the query expansion performed by choosing either more general terms or more specific terms extracted from

In addition, the expansion procedure permitted using different parameters to facilitate the comparison of the effectiveness of such schemes.