Creating and Using Semantics for Information Retrieval and Filtering
State of the Art and Future Research

Several experiments have been carried out over the last 15 years investigating the use of various resources and techniques (e.g., thesauri, synonyms, word sense disambiguation) to help refine or enhance queries. However, the conclusions drawn on the basis of these experiments vary widely. Some studies have concluded that semantic information serves no purpose and even degrades results, while others have concluded that semantic information drawn from external resources significantly increases the performance of retrieval software. At this point, several questions arise:

• Why do these conclusions vary so widely?
• Is the divergence a result of differences in methodology?
• Is the divergence a result of differences in resources? What are the most suitable resources?
• Do results using manually constructed resources differ in significant ways from results using automatically extracted information?
• From corpus building to terminology structuring, which methodological requirements must resource acquisition meet in order to be relevant to a given application?
• What is the contribution of specialized resources?
• Are present evaluation frameworks (e.g., TREC) appropriate for evaluating results?

These questions are fundamental not only to research in document retrieval, but also to information searching, question answering, filtering, etc. Their importance is even more acute for multilingual applications, where, for instance, the question of whether to disambiguate before translating is fundamental. Moreover, the increasing diversity of monolingual as well as multilingual documents on the Web invites us to focus attention on lexical variability in connection with textual genre, and to question the assumption that resources are reusable.
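To make concrete what "enhancing queries with semantic information" can mean in the simplest case, here is a minimal sketch of synonym-based query expansion. The synonym table and function name are hypothetical illustrations, not drawn from any system discussed in the workshop; a real system might consult WordNet or a domain thesaurus instead.

```python
# Hypothetical toy synonym table; a production system would draw on a
# resource such as WordNet or a domain-specific thesaurus (an assumption,
# not part of the workshop text).
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fix": ["repair", "mend"],
}

def expand_query(query: str, synonyms: dict) -> list:
    """Return the query terms plus any listed synonyms, deduplicated,
    preserving first-seen order."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *synonyms.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

# expand_query("car fix", SYNONYMS)
# -> ["car", "automobile", "vehicle", "fix", "repair", "mend"]
```

Whether such blind expansion helps or hurts is precisely the point of contention: without disambiguation, synonyms of the wrong sense can add noise, which is one plausible source of the divergent experimental conclusions above.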
The goal of this workshop is to bring together researchers in the domain of document retrieval, and in particular researchers on both sides of the question of the utility of enhancing queries with semantic information gleaned from language resources and processes. The workshop will provide a forum for presentation of the different points of view, followed by a roundtable in which the participants will assess the state of the art, consider the results of past and ongoing work, and examine the possible reasons for the considerable differences in their conclusions. Ultimately, they will attempt to identify future directions for research.

Workshop Organisers

Christian Fluhr, CEA, France
Nancy Ide, Vassar College, USA
Claude de Loupy, Sinequa, France
Adeline Nazarenko, LIPN, Université de Paris-Nord, France
Monique Slodzian, CRIM, INALCO, France

Workshop Programme Committee

Roberto Basili, Univ. Roma, Italy
Olivier Bodenreider, National Library of Medicine, USA
Tony Bryant, University of Leeds, United Kingdom
Theresa Cabré, IULA-UPF, Spain
Phil Edmonds, Sharp Laboratories of Europe LTD, United Kingdom
Marc El-Bèze, Université d'Avignon et des Pays de Vaucluse, France
Julio Gonzalo, Universidad Nacional de Educación a Distancia, Spain
Natalia Grabar, AP-HP & INaLCO, France
Thierry Hamon, LIPN, Université de Paris-Nord, France
Graeme Hirst, University of Toronto, Canada
John Humbley, Université de Paris 7, France
Adam Kilgarriff, University of Brighton, United Kingdom
Marie-Claude L'Homme, Université de Montréal, Canada
Christian Marest, Mediapps, France
Patrick Paroubek, LIMSI, France
Piek Vossen, Irion Technologies, The Netherlands
Pierre Zweigenbaum, AP-HP, Université de Paris 6, France

Table of Contents

- Brants T., Stolle R.; Finding Similar Documents in Document Collections
- Liu H., Lieberman H.; Robust Photo Retrieval Using World Semantics
- Loupy C. de, El-Bèze M.; Managing Synonymy and Polysemy in a Document Retrieval System Using WordNet
- Mihalcea R. F.; The Semantic Wildcard
- Sadat F., Maeda A., Yoshikawa M., Uemura S.; Statistical Query Disambiguation, Translation and Expansion in Cross-Language Information Retrieval
- Jacquemin B., Brunand C., Roux C.; Semantic Enrichment for Information Extraction Using Word Sense Disambiguation
- Ramakrishnan G., Bhattacharyya P.; Word Sense Disambiguation Using Semantic Sets Based on WordNet
- Kaji H., Morimoto Y.; Towards Sense-Disambiguated Association Thesauri
- Balvet A.; Designing Text Filtering Rules: Interaction between General and Specific Lexical
- Grabar N., Zweigenbaum P.; Lexically-Based Terminology Structuring: A Feasibility Study
- Chalendar G. de, Grau B.; Query Expansion by a Contextual Use of Classes of Nouns
- Krovetz R.; On the Importance of Word Sense Disambiguation for Information Retrieval

Finding Similar Documents in Document Collections

Thorsten Brants and Reinhard Stolle
Palo Alto Research Center (PARC)
3333 Coyote Hill Rd, Palo Alto, CA 94304, USA
{brants,stolle}@parc.com

Abstract

Finding similar documents in natural language document collections is a difficult task that requires general and domain-specific world knowledge, deep analysis of the documents, and inference. However, a large portion of the pairs of similar documents can be identified by simpler, purely word-based methods. We show the use of Probabilistic Latent Semantic Analysis for finding similar documents. We evaluate our system on a collection of photocopier repair tips. Among the 100 top-ranked pairs, 88 are true positives. A manual analysis of the 12 false positives suggests the use of more semantic information in the retrieval model.

1. Introduction

Collections of natural language documents that are focused on a particular subject domain are commonly used by communities of practice in order to capture and share knowledge. Examples of such "focused document collections" are FAQs, bug-report repositories, and lessons-learned systems. As such systems become larger and larger, their authors, users and maintainers increasingly need tools to perform their tasks, such as browsing, searching, manipulating, analyzing and managing the collection. In particular, the document collections become unwieldy and ultimately unusable if obsolete and redundant content is not continually identified and removed.

We are working with such a knowledge-sharing system, focused on the repair of photocopiers. It now contains about 40,000 technician-authored free text documents, in the form of tips on issues not covered in the official manuals. Such systems usually support a number of tasks that help maintain the utility and quality of the document collection. Simple tools, such as keyword search, for example, can be extremely useful. Eventually, however, we would like to provide a suite of tools that support a variety of tasks, ranging from simple keyword search to more elaborate tasks such as the identification of "duplicates." Fig. 1 shows a pair of similar tips from our corpus. These two tips are about the same problem, and they give a similar analysis as to why the problem occurs. However, they suggest different solutions: Tip 118 is the "official" solution, whereas Tip 57 suggests a short-term "work-around" fix to the problem. This example illustrates that "similarity" is a complicated notion that cannot always be measured along a one-dimensional scale. Whether two or more documents should be considered "redundant" critically depends on the task at hand. In the example of Fig. 1, the work-around tip may seem redundant and obsolete to a technician who has the official new safety cable available. In the absence of this official part, however, the work-around tip [...]

We believe that, in general, the automated or computer-assisted management of collections of natural language documents requires a fine-grained analysis and representation of the documents' contents. This fine granularity in turn mandates deep linguistic processing of the text and inference capabilities using extensive linguistic and world knowledge. Following this approach, our larger research group has implemented a prototype, which we will briefly describe in the next section. This research prototype system is far from complete. Meanwhile, we are investigating to what extent currently operational techniques are useful to support at least some of the tasks that arise from the maintenance of focused document collections. We have investigated the utility of Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999b) for the task of finding similar documents. Section 3. describes our PLSA model and Section 4. reports on our experimental results in the context of our corpus of repair tips. In that section, we also attempt to characterize the types of similarities that are easily detected and contrast them to the types that are easily missed by the PLSA technique. Finally, we speculate how symbolic knowledge representation and inference techniques that rely on a deep linguistic analysis of the documents may be coupled with statistical techniques in order to improve the results.

2. Knowledge-Based Approach

Our goal is to build a system that supports a wide range [...] representations of document contents that will allow us to assess not only whether two documents are about the same subject but also whether two documents actually say the same thing. We are currently focusing on the task of computer-assisted redundancy resolution. We hope that our techniques will eventually extend to support even more ambitious tasks such as the identification and resolution of inconsistent knowledge, knowledge fusion, question answering, and trend analysis.