Creating and Using Semantics for Information Retrieval and Filtering

Total Page:16

File Type:pdf, Size:1020Kb

Creating and Using Semantics for Information Retrieval and Filtering Creating and Using Semantics for Information Retrieval and Filtering State of the Art and Future Research Several experiments have been carried out in the last 15 years investigating the use of various resources and techniques (e.g., thesauri, synonyms, word sense disambiguation, etc.) to help refine or enhance queries. However, the conclusions drawn on the basis of these experiments vary widely. Results of some studies have led to the conclusion that semantic information serves no purpose and even degrades results, while others have concluded that the use of semantic information drawn from external resources significantly increases the performance of retrieval software. At this point, several question arise: • Why do these conclusions vary so widely? • Is the divergence a result of differences in methodology? • Is the divergence a result of a difference in resources? What are the most suitable resources? • Do results using manually constructed resources differ in significant ways from results using automatically extracted information? • From corpus building to terminology structuring, to which methodological requirements resources acquisition has to comply with in order to be relevant to a given application? • What is the contribution of specialized resources? • Are present frameworks for evaluation (e.g., TREC) appropriate • for evaluation of results?. These questions are fundamental not only to research in document retrieval, but also for information searching, question answering, filtering, etc. Their importance is even more acute for multilingual applications, where, for instance, the question of whether to disambiguate before translating is fundamental. Moreover, the increasing diversity of monolingual as well as multilingual documents on the Web invite to focus attention on lexical variability in connection with textual genre and with questioning the resources reusability stance. The goal of this workshop is to bring together researchers in the domain of document retrieval, and in particular, researchers on both sides of the question of the utility of enhancing queries with semantic information gleaned from languages resources and processes. The workshop will provide a forum for presentation of the different points of view, followed by a roundtable in which the participants will assess the state of the art, consider the results of past and on-going work and the possible reasons for the considerable differences in their conclusions. Ultimately, they will attempt to identify future directions for research. Workshop Organisers Christian Fluhr, CEA, France Nancy Ide, Vassar College, USA Claude de Loupy, Sinequa, France Adeline Nazarenko, LIPN, Université de Paris-Nord, France Monique Slodzian, CRIM, INALCO, France Workshop Programme Committee Roberto Basili, Univ. Roma, Italy Olivier Bodenreider, National Library of Medicine, USA Tony Bryant, University of Leeds, United Kingdom Theresa Cabré, IULA-UPF, Spain Phil Edmonds, Sharp Laboratories of Europe LTD, United Kingdom Marc El-Bèze, Université d'Avignon et des Pays de Vaucluse, France Julio Gonzalo, Universidad Nacional de Educación a Distancia, Spain Natalia Grabar, AP-HP & INaLCO, France Thierry Hamon, LIPN, Université de Paris-Nord, France Graeme Hirst, University of Toronto, Canada John Humbley, Univiversité de Paris 7, France Adam Kilgarriff, University of Brighton, United Kingdom Marie-Claude L'Homme, Université de Montréal, Canada Christian Marest, Mediapps, France Patrick Paroubek, LIMSI, France Piek Vossen, Irion Technologies, The Netherlands Pierre Zweigenbaum, AP-HP, Université de Paris 6, France Table of Contents - Brants T., Stolle R.; Finding Similar Documents in Document Collections - Liu H., Lieberman H.; Robust Photo Retrieval Using World Semantics - Loupy C. de, El-Bèze M.; Managing Synonymy and Polysemy in a Document Retrieval System Using WordNet - Mihalcea R. F.; The Semantic Wildcard - Sadat F., Maeda A., Yoshikawa M., Uemura S.; Statistical Query Disambiguation, Translation and Expansion in Cross-Language Information Retrieval - Jacquemin B, Brunand C., Roux C.; Semantic enrichment for information extraction using word sense disambiguation - Ramakrishnan G., Bhattacharyya P.; Word Sense Disambiguation Using Semantic Sets Based on WordNet - Kaji H., Morimoto Y.; Towards sense-Disambiguated Association Thesauri - Balvet A.; Designing Text Filtering Rules: Interaction between General and Specific Lexical - Grabar N., Pierre Zweigenbaum P.; Lexically-Based Terminology Structuring: a Feasibility Study - Chalendar G. de, Grau B.; Query Expansion by a Contextual Use of Classes of Nouns - Krovetz R.; On the Importance of Word Sense Disambiguation for Information Retrieval Finding Similar Documents in Document Collections Thorsten Brants and Reinhard Stolle Palo Alto Research Center (PARC) 3333 Coyote Hill Rd, Palo Alto, CA 94304, USA brants,stolle ¡ @parc.com Abstract Finding similar documents in natural language document collections is a difficult task that requires general and domain-specific world knowledge, deep analysis of the documents, and inference. However, a large portion of the pairs of similar documents can be identified by simpler, purely word-based methods. We show the use of Probabilistic Latent Semantic Analysis for finding similar documents. We evaluate our system on a collection of photocopier repair tips. Among the 100 top-ranked pairs, 88 are true positives. A manual analysis of the 12 false positives suggests the use of more semantic information in the retrieval model. 1. Introduction sentations of document contents that will allow us to assess not only whether two documents are about the same sub- Collections of natural language documents that are fo- ject but also whether two documents actually say the same cused on a particular subject domain are commonly used thing. We are currently focusing on the tasks of computer- by communities of practice in order to capture and share assisted redundancy resolution. We hope that our tech- knowledge. Examples of such “focused document col- niques will eventually extend to support even more ambi- lections” are FAQs, bug-report repositories, and lessons- tious tasks such as the identification and resolution of in- learned systems. As such systems become larger and larger, consistent knowledge, knowledge fusion, question answer- their authors, users and maintainers increasingly need tools ing, and trend analysis. to perform their tasks, such as browsing, searching, manip- We believe that, in general, the automated or computer- ulating, analyzing and managing the collection. In partic- assisted management of collections of natural language ular, the document collections become unwieldy and ulti- documents requires a fine-grained analysis and represen- mately unusable if obsolete and redundant content is not tation of the documents’ contents. This fine granularity continually identified and removed. in turn mandates deep linguistic processing of the text and We are working with such a knowledge-sharing sys- inference capabilities using extensive linguistic and world tem, focused on the repair of photocopiers. It now contains knowledge. Following this approach, our larger research about 40,000 technician-authored free text documents, in group has implemented a prototype, which we will briefly the form of tips on issues not covered in the official man- describe in the next section. This research prototype sys- uals. Such systems usually support a number of tasks that tem is far from complete. Meanwhile, we are investigating help maintain the utility and quality of the document collec- to what extent currently operational techniques are useful to tion. Simple tools, such as keyword search, for example, support at least some of the tasks that arise from the main- can be extremely useful. Eventually, however, we would tenance of focused document collections. We have investi- like to provide a suite of tools that support a variety of gated the utility of Probabilistic Latent Semantic Analysis tasks, ranging from simple keyword search to more elab- (PLSA) (Hofmann, 1999b) for the task of finding similar orate tasks such as the identification of “duplicates.” Fig. 1 documents. Section 3. describes our PLSA model and Sec- shows a pair of similar tips from our corpus. These two tion 4. reports on our experimental results in the context of tips are about the same problem, and they give a similar our corpus of repair tips. In that section, we also attempt analysis as to why the problem occurs. However, they sug- to characterize the types of similarities that are easily de- gest different solutions: Tip 118 is the “official” solution, tected and contrast them to the types that are easily missed whereas Tip 57 suggests a short-term “work-around” fix to by the PLSA technique. Finally, we speculate how sym- the problem. This example illustrates that “similarity” is a bolic knowledge representation and inference techniques complicated notion that cannot always be measured along that rely on a deep linguistic analysis of the documents may a one-dimensional scale. Whether two or more documents be coupled with statistical techniques in order to improve should be considered “redundant” critically depends on the the results. task at hand. In the example of Fig. 1, the work-around tip may seem redundant and obsolete to a technician who has 2. Knowledge-Based Approach the official new safety cable available. In the absence of this Our goal is to build a system that supports a wide range official part, however, the work-around tip
Recommended publications
  • Music Recommendation and Query-By-Content Using Self- Organizing Maps
    Brigham Young University BYU ScholarsArchive Faculty Publications 2009-06-19 Music Recommendation and Query-by-Content Using Self- Organizing Maps Kyle B. Dickerson [email protected] Dan A. Ventura [email protected] Follow this and additional works at: https://scholarsarchive.byu.edu/facpub Part of the Computer Sciences Commons BYU ScholarsArchive Citation Dickerson, Kyle B. and Ventura, Dan A., "Music Recommendation and Query-by-Content Using Self- Organizing Maps" (2009). Faculty Publications. 867. https://scholarsarchive.byu.edu/facpub/867 This Peer-Reviewed Article is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in Faculty Publications by an authorized administrator of BYU ScholarsArchive. For more information, please contact [email protected], [email protected]. Music Recommendation and Query-by-Content Using Self-Organizing Maps Kyle B. Dickerson and Dan Ventura Computer Science Department Brigham Young University kyle [email protected], [email protected] Abstract— The ever-increasing density of computer storage accurately and reduces the content-rich music to a simple text devices has allowed the average user to store enormous quantities string. of multimedia content, and a large amount of this content is Instead, one could in principle extract various acoustic usually music. Current search techniques for musical content rely on meta-data tags which describe artist, album, year, genre, features from the audio using signal processing techniques etc. Query-by-content systems allow users to search based upon with the distance function dependent upon which musical the acoustical content of the songs. Recent systems have mainly features are used.
    [Show full text]
  • Worlds of Musics: Cognitive Ethnomusicological Inquiries On
    Worlds of Musics: Cognitive Ethnomusicological Inquiries on Experience of Time and Space in Human Music-making Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Yong Jeon Cheong Graduate Program in Music The Ohio State University 2019 Dissertation Committee Udo Will, Advisor Georgia Bishop Graeme Boone Copyrighted by Yong Jeon Cheong 2019 2 Abstract This dissertation is a cognitive ethnomusicological investigation regarding how each individual creates his or her own world via different musical behaviors. The goal of this thesis is to contribute to a model of our sense of time and space from an interdisciplinary perspective. There is a long tradition that we use two cognitive constructs, ‘time’ and ‘space’, when talking about the world. In order to understand how we humans construct our own worlds cognitively via music-making, I first distinguished two behaviors in music performance (singing vs. instrument playing). I looked at how the different modes of music-making shape our body in a distinctive way and modifies our perception of time and space. For the cognitive sections (chapters 2 & 3), I discussed not only building blocks of temporal experience but also features of space pertaining to the body. In order to build a comparative perspective (chapter 4), I examined various ancient understandings of time and space in different cultures. In terms of music evolution (chapter 5), I looked at the transformative power of music-making and speculated about potentially different modulatory processes between singing and instrument playing. The discussion in the cognitive sections provided the basic ideas for my ‘Hear Your Touch’ project consisting of two behavioral experiments (chapter 6).
    [Show full text]
  • Surviving the Charts a Research on the Influence of Label Size on Album Performances
    Surviving the charts A research on the influence of label size on album performances. Wouter Derksen 10075070 Thesis Seminar Business Studies Academic year: 2013-2014 Supervisor: Frederik Situmeang Semester 2, Block 3 Abstract The music industry faced a lot of changes in the past ten years. Physical album sales extremely decreased and online album sales and streaming services increased extensively. The way we buy and listen to our favourite music has altered, affecting the influences labels have on album sales. It is important for labels to adapt their ways of making an album sell. During this research a combination of data on album lifespan and discography information per artist is used, acquired by using the Billboard Top 200 Charts and datasets from Rovi. We assessed the influence major labels have on album performances and how genre changes affect this relationship, by conducting regression analysis. As a result we found that as an artist it is more beneficial to be released through a major label since this leads to a longer album lifespan, higher peak rank and higher debut rank in the Billboard Charts. Genre changes in the new released product compared to the rest of the discography seem to lead to a higher peak and debut rank. I would like to thank Caroline Benelux for giving me the opportunity to do an internship at their office in Amsterdam and giving me insights in the music industry. Also many thanks to Michele Piazzai for supporting me with advice and helping me obtain the data used in this paper. 2 University of Amsterdam Introduction Every year over 30.000 albums are being released, just a few of these albums make it to the charts and sell successfully.
    [Show full text]
  • The CHS Sasquatch YOUR KEY to the CAMBRIDGE UNDERGROUND
    The CHS Sasquatch YOUR KEY TO THE CAMBRIDGE UNDERGROUND Special Class of 1999 Twentieth Reunion Year Digital Edition! Released early Jan. 2020 TABLE OF CONTENTS: Nine glorious pages of natural highs! 1... CHS Class of ’99 Reunion News 2-3... Local/State News 4...National News: Decision 2020 (CHS Sasquatch Guide to Democratic Primary Candidates) 5...National News (Continued) 6...International News 7-8...Editor’s Audiovisual 2019, 2010s Favorites (Music & Film) 9...2020 Horoscope Local News Highlights, pg. 37: Not enough neo-nostalgia for you? Try the original high school material Cambridge Clinic Announces Plans to Move to Deerfield, here: http://www.omnifoo.info/pages/Sasquatch.html Will Keep Its Name “Seniors Look Ahead” features possibly containing your name are on pg. 32 (33) of 2~5 and pg. 58-59 of Volume 3. Go reminisce already! Area Seniors Demand 50 New Lawn-Related Ordinances Recommended Generational Reading (not affiliated w/ The CHS Sasquatch): https://waitbutwhy.com/2020/01/its-2020-and-youre-in-the-future.html https://www.businessinsider.com/xennials-born-between-millennials-and-gen-x-2017-11 https://timeline.com/xennials-generation-4d5aa8cf9a24 https://en.wikipedia.org/wiki/Xennials EXCLUSIVE: What have Class of ‘99 alumni done since their last reunion in 2009? •Gotten Married 62% •Gotten Divorced 33% •Embraced Polyamorous, Swinging Lifestyle and Would Love to Tell You All about It 5% •Gained Weight 85% •Lost Weight 10% •Smashed Scale in Rage 5% •Made babies 40% •Written CHS Sasquatch <1% 1 CHS Class of ’99 Reunion News Pictured Below: Sole Survivors of 2019 Cambridge Tornado CAMBRIDGE, WI - In one of the least reported news events across the nation this year, the village of Cambridge, WI, was struck by a devastating, unseasonable tornado in the early fall of 2019, resulting in the deaths of over 99 percent of the population of just over 1,000 and destroying nearly all local buildings.
    [Show full text]