Linguistic Knowledge Can Improve Information Retrieval
William A. Woods and Lawrence A. Bookman* and Ann Houston and Robert J. Kuhns and Paul Martin and Stephen Green
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803
{William.Woods,Ann.Houston,Robert.Kuhns,Paul.Martin,Stephen.Green}@east.sun.com

* Lawrence Bookman is now at Torrent Systems, Inc.

Abstract

This paper describes the results of some experiments using a new approach to information access that combines techniques from natural language processing and knowledge representation with a penalty-based technique for relevance estimation and passage retrieval. Unlike many attempts to combine natural language processing with information retrieval, these results show substantial benefit from using linguistic knowledge.

1 Introduction

An online information seeker often fails to find what is wanted because the words used in the request are different from the words used in the relevant material. Moreover, the searcher usually spends a significant amount of time reading retrieved material in order to determine whether it contains the information sought. To address these problems, a system has been developed at Sun Microsystems Laboratories (Ambroziak and Woods, 1998) that uses techniques from natural language processing and knowledge representation, with a technique for dynamic passage selection and scoring, to significantly improve retrieval performance. This system is able to locate specific passages in the indexed material where the requested information appears to be, and to score those passages with a penalty-based score that is highly correlated with the likelihood that they contain relevant information. This ability, which we call "Precision Content Retrieval" is achieved by combining a system for Conceptual Indexing with an algorithm for Relaxation-Ranking Specific Passage Retrieval.

In this paper, we show how linguistic knowledge is used to improve search effectiveness in this system. This is of particular interest, since many previous attempts to use linguistic knowledge to improve information retrieval have met with little or mixed success (Fagan, 1989; Lewis and Sparck Jones, 1996; Sparck Jones, 1998; Varile and Zampolli, 1997; Voorhees, 1993; Mandala et al., 1999) (but see the latter for some successes as well).

2 Conceptual Indexing

The conceptual indexing and retrieval system used for these experiments automatically extracts words and phrases from unrestricted text and organizes them into a semantic network that integrates syntactic, semantic, and morphological relationships. The resulting conceptual taxonomy (Woods, 1997) is used by a specific passage-retrieval algorithm to deal with many paraphrase relationships and to find specific passages of text where the information sought is likely to occur. It uses a lexicon containing syntactic, semantic, and morphological information about words, word senses, and phrases to provide a base source of semantic and morphological relationships that are used to organize the taxonomy. In addition, it uses an extensive system of knowledge-based morphological rules and functions to analyze words that are not already in its lexicon, in order to construct new lexical entries for previously unknown words (Woods, 2000). In addition to rules for handling derived and inflected forms of known words, the system includes rules for lexical compounds and rules that are capable of making reasonable guesses for totally unknown words.

A pilot version of this indexing and retrieval system, implemented in Lisp, uses a collection of approximately 1200 knowledge-based morphological rules to extend a core lexicon of approximately 39,000 words to give coverage that exceeds that of an English lexicon of more than 80,000 base forms (or 150,000 base plus inflected forms). Later versions of the conceptual indexing and retrieval system, implemented in C++, use a lexicon of approximately 150,000 word forms that is automatically generated by the Lisp-based morphological analysis from its core lexicon and an input word list. The base lexicon is extended further by an extensive name dictionary and by further morphological analysis of unknown words at indexing time. This paper will describe some experiments using several versions of this system. In particular, it will focus on the role that the linguistic knowledge sources play in its operation.

The lexicon used by the conceptual indexing system contains syntactic information that can be used for the analysis of phrases, as well as morphological and semantic information that is used to relate more specific concepts to more general concepts in the conceptual taxonomy. This information is integrated into the conceptual taxonomy by considering base forms of words to subsume their derived and inflected forms ("root subsumption") and more general terms to subsume more specific terms. The system uses these relationships as the basis for inferring subsumption relationships between more general phrases and more specific phrases according to the intensional subsumption logic of Woods (Woods, 1991).

The largest base lexicon used by this system currently contains semantic subsumption information for something in excess of 15,000 words. This information consists of basic "kind of" and "instance of" information such as the fact that book is a kind of document and washing is a kind of cleaning. The lexicon also records morphological roots and affixes for words that are derived or inflected forms of other words, and information about different word senses and their interrelationships. For example, the conceptual indexing system is able to categorize becomes black as a kind of color change because becomes is an inflected form of become, become is a kind of change, and black is a color. Similarly, color disruption is recognized as a kind of color change, because the system recognizes disruption as a derived form of disrupt, which is known in the lexicon to be a kind of damage, which is known to be a kind of change.

When using root subsumption as a technique for information retrieval, it is important to have a core lexicon that knows correct morphological analyses for words that the rules would otherwise analyze incorrectly. For example, the following are some examples of words that could be analyzed incorrectly if the correct interpretations were not specified in the lexicon:

    delegate   (de + leg + ate)   take the legs from
    caress     (car + ess)        female car
    cashier    (cashy + er)       more wealthy
    daredevil  (dared + evil)     serious risk
    lacerate   (lace + rate)      speed of tatting
    pantry     (pant + ry)        heavy breathing
    pigeon     (pig + eon)        the age of peccaries
    ratify     (rat + ify)        infest with rodents
    infantry   (infant + ry)      childish behavior

Although they are not always as humorous as the above examples, there are over 3,000 words in the core lexicon of 39,000 English words that would receive false morphological analyses like the above examples, if the words were not already in the lexicon.

3 Relaxation Ranking and Specific Passage Retrieval

The system we are evaluating uses a technique called "relaxation ranking" to find specific passages where as many as possible of the different elements of a query occur near each other, preferably in the same form and word order and preferably closer together. Such passages are ranked by a penalty score that measures the degree of deviation from an exact match of the requested phrase, with smaller penalties being preferred. Differences in morphological form and formal subsumption of index terms by query terms introduce small penalties, while intervening words, unexplained permutations of word order, and crossing sentence boundaries introduce more significant penalties. Elements of a query that cannot be found nearby introduce substantial penalties that depend on the syntactic categories of the missing words.

When the conceptual indexing system is presented with a query, the relaxation-ranking retrieval algorithm searches through the conceptual taxonomy for appropriately related concepts and uses the positions of those concepts in the indexed material to find specific passages that are likely to address the information needs of the request. This search can find relationships from base forms of words to derived forms and from more general terms to more specific terms, by following paths in the conceptual taxonomy.

For example, the following is a passage retrieved by this system, when applied to the UNIX® operating system online documentation (the "man pages"):

    Query: print a message from the mail tool

    6. -2.84 print mail mail mailtool

    Print sends copies of all the selected mail items to your default printer. If there are no selected items, mailtool sends copies of those items you are currently...

The indicated passage is ranked 6th in a returned list of found passages, indicated by the 6 in the above display. The number -2.84 is the penalty score assigned to the passage, and the subsequent words print, mail, mail, and mailtool indicate the words in the text that are matched to the corresponding content words in the input query. In this case, print is matched to print, message to mail, mail to mail, and tool to mailtool, respectively. This is followed by the content of the actual passage located. The information provided in these hit displays gives the information seeker a clear idea of why the passage was retrieved and enables the searcher to quickly skip down the hit list with little time spent looking at irrelevant passages. In this case, it was easy to identify that the 6th ranked hit was the best one and contained the relevant information.

The retrieval of this passage involved use of a semantic subsumption relationship to match message to mail, because the lexical entry for mail recorded that it was a kind of message.

4 Experimental Evaluation

In order to evaluate the effectiveness of the above techniques, a set of 90 queries was collected from a naive user of the UNIX operating system, 84 of which could be answered from the online documen-