
Linguistic Knowledge can Improve Information Retrieval

William A. Woods and Lawrence A. Bookman* and Ann Houston and Robert J. Kuhns and Paul Martin and Stephen Green
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803
{William.Woods,Ann.Houston,Robert.Kuhns,Paul.Martin,Stephen.Green}@east.sun.com

Abstract

This paper describes the results of some experiments using a new approach to information retrieval that combines techniques from natural language processing and knowledge representation with a penalty-based technique for relevance estimation and passage retrieval. Unlike many attempts to combine natural language processing with information retrieval, these results show substantial benefit from using linguistic knowledge.

1 Introduction

An online information seeker often fails to find what is wanted because the words used in the request are different from the words used in the relevant material. Moreover, the searcher usually spends a significant amount of time reading retrieved material in order to determine whether it contains the information sought. To address these problems, a system has been developed at Sun Microsystems Laboratories (Ambroziak and Woods, 1998) that uses techniques from natural language processing and knowledge representation, with a technique for dynamic passage selection and scoring, to significantly improve retrieval performance. This system is able to locate specific passages in the indexed material where the requested information appears to be, and to score those passages with a penalty-based score that is highly correlated with the likelihood that they contain relevant information. This ability, which we call "Precision Content Retrieval," is achieved by combining a system for Conceptual Indexing with an algorithm for Relaxation-Ranking Specific Passage Retrieval.

In this paper, we show how linguistic knowledge is used to improve search effectiveness in this system. This is of particular interest, since many previous attempts to use linguistic knowledge to improve information retrieval have met with little or mixed success (Fagan, 1989; Lewis and Sparck Jones, 1996; Sparck Jones, 1998; Varile and Zampolli, 1997; Voorhees, 1993; Mandala et al., 1999) (but see the latter for some successes as well).

* Lawrence Bookman is now at Torrent Systems, Inc.

2 Conceptual Indexing

The conceptual indexing and retrieval system used for these experiments automatically extracts words and phrases from unrestricted text and organizes them into a semantic network that integrates syntactic, semantic, and morphological relationships. The resulting conceptual taxonomy (Woods, 1997) is used by a specific passage-retrieval algorithm to deal with many paraphrase relationships and to find specific passages of text where the information sought is likely to occur. It uses a lexicon containing syntactic, semantic, and morphological information about words, word senses, and phrases to provide a base source of semantic and morphological relationships that are used to organize the taxonomy. In addition, it uses an extensive system of knowledge-based morphological rules and functions to analyze words that are not already in its lexicon, in order to construct new lexical entries for previously unknown words (Woods, 2000). In addition to rules for handling derived and inflected forms of known words, the system includes rules for lexical compounds and rules that are capable of making reasonable guesses for totally unknown words.

A pilot version of this indexing and retrieval system, implemented in Lisp, uses a collection of approximately 1200 knowledge-based morphological rules to extend a core lexicon of approximately 39,000 words to give coverage that exceeds that of an English lexicon of more than 80,000 base forms (or 150,000 base plus inflected forms). Later versions of the conceptual indexing and retrieval system, implemented in C++, use a lexicon of approximately 150,000 word forms that is automatically generated by the Lisp-based morphological analysis from its core lexicon and an input word list. The base lexicon is extended further by an extensive name dictionary and by further morphological analysis of unknown words at indexing time. This paper will describe some experiments using several versions of this system. In particular, it will focus on the role that the linguistic knowledge sources play in its operation.

The lexicon used by the conceptual indexing system contains syntactic information that can be used
for the analysis of phrases, as well as morphological and semantic information that is used to relate more specific concepts to more general concepts in the conceptual taxonomy. This information is integrated into the conceptual taxonomy by considering base forms of words to subsume their derived and inflected forms ("root subsumption") and more general terms to subsume more specific terms. The system uses these relationships as the basis for inferring subsumption relationships between more general phrases and more specific phrases according to the intensional subsumption logic of Woods (Woods, 1991).

The largest base lexicon used by this system currently contains semantic subsumption information for something in excess of 15,000 words. This information consists of basic "kind of" and "instance of" information such as the fact that book is a kind of document and washing is a kind of cleaning. The lexicon also records morphological roots and affixes for words that are derived or inflected forms of other words, and information about different word senses and their interrelationships. For example, the conceptual indexing system is able to categorize becomes black as a kind of color change because becomes is an inflected form of become, become is a kind of change, and black is a color. Similarly, color disruption is recognized as a kind of color change, because the system recognizes disruption as a derived form of disrupt, which is known in the lexicon to be a kind of damage, which is known to be a kind of change.

When using root subsumption as a technique for information retrieval, it is important to have a core lexicon that knows correct morphological analyses for words that the rules would otherwise analyze incorrectly. For example, the following are some examples of words that could be analyzed incorrectly if the correct interpretations were not specified in the lexicon:

    delegate   (de + leg + ate)   take the legs from
    caress     (car + ess)        female car
    cashier    (cashy + er)       more wealthy
    daredevil  (dared + evil)     serious risk
    lacerate   (lace + rate)      speed of tatting
    pantry     (pant + ry)        heavy breathing
    pigeon     (pig + eon)        the age of peccaries
    ratify     (rat + ify)        infest with rodents
    infantry   (infant + ry)      childish behavior

Although they are not always as humorous as the above examples, there are over 3,000 words in the core lexicon of 39,000 English words that would receive false morphological analyses like the above examples, if the words were not already in the lexicon.

3 Relaxation Ranking and Specific Passage Retrieval

The system we are evaluating uses a technique called "relaxation ranking" to find specific passages where as many as possible of the different elements of a query occur near each other, preferably in the same form and word order and preferably closer together. Such passages are ranked by a penalty score that measures the degree of deviation from an exact match of the requested phrase, with smaller penalties being preferred. Differences in morphological form and formal subsumption of index terms by query terms introduce small penalties, while intervening words, unexplained permutations of word order, and crossing sentence boundaries introduce more significant penalties. Elements of a query that cannot be found nearby introduce substantial penalties that depend on the syntactic categories of the missing words.

When the conceptual indexing system is presented with a query, the relaxation-ranking retrieval algorithm searches through the conceptual taxonomy for appropriately related concepts and uses the positions of those concepts in the indexed material to find specific passages that are likely to address the information needs of the request. This search can find relationships from base forms of words to derived forms and from more general terms to more specific terms, by following paths in the conceptual taxonomy.

For example, the following is a passage retrieved by this system, when applied to the UNIX® operating system online documentation (the "man pages"):

    Query: print a message from the mail tool

    6. -2.84 print mail mail mailtool

    Print sends copies of all the selected mail items to your default printer. If there are no selected items, mailtool sends copies of those items you are currently...

The indicated passage is ranked 6th in a returned list of found passages, indicated by the 6 in the above display. The number -2.84 is the penalty score assigned to the passage, and the subsequent words print, mail, mail, and mailtool indicate the words in the text that are matched to the corresponding content words in the input query. In this case, print is matched to print, message to mail, mail to mail, and tool to mailtool, respectively. This is followed by the content of the actual passage located. The information provided in these hit displays gives the information seeker a clear idea of why the passage was retrieved and enables the searcher to quickly skip down the hit list with little time spent looking at irrelevant passages. In this case, it was easy to
identify that the 6th ranked hit was the best one and contained the relevant information.

The retrieval of this passage involved use of a semantic subsumption relationship to match message to mail, because the lexical entry for mail recorded that it was a kind of message. It used a morphological root subsumption to match tool to mailtool because the morphological analyzer analyzed the unknown word mailtool as a compound of mail and tool and recorded that its root was tool and that it was a kind of tool modified by mail. Taking away the ability to morphologically analyze unknown words would have blocked the retrieval of this passage, as would eliminating the lexical subsumption entry that recorded mail as a kind of message.

Like other approaches to passage retrieval (Kaszkiel and Zobel, 1997; Salton et al., 1993; Callan, 1994), the relaxation-ranking retrieval algorithm identifies relevant passages rather than simply identifying whole documents. However, unlike approaches that involve segmenting the material into paragraphs or other small passages before indexing, this algorithm dynamically constructs relevant passages in response to requests. When responding to a request, it uses information in the index about positions of concepts in the text to identify relevant passages. In response to a single request, identified passages may range in size from a single word or phrase to several sentences or paragraphs, depending on how much context is required to capture the various elements of the request.

In a to the specific passage retrieval system, retrieved passages are reported to the user in increasing order of penalty, together with the rank number, penalty score, information about which target terms match the corresponding query terms, and the content of the identified passage with some surrounding context as illustrated above. In one version of this technology, results are presented in a hypertext interface that allows the user to click on any of the presented items to see that passage in its entire context in the source document. In addition, the user can be presented with a display of portions of the conceptual taxonomy related to the terms in the request. This frequently reveals useful generalizations of the request that would find additional relevant information, and it also conveys an understanding of what concepts have been found in the material that will be matched by the query terms. For example, in one experiment, searching the online documentation for the Emacs text editor, the request jump to end of file resulted in feedback showing that jump was classified as a kind of move in the conceptual taxonomy. This led to a reformulated request, move to end of file, which successfully retrieved the passage go to end of buffer.

4 Experimental Evaluation

In order to evaluate the effectiveness of the above techniques, a set of 90 queries was collected from a naive user of the UNIX operating system, 84 of which could be answered from the online documentation known as the man pages. A set of "correct" answers for each of these 84 queries was manually determined by an independent UNIX operating system expert, and a snapshot of the man pages collection was captured and indexed for retrieval. In order to compare this methodology with classical document retrieval techniques, we assign a ranking score to each document equal to the ranking score of the best ranked passage that it contains.

In rating the performance of a given method, we compute average recall and precision values at 10 retrieved documents, and we also compute a "success rate," which is simply the percentage of queries for which an acceptable answer occurs in the top ten hits. The success rate is the principal factor on which we base our evaluations, since for this application, the user is not interested in subsequent answers once an acceptable answer has been found, and finding one answer for each of two requests is a substantially better result than finding two answers to one request and none for another.

These experiments were conducted using an experimental retrieval system that combined a Lisp-based language processing stage with a C++ implementation of a conceptual indexer. The linguistic knowledge sources used in these experiments included a core lexicon of approximately 18,000 words, a substantial set of morphological rules, and specialized morphological algorithms covering inflections, prefixes, suffixes, lexical compounding, and a variety of special forms, including numbers, ordinals, Roman numerals, dates, phone numbers, and acronyms. In addition, they made use of a lexical subsumption taxonomy of approximately 3000 lexical subsumption relations, and a small set of semantic entailment axioms (e.g., display entails see, but is not a kind of see). This system is described in (Woods, 1997). The was a snapshot of the local man pages (frozen at the time of the experiment so that it wouldn't change during the experiment), consisting of approximately 1800 files of varying lengths and constituting a total of approximately 10 megabytes of text.

Table 1 shows the results of comparing three versions of this technology with a textbook implementation of the standard tfidf algorithm (Salton, 1989) and with the SearchIt™ search application developed at Sun Microsystems, Inc., which combines a

Table 1: A comparison of different retrieval techniques.

    System             Success Rate   Recall (10 docs)   Precision (10 docs)
    tfidf              28.6%          14.8%              2.9%
    SearchIt system    44.0%          28.5%              7.4%
    Recall II          60.7%          38.6%              7.3%
      w/o morph        50.0%          not measured       not measured
      w/o knowledge    42.9%          not measured       not measured
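
The three measures reported in Table 1 can be illustrated with a short sketch. The Python below is a minimal illustration under stated assumptions, not the code used in the experiments: the function names rank_documents and evaluate, the (doc_id, penalty) input shape, the use of nonnegative penalties where smaller is better, and the definition of precision as hits divided by the cutoff are all choices made for this example. It scores each document by its best-ranked passage, then computes the success rate and the average recall and precision over the top ten documents.

```python
from typing import Dict, List, Set, Tuple


def rank_documents(passages: List[Tuple[str, float]]) -> List[str]:
    """Rank documents by the penalty of their best (lowest-penalty) passage.

    `passages` holds hypothetical (doc_id, penalty) pairs; a smaller
    penalty means a closer match (assumption for this sketch).
    """
    best: Dict[str, float] = {}
    for doc_id, penalty in passages:
        if doc_id not in best or penalty < best[doc_id]:
            best[doc_id] = penalty
    return sorted(best, key=lambda doc_id: best[doc_id])


def evaluate(
    rankings: Dict[str, List[str]],
    answers: Dict[str, Set[str]],
    k: int = 10,
) -> Tuple[float, float, float]:
    """Return (success rate, mean recall at k, mean precision at k).

    Assumes every query has at least one correct answer, as with the
    84 answerable queries in the experiment.
    """
    successes = 0
    recall_sum = 0.0
    precision_sum = 0.0
    for query, ranked in rankings.items():
        hits = sum(1 for doc_id in ranked[:k] if doc_id in answers[query])
        if hits > 0:
            successes += 1  # one acceptable answer in the top k is enough
        recall_sum += hits / len(answers[query])
        precision_sum += hits / k
    n = len(rankings)
    return successes / n, recall_sum / n, precision_sum / n
```

Under these definitions, finding one answer for each of two requests gives a success rate of 100%, while finding two answers for one request and none for the other gives 50%, even though the number of relevant documents retrieved is the same in both cases. This is the distinction the success rate is meant to capture.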

simple morphological with a state-of-the-art commercial search engine. In the table, Recall II refers to the full conceptual indexing and search system with all of its knowledge sources and rules. The line labeled "w/o morph" refers to this system with its dynamic morphological rules turned off, and the line labeled "w/o knowledge" refers to this system with all of its knowledge sources and rules turned off. The table presents the success rate and the measured recall and precision values for 10 retrieved documents. We measured recall and precision at the 10 document level because internal studies of searching behavior had shown that users tended to give up if an answer was not found in the first ten ranked hits. We measured success rate, rather than recall and precision, for our ablation studies, because standard recall and precision measures are not sensitive to the distinction between finding multiple answers to a single request versus finding at least one answer for more requests.

5 Discussion

Table 1 shows that for this task, the relaxation-ranking passage retrieval algorithm without its supplementary knowledge sources (Recall II w/o knowledge) is roughly comparable in performance (42.9% versus 44.0% success rate) to a state-of-the-art commercial search engine (SearchIt) at the pure document retrieval task (neglecting the added benefit of locating the specific passages). Adding the knowledge in the core lexicon (which includes morphological relationships, semantic subsumption axioms, and entailment relationships), but without morphological analysis of unknown words (Recall II w/o morph), significantly improves these results (from 42.9% to 50.0%). Further adding the morphological analysis capability that automatically analyzes unknown words (deriving additional morphological relationships and some semantic subsumption relationships) significantly improves that result (from 50.0% to 60.7%). In contrast, we found that adding the same semantic subsumption relationships to the commercial search engine, using its provided thesaurus capability, degraded its results, and results were still degraded when we added only those facts that we knew would help find relevant documents. It turned out that the additional relevant documents found were more than offset by additional irrelevant documents that were also ranked more highly.

6 Anecdotal Evaluation of Specific Passage Retrieval Benefits

As mentioned above, comparing the relaxation-ranking algorithm with systems measures only a part of the benefit of the specific passage retrieval methodology. Fully evaluating the quality and ranking of the retrieved passages involves a great many subtleties. However, two informal evaluations have been conducted that shed some light on the benefits.

The first of these was a pilot study of the technology at a telecommunications company. In that study, one user found that she could use a single query to the conceptual indexing system to find both of the items of information necessary to complete a task that formerly required searching two separate . The conclusion of that study was that the concept retrieval technology performs well enough to be useful to a person talking live with a customer. It was observed that the returned hits can be compared with one another easily and quickly by eye, and attention is taken directly to the relevant content of a large document. The automatic indexing was considered a plus compared with manual methods of content indexing. It was observed that an area of great potential may be in a form of knowledge management that involves organizing and providing intelligent access to small, unrelated "nuggets" of textual knowledge that are not amenable to conventional database archival or .

A second experiment was conducted by the Human Resources Webmaster of a high-tech company, an experienced user of search engines who used this technology to index his company's internal HR web site. He then measured the time it took him to process 15 typical HR requests, first using conventional search tools that he had available, and then using the Conceptual Indexing technology. In both cases, he measured the time it took him to either find the answer or to conclude that the answer wasn't in the indexed material. His measured times for the total suite were 55 minutes using the conventional
tools and 11 minutes using the conceptual indexing technology. Of course, this was an uncontrolled experiment, and there is some potential that information learned from searching with the traditional tools (which were apparently used first) might have provided some benefit when using the conceptual indexing technology. However, the fact that he found things with the latter that he did not find with the former and the magnitude of the time difference suggest that there is an effect, albeit perhaps not as great as the measurements. As a result of this experience, he concluded that he would expect many users to take much longer to find materials or give up, when using the traditional tools. He anticipated that after finding some initial materials, more time would be required, as users would end up having to call people for additional information. He estimated that users could spend up to an hour trying to get the information they needed...having to call someone, wait to make contact and finally get the information they needed. Using the conceptual indexing search engine, he expected that these times would be at least halved.

7 Conclusion

We have described some experiments using linguistic knowledge in an information retrieval system in which passages within texts are dynamically found in response to a query and are scored and ranked based on a relaxation of constraints. This is a different approach from previous methods of passage retrieval and from previous attempts to use linguistic knowledge in information retrieval. These experiments show that linguistic knowledge can significantly improve information retrieval performance when incorporated into a knowledge-based relaxation-ranking algorithm for specific passage retrieval.

The linguistic knowledge considered here includes the use of morphological relationships between words, taxonomic relationships between concepts, and general semantic entailment relationships between words and concepts. We have shown that the combination of these three knowledge sources can significantly improve performance in finding appropriate answers to specific queries when incorporated into a relaxation-ranking algorithm. It appears that the penalty-based relaxation-ranking algorithm figures crucially in this success, since the addition of such linguistic knowledge to traditional information retrieval models typically degrades retrieval performance rather than improving it, a pattern that was borne out in our own experiments.

Acknowledgments

Many other people have been involved in creating the conceptual indexing and retrieval system described here. These include: Gary Adams, Jacek Ambroziak, Cookie Callahan, Chris Colby, Jim Flowers, Ellen Hays, Patrick Martin, Peter Norvig, Tony Passera, Philip Resnik, Robert Sproull, and Mark Torrance.

Sun, Sun Microsystems, and SearchIt are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries.

UNIX is a registered trademark in the United States and other countries, exclusively licensed through X/Open Company, Ltd.

References

Jacek Ambroziak and William A. Woods. 1998. Natural language technology in precision content retrieval. In International Conference on Natural Language Processing and Industrial Applications, Moncton, New Brunswick, Canada, August. www.stm.com/research/techrep/1998/abstract-69.html.

Jamie P. Callan. 1994. Passage-level evidence in document retrieval. SIGIR, pages 302-309.

J. L. Fagan. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science, 40(2):115-132, March.

Martin Kaszkiel and Justin Zobel. 1997. Passage retrieval revisited. SIGIR, pages 302-309.

David D. Lewis and Karen Sparck Jones. 1996. Natural language processing for information retrieval. CACM, 39(1):92-101.

Rila Mandala, Takenobu Tokunaga, and Hozumi Tanaka. 1999. Combining multiple evidence from different types of thesaurus for query expansion. In Proceedings on the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM-SIGIR.

Gerald Salton, James Allan, and Chris Buckley. 1993. Approaches to passage retrieval in full text information systems. SIGIR, pages 49-58.

Gerald Salton. 1989. Automatic Text Processing. Addison Wesley, Reading, MA.

Karen Sparck Jones. 1998. A look back and a look forward. SIGIR, pages 13-29.

Giovanni Varile and Antonio Zampolli, editors. 1997. Survey of the State of the Art in Human Language Technology. Cambridge Univ. Press.

Ellen M. Voorhees. 1993. Using WordNet to disambiguate word senses for text retrieval. In Proceedings of 16th ACM SIGIR Conference. ACM-SIGIR.

William A. Woods. 1991. Understanding subsumption and taxonomy: A framework for progress. In John Sowa, editor, Principles of Semantic
Networks: Explorations in the Representation of Knowledge, pages 45-94. Morgan Kaufmann, San Mateo, CA.

William A. Woods. 1997. Conceptual indexing: A better way to organize knowledge. Technical Report SMLI TR-97-61, Sun Microsystems Laboratories, Mountain View, CA, April. www.sun.com/research/techrep/1997/abstract-61.html.

William A. Woods. 2000. Aggressive morphology for robust lexical coverage. In (these proceedings).
