Using Coreference for Question Answering

Thomas S. Morton
Department of Computer and Information Science
University of Pennsylvania
tsmorton@cis.upenn.edu

Abstract

We present a system which retrieves answers to queries based on coreference relationships between entities and events in the query and documents. An evaluation of this system is given which demonstrates that the amount of information that the user must process on average, to find an answer to their query, is reduced by an order of magnitude.

1 Introduction

Search engines have become ubiquitous as a means for accessing information. When a ranking of documents is returned by a search engine the information retrieval task is usually not complete. The document, as a unit of information, is often too large for many users' information needs, and finding information within the set of returned documents poses a burden of its own. Here we examine a technique for extracting sentences from documents which attempts to satisfy the user's information needs by providing an answer to the query presented. The system does this by modeling coreference relationships between entities and events in the query and documents. An evaluation of this system is given which demonstrates that it performs better than a standard tf·idf weighting and that the amount of information that the user must process on average, to find an answer to their query, is reduced by an order of magnitude over document ranking alone.

2 Problem Statement

A query indicates an informational need by the user to the search engine. The information required may take the form of a sentence or even a noun phrase. Here the task is to retrieve the passage of text which contains the answer to the query from a small collection of documents. Sentences are then ranked and presented to the user. We only examine queries whose answers are likely to be stated in a sentence or noun phrase, since answers which are typically longer can be difficult to annotate reliably. This technology differs from the standard document ranking task in that, if successful, the user will likely not need to examine any of the retrieved documents in their entirety. It also differs from the document summarization provided by many search engines today, in that the sentences selected are influenced by the query and are selected across multiple documents.

We view a system such as ours as providing a secondary level of processing after a small set of documents, which the user believes contain the information desired, has been found. This first step would likely be provided by a traditional search engine; thus this technology serves as an enhancement to existing document retrieval systems rather than a replacement. Advancements in document retrieval would only help the performance of a system such as ours, as these improvements would increase the likelihood that the answer to the user's query is in one of the top ranked documents returned.

3 Approach

A query is viewed as identifying a relation to which a user desires a solution. This relation will most likely involve events and entities, and an answer to this relation will involve the same events and entities. Our approach attempts to find coreference relationships between the entities and events evoked by the query and those evoked in the document. Based on these relationships, sentences are ranked, and the highest ranked sentences are displayed to the user.

The coreference relationships that are modeled by this system include identity, part-whole,

and synonymy relations. Consider the following query and answer pairs.

Query: What did Mark McGwire say about child abuse?

Sentence: "What kills me is that you know there are kids over there who are being abused or neglected, you just don't know which ones," McGwire says.

In the above query-answer pair the system attempts to capture the identity relationship between Mark McGwire and McGwire by determining that the term McGwire in this sentence is coreferent with a mention of Mark McGwire earlier in the document. This allows the system to rank this sentence equivalently to a sentence mentioning the full name. The system also treats the term child abuse as a nominalization, which allows it to speculate that the term abused in the sentence is a related event. Finally, the verb neglect occurs frequently within documents which contain the verb abuse, which is nominalized in the query, so this term is treated as a related event. The system does not currently have a mechanism which tries to capture the relationship between kids and children.

Query: Why did the U.S. bomb Sudan?

Sentence: Last month, the United States launched a cruise missile attack against the Shifa Pharmaceutical Industries plant in Khartoum, alleging that U.S. intelligence agencies have turned up evidence - including soil samples - showing that the plant was producing chemicals which could be used to make VX, a deadly nerve gas.

In this example one of the entity-based relationships of interest is the identity relationship between U.S. and United States. Also of interest is the part-whole relationship between Sudan and Khartoum, its capital. Finally, the bomb event is related to the launch/attack event. The system does not currently have a mechanism which tries to capture the relationship between Why and alleging or evidence.

4 Implementation

The relationships above are captured by a number of different techniques which can be placed in essentially two categories. The first group finds identity relationships between different invocations of the same entity in a document. The second identifies more loosely defined relationships such as part-whole and synonymy. Each of the relationships identified is given a weight, and based on the weights and the relationships themselves, sentences are ranked and presented to the user.

4.1 Identity Relationships

Identity relationships are first determined between the string instantiations of entities in single documents. This is done so that the discourse context in which these strings appear can be taken into account. The motivation for this comes in part from example texts where the same last name will be used to refer to different individuals in the same family. This is often unambiguous because full names are used in previous sentences; however, this requires some modeling of which entities are most salient in the discourse. These relations are determined using techniques described in (Baldwin et al., 1998).

Another source of identity relationships is morphological and word order variation. Within noun phrases in the query the system constructs other possible word combinations which contain the head word of the noun phrase. For example, a noun phrase such as "the photographed little trouper" would be extended to include "the photographed trouper", "the little trouper", and "the trouper", as well as variations excluding the determiner. Each of the variations is given a weight based on the ratio of the score that the new shorter term would have received had it appeared in the query to the score of the actual noun phrase that occurred. The morphological roots of single word variations are also added to the list of possible terms which can refer to the entity or event, with no additional deduction in weighting. Finally, query entities which are found in an acronym database are added to the list of coreferring terms as well, with a weight of 1.
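The variant-generation step described above can be sketched as follows. This is a minimal illustration assuming a list-of-tokens representation, generating only subsets of pre-head modifiers with and without the determiner; the function name is an assumption, and the score-ratio weighting the system applies to each variant is omitted here.

```python
from itertools import combinations

def np_variants(tokens, head):
    """Generate shorter word combinations of a query noun phrase that
    retain the head word, with and without the determiner.

    tokens: the noun phrase, e.g. ["the", "photographed", "little", "trouper"]
    head:   the head word of the phrase, e.g. "trouper"
    """
    det = tokens[0] if tokens[0].lower() in {"the", "a", "an"} else None
    mods = [t for t in tokens if t != head and t != det]
    variants = set()
    # Every subset of the modifiers (order preserved), head always kept.
    for r in range(len(mods) + 1):
        for combo in combinations(mods, r):
            core = list(combo) + [head]
            variants.add(" ".join(core))
            if det:
                variants.add(" ".join([det] + core))
    return variants

print(sorted(np_variants(["the", "photographed", "little", "trouper"], "trouper")))
```

For the example phrase this yields "the photographed trouper", "the little trouper", and "the trouper", along with the determiner-less variants.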

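Looking ahead to Sections 4.2 and 4.3, the way these weighted expansions feed sentence ranking can be sketched as follows. The score S(w1, w2) = idf(w1) × weight_w1(w2), the per-entity maximum, and the per-sentence sum follow the formulas given below; the dictionary-based expansion table and all function names here are illustrative assumptions, not the system's actual data structures.

```python
import math

def idf(term, doc_freq, num_docs):
    # idf(w1) = log(N / df(w1)); unseen terms default to df = 1.
    return math.log(num_docs / doc_freq.get(term, 1))

def score_sentence(query_terms, sentence_tokens, expansions, doc_freq, num_docs):
    """Sum, over query entities/events, of the best-matching expansion score.

    expansions[w1] maps each term w2 that can refer to w1 to a weight
    (1.0 for w1 itself); weight_w1(w2) is 0 for unexpanded terms.
    """
    total = 0.0
    for w1 in query_terms:
        weights = expansions.get(w1, {w1: 1.0})
        # S(w1, w2) = idf(w1) * weight_w1(w2); take the max over the sentence.
        best = max((idf(w1, doc_freq, num_docs) * weights.get(w2, 0.0)
                    for w2 in sentence_tokens), default=0.0)
        total += best
    return total
```

Sentences would then be presented to the user in descending order of this score.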
4.2 Part-Whole and Synonymy Relationships

The system captures part-whole and synonymy relationships by examining co-occurrence statistics between certain classes of words. Specifically, co-occurrence statistics are gathered on verbs and nominalizations which co-occur much more often than one would expect based on chance alone. This is also done for proper nouns. For each verbal pair or proper noun pair the mutual information between the two is computed as follows:

I(w1, w2) = log( p(w1, w2) / ( p(w1) p(w2) ) )

where w1 and w2 are words and an event is defined as a word occurring in a document. All words w2 for which I(w1, w2) exceeds a threshold, where w1 is a query term, are added to the list of terms with which the query term can be referred to. This relationship is given a weight of I(w1, w2)/N, where N is a normalization constant. The counts for the mutual information statistics were gathered from a corpus of over 62,000 Wall Street Journal articles which have been automatically tagged and parsed.

4.3 Sentence Ranking

Before sentence ranking begins, each entity or event in the query is assigned a weight. This weight is the sum of the inverse document frequency measures of the entity's or event's terms, based on their occurrence in the Wall Street Journal corpus described in the previous section. This measure is computed as:

idf(w1) = log( N / df(w1) )

where N is the total number of documents in the corpus and df(w1) is the number of documents which contain word w1. Once weighted, the system compares the entities and events evoked by the query with the entities and events evoked by the document. The comparison is done via simple string matching against all the terms with which the system has determined an entity or event can be referred to. Since these term expansions are weighted, the score for a particular term w2 and a query term w1 is:

S(w1, w2) = idf(w1) × weight_w1(w2)

where weight_w1 is the weight assigned during one of the previous term expansion phases and idf is defined above. The weight_w1 function is defined to be 0 for any term w2 for which no expansion took place. The score for a particular entity or event in the document with respect to an entity or event in the query is the maximum value of S(w1, w2) over all values of w1 and w2 for that entity or event. A particular sentence's score is computed as the sum of the scores of the set of entities and events it evokes.

For the purpose of evaluation a baseline system was also constructed. This system followed a more standard information retrieval approach to text ranking, described in (Salton, 1989). Each token in the query is assigned an idf score, also based on the same corpus of Wall Street Journal articles as used with the other system. Query expansion simply consisted of stemming the tokens using a version of the Porter stemmer, and sentences were scored as the sum of all matching terms, giving the familiar tf·idf measure.

5 Evaluation

For the evaluation of the system, ten queries were selected from a collection of actual queries presented to an online search engine. Queries were selected based on their expressing the user's information need clearly, their being likely to be answered in a single sentence, and their non-dubious intent. The queries used in this evaluation are as follows:

• Why has the dollar weakened against the yen?

• What was the first manned Apollo mission to circle the moon?

• What virus was spread in the U.S. in 1968?

• Where were the 1968 Summer Olympics held?

• Who wrote "The Once and Future King"?

• What did Mark McGwire say about child abuse?

• What are the symptoms of Chronic Fatigue Syndrome?

• What kind of tanks does Israel have?

• What is the life span of a white tailed deer?

• Who was the first president of Turkey?

The information requested by each query was then searched for in a data source which was considered likely to contain the answer. Sources for these experiments include Britannica Online, CNN, and the Web at large. Once a promising set of documents was retrieved, the top ten were annotated for instances of the answer to the query. The system was then asked to process the ten documents and present a ranked listing of sentences.

System performance is presented below as the rank of the top ranked sentence which contained an answer to the question. A question mark indicates that an answer did not appear in the top ten ranked sentences.

Query   First answer's rank
        Full System   Baseline
1       2             4
2       2             3
3       8             6
4       2             4
5       7             8
6       1             3
7       4             ?
8       ?             ?
9       1             1
10      1             1

6 Discussion

Sentence extraction and ranking, while similar in its information retrieval goals to document ranking, appears to have very different properties. While a document can often stand alone in its interpretation, the interpretation of a sentence is very dependent on the context in which it appears. The modeling of the discourse gives the entity-based system an advantage over token-based models in situations where referring expressions which provide little information outside of their discourse context can be related to the query. The most extreme example of this is the use of pronouns.

The query expansion techniques presented here are simplistic compared to many used in information retrieval; however, they are trying to capture a different phenomenon. Here the goal is to capture different lexicalizations of the same entities and events. Since short news articles are likely to focus on a small number of entities and perhaps a single event or a group of related events, it is hoped that the co-occurrence statistics gathered will reveal good candidates for alternate ways in which the query entities and events can be lexicalized.

This work employs many of the techniques used by (Baldwin and Morton, 1998) for performing query based summarization. Here, however, the retrieved information attempts to meet the user's information needs rather than helping the user determine whether the entire document being summarized possibly meets that need. This system also differs in that it can present the user with information from multiple documents. While query-sensitive multi-document systems exist (Mani and Bloedorn, 1998), evaluating such systems for the purpose of comparison is difficult.

Our evaluation shows that the system performs better than the baseline, although the baseline performs surprisingly well. We believe that this is, in part, due to the lack of any notion of recall in the evaluation. While all queries were answered by multiple sentences, for some queries, such as 4, 5 and 10, it is not clear what benefit the retrieval of additional sentences would have. The baseline benefited from the fact that at least one of the answers typically contained most of the query terms. Classifying queries as single answer or multiple answer, and evaluating them separately, may provide a sharper distinction in performance.

Comparing the user's task with and without the system reveals a stark contrast in the amount of information that must be processed. On average the system required 290 bytes of text to display the answer to the query to the user. In contrast, had the user reviewed the documents in the order presented by the search engine, the answer on average would appear after more than 3000 bytes of text had been displayed.

7 Future Work

As a preliminary investigation into this task, many areas of future work were discovered.

7.1 Term Modeling

The treatment of entities and events needs to be extended to model the nouns which indicate events more robustly and to exclude relational

verbs from consideration as events. A probabilistic model of pronouns, where pronouns are treated as the basis for term expansion, should also be considered. Another area which requires attention is wh-words. Even a simple model would likely reduce the space of entities considered relevant in a sentence.

7.2 Tools

In order to be more effective, the models used for basic linguistic annotation, specifically the part of speech tagger, would need to be trained on a wider class of questions than is available in the Penn Treebank. The incorporation of a Named Entity Recognizer would provide additional categories on which co-occurrence statistics could be based and would likely prove helpful in the modeling of wh-words.

7.3 User Interaction

Finally, since many of the system's components are derived from unsupervised corpus analysis, the system's language models could be updated as the user searches. This may better characterize the distribution of words in the areas in which the user is interested, which could improve performance for that user.

8 Conclusion

We have presented a system which ranks sentences such that the answer to a user's query will be presented, on average, in under 300 bytes. The system does this by finding entities and events shared by the query and the documents and by modeling coreference relationships between them. While this is a preliminary investigation and many areas of interest have yet to be explored, the reduction in the amount of text the user must process to obtain the answers they want is already dramatic.

References

Breck Baldwin and Thomas Morton. 1998. Dynamic coreference-based summarization. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, Granada, Spain, June.

B. Baldwin, T. Morton, A. Bagga, J. Baldridge, R. Chandraseker, A. Dimitriadis, K. Snyder, and M. Wolska. 1998. Description of the UPENN CAMP system as used for coreference. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Baltimore, Maryland.

Inderjeet Mani and Eric Bloedorn. 1998. Machine learning of generic and user-focused summarization. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).

Gerald Salton. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, Inc.