Using Coreference in Question Answering

Thomas S. Morton

Department of Computer and Information Science

University of Pennsylvania

[email protected] du

1 Introduction

In this paper we present an overview of the system used by the GE/Penn team for the Question Answering Track of TREC-8. Our system uses a simple sentence ranking method which is enhanced by the addition of coreference-annotated data as its input. We will present an overview of our initial system and its components. Since this was the first time this track has been run, we made numerous additions to our initial system. We will describe these additions and what motivated them as a series of lessons learned, after which the final system used for our submission will be described. Finally, we will discuss directions for future research.

2 Initial System Overview

The input to our system is a small set of candidate documents and a query. To get a set of candidate documents, we employed a search engine over the TREC-8 document collection and further processed the top 20 documents returned by it for each query. In order for our system to annotate coreference relations, a variety of linguistic annotation is required. This includes accounting for SGML tags in the original documents, performing sentence detection, tokenization, noun phrase detection, and named-entity categorization. With this annotation complete, the coreference system annotates coreference relations between noun phrases. The coreference-annotated document is then passed to a sentence ranker, which ranks each of the sentences, merging these ranked sentences with the sentences from previously processed documents. Finally, the top 5 sentences are presented to the user.
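The control flow just described can be summarized in a short sketch. The Python below is illustrative only: the four callables stand in for the components described in Sections 2.1 through 2.4, and none of the names are taken from the actual implementation.

    from typing import Callable, Iterable, List, Tuple

    def answer(query: str,
               retrieve: Callable[[str], Iterable[str]],   # Section 2.1
               preprocess: Callable[[str], list],          # Section 2.2
               add_coreference: Callable[[list], list],    # Section 2.3
               rank: Callable[[list, str], List[Tuple[float, str]]],  # 2.4
               k_docs: int = 20,
               k_answers: int = 5) -> List[Tuple[float, str]]:
        """Wiring of the initial system; only the flow is from the text."""
        ranked: List[Tuple[float, str]] = []
        for doc in list(retrieve(query))[:k_docs]:
            sentences = add_coreference(preprocess(doc))
            # Merge this document's ranked sentences with those from
            # previously processed documents, best score first.
            ranked = sorted(ranked + rank(sentences, query),
                            key=lambda pair: pair[0], reverse=True)
        return ranked[:k_answers]  # the top 5 sentences shown to the user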

2.1 Search Engine

In order to get a small collection of candidate documents, we installed and indexed the TREC-8 data set with the PRISE 2.0 search engine, developed by NIST. Indexing and retrieval were done using the default configuration, with no attempts made to tune the ranking to the Question Answering task. From this we took the top 20 ranked documents and performed further processing on each of them.

2.2 Preprocessing

Determining coreference between noun phrases requires that the noun phrases in the text have been identified. This processing begins by preprocessing the SGML to determine likely boundaries between segments of text, sentence-detecting these segments using the sentence detector described in (Reynar and Ratnaparkhi, 1997), and tokenizing those sentences using the tokenizer described in (Reynar, 1998). The text can then be part-of-speech tagged using the tagger described in (Ratnaparkhi, 1996), and finally noun phrases are determined using a maximum entropy model trained on the Penn Treebank (Marcus et al., 1994). The output of Nymble (Bikel et al., 1997), a named-entity recognizer which determines which words are people's names, organizations, locations, etc., is also used to aid in determining coreference relationships.

2.3 Coreference

Once preprocessing is completed, the system iterates through each of the noun phrases to determine whether it refers to a noun phrase which has occurred previously. Only proper noun phrases, definite noun phrases, and non-possessive third-person pronouns are considered. Proper noun phrases are determined by the part of speech assigned to the last word in the noun phrase. A proper noun phrase is considered coreferent with a previously occurring noun phrase if it is a substring of that noun phrase, excluding abbreviations and words which are not proper nouns. A noun phrase is considered definite if it begins with the determiner "the", a possessive, or a past-participle verb. A definite noun phrase is considered coreferent with another noun phrase if the last word in the noun phrase matches the last word in a previously occurring noun phrase.
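The two matching rules lend themselves to a direct sketch. The following is a minimal rendering, assuming noun phrases arrive as token lists with the head word last; the part-of-speech checks and the abbreviation exclusions of the real system are omitted.

    def proper_np_corefers(np, antecedent):
        """A proper NP corefers with an earlier NP if its token
        sequence is a contiguous substring of that NP (the real
        system also excludes abbreviations and non-proper nouns)."""
        n = len(np)
        return any(antecedent[i:i + n] == np
                   for i in range(len(antecedent) - n + 1))

    def definite_np_corefers(np, antecedent):
        """A definite NP corefers with an earlier NP if the last
        (head) words of the two phrases match."""
        return np[-1].lower() == antecedent[-1].lower()

    assert proper_np_corefers(["Clinton"], ["President", "Bill", "Clinton"])
    assert definite_np_corefers(["the", "treaty"], ["a", "new", "treaty"])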

The mechanism for resolving pronouns consists of a maximum entropy model which examines two noun phrases and produces a probability that they corefer. The 20 previously occurring noun phrases are considered as possible antecedents. The possibility that the pronoun refers to none of these noun phrases is also examined. The pair with the highest probability is considered coreferent, or the pronoun is left unresolved when the model predicts this as the most likely outcome. The model considers the following features:

1. The category of the noun phrase being considered, as determined by the named-entity recognizer.

2. The number of noun phrases that occur between the candidate noun phrase and the pronoun.

3. The number of sentences that occur between the candidate noun phrase and the pronoun.

4. Which noun phrase in a sentence is being referred to (first, second, ...).

5. In which noun phrase in a sentence the pronoun occurred (first, second, ...).

6. The pronoun being considered.

7. Whether the pronoun and the noun phrase are compatible in number.

8. If the candidate noun phrase is another pronoun, whether it is compatible with the referring pronoun.

9. If the candidate noun phrase is another pronoun, whether it is the same as the referring pronoun.

The model is trained on nearly 1200 annotated examples of pronouns which refer, or fail to refer, to previously occurring noun phrases.
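The parametric form of the model is not given in the paper; a conditional maximum entropy model of the kind cited is conventionally written as

    p(\mathrm{coref} \mid \pi, \alpha) =
        \frac{\exp\big(\sum_i \lambda_i f_i(\pi, \alpha, \mathrm{coref})\big)}
             {Z(\pi, \alpha)}

where \pi is the pronoun, \alpha is a candidate antecedent (or the "unresolved" outcome), the f_i are the features enumerated above, the \lambda_i are learned weights, and Z(\pi, \alpha) normalizes over the two outcomes. This formulation is an assumption based on the standard maxent literature, not a detail stated in the text.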

2.4 Sentence Ranking

Sentences are ranked based on the sum of the idf weights (Salton, 1989) for each unique term which occurs in the sentence and also occurs in the query. The idf weights are computed based on the documents found on TREC discs 4 and 5 (Voorhees and Harman, 1997). No additional score is given for tokens occurring more than once in a sentence. If a sentence contains a coreferential noun phrase, then the terms contained in any of the noun phrases with which it is coreferent are also considered to be contained in the sentence.
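Written out, with T(s) the set of unique terms in sentence s (augmented with terms from coreferent noun phrases) and T(q) the query terms, the first-pass score is

    \mathrm{score}_1(s) = \sum_{t \in T(s) \cap T(q)} \mathrm{idf}(t),
    \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}

where N is the number of documents on TREC discs 4 and 5 and df(t) the number of those containing t. The text only cites (Salton, 1989) for idf, so the logarithmic form shown here is an assumption rather than a detail from the paper.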

A secondary weight was also used to resolve ties in the first weight ranking and to determine how sentences longer than 250 bytes should be truncated. The secondary weight was computed for each noun phrase based on the sum of the idf weights for each unique term, where words occurring farther away from the noun phrase were discounted. This was done by adding the product of the idf weight for a word and the reciprocal of the distance, in words, between the noun phrase and the word. For example, a word three tokens to the left of a noun phrase would receive only a third of its idf weight with respect to that noun phrase. This weight was used to select a "most central" noun phrase, and the weight of this noun phrase was used to resolve ties between sentences ranked equally by the first score. In cases where a sentence was longer than 250 bytes, this noun phrase was used to determine where the sentence would be truncated.
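The secondary weight translates almost directly into code. A sketch, assuming a sentence is a token list, a noun phrase is a [start, end) token span, and idf lookup is supplied as a dictionary; each unique term is counted once, at its first occurrence:

    def centrality(tokens, np_span, idf):
        """Secondary weight for one noun phrase (Section 2.4): each
        unique term outside the phrase contributes its idf discounted
        by the reciprocal of its distance, in words, from the phrase."""
        start, end = np_span          # [start, end) token indices
        weight, seen = 0.0, set()
        for i, tok in enumerate(tokens):
            if start <= i < end or tok in seen:
                continue              # skip the phrase itself; terms count once
            seen.add(tok)
            # distance to the nearest edge of the noun phrase
            dist = start - i if i < start else i - end + 1
            weight += idf.get(tok, 0.0) / dist
        return weight

    # A word three tokens to the left of the phrase gets idf/3, as in the text.
    idf = {"treaty": 2.4}
    assert abs(centrality(["treaty", "x", "y", "the", "accord"], (3, 5), idf)
               - 2.4 / 3) < 1e-9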

3 Lessons Learned

3.1 Lesson 1

Our first goal was to develop a baseline with which we could compare our system's output. The simplest baseline we could imagine would be to simply rank segments of text based on the common tf·idf measure. Since these segments were small, having a maximum of 250 bytes, we ignored the term frequency component, and the query and segment were treated as a set of terms rather than a bag. Each segment was ranked based on the sum of the idf weights for the words in that segment which, once stemmed, matched those found in the query. Each segment was a 250-byte window centered on a term which was also found in the query. On the development set provided, this produced an answer in the top five sentences for nearly half (17/38) of the questions provided for development. This allowed us to better assess the added value various types of linguistic annotation would provide.
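A compact sketch of this baseline, assuming a stemmer and an idf table are supplied; windows are 250-byte spans of raw text centered on occurrences of query terms:

    import re

    def baseline_windows(text, query_terms, stem, idf, width=250):
        """Lesson 1 baseline: score every 250-byte window centered on
        a query-term occurrence by the summed idf of the *set* of
        stemmed window words that also occur in the stemmed query."""
        qstems = {stem(t) for t in query_terms}
        scored = []
        for m in re.finditer(r"\S+", text):
            if stem(m.group()) not in qstems:
                continue
            center = (m.start() + m.end()) // 2
            lo = max(0, center - width // 2)
            window = text[lo:lo + width]
            stems = {stem(w) for w in window.split()}   # a set, not a bag
            scored.append((sum(idf.get(s, 0.0) for s in stems & qstems),
                           window))
        return sorted(scored, reverse=True)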

3.2 Lesson 2

Performing linguistic annotation of the documents in the collection is computationally expensive. While we only examined the top 20 returned documents, some of these documents were very long, often exceeding 2MB. To combat this, each document was reduced to a 20K segment, using the 250-byte segment ranked first by the baseline as the center of the 20K segment. This sped up processing considerably but had no noticeable effect on system output. This may be because some question generation was based in part on reading the documents and creating questions which were answered by that document. This may have led to a bias toward shorter documents.
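The reduction is then a thin layer over the baseline: keep a 20K slice of the document centered on its best-scoring 250-byte window. The sketch below reuses the hypothetical baseline_windows function from Lesson 1 and assumes at least one query term occurs in the text:

    def reduce_document(text, query_terms, stem, idf, keep=20_000):
        """Truncate a long document to a 20K segment centered on the
        250-byte window the baseline ranks first (Lesson 2)."""
        if len(text) <= keep:
            return text
        _, best = max(baseline_windows(text, query_terms, stem, idf))
        center = text.find(best) + len(best) // 2
        lo = max(0, min(center - keep // 2, len(text) - keep))
        return text[lo:lo + keep]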

3.3 Lesson 3

Many questions indicate a semantic category that the answer should fall in, based on the Wh-word the sentence uses. For Wh-words such as "Who", "Where", and "When", a fairly specific category is specified, while the category for "What" and "Which" is usually specified by the noun phrase following it. "How" can be used to specify a variety of types; however, when it is followed by words such as "many", "long", or "fast", the answer will likely contain a number of some sort. When the semantic type of the answer could be determined, and this could be mapped to a category that was determined by the named-entity recognizer or some other recognizable pattern, then only sentences which evoked an entity of the same category were considered. This included sentences which contained pronouns referring to entities of the correct type in preceding sentences. Sentences which contained the correct entity type, but in which all entities of this type were present in the query, were also ignored. This processing helped exclude sentences which only used terms in the query and would be highly ranked even though they did not contain a possible answer.
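A minimal sketch of this filter follows. The mapping and the category labels are illustrative, chosen to match the entity types named in Section 2.2; the paper does not list the system's actual label set.

    WH_CATEGORY = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}
    NUMERIC_HOW = {"many", "long", "fast"}   # "how many", "how long", ...

    def answer_category(question_tokens):
        """Guess the answer's semantic category from the Wh-word;
        None means no category could be determined (e.g. "what",
        whose category comes from the following noun phrase)."""
        first = question_tokens[0].lower()
        if first in WH_CATEGORY:
            return WH_CATEGORY[first]
        if (first == "how" and len(question_tokens) > 1
                and question_tokens[1].lower() in NUMERIC_HOW):
            return "NUMBER"
        return None

    def keep_sentence(entities, category, query_entities):
        """entities: (category, text) pairs evoked by a sentence,
        including pronouns resolved to entities in earlier sentences.
        Keep the sentence only if it evokes an entity of the required
        category that is not already present in the query."""
        return category is None or any(
            kind == category and text not in query_entities
            for kind, text in entities)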

3.4 Lesson 4

The semantic category for questions which ask for a date can usually be determined. These are also categories that the named-entity recognizer identified, and so the system was quite effective at finding candidate answers to these sorts of questions. However, the form of the answer often did not meet the needs of the user. Within the context of a newspaper article, relative date terms such as today, Tuesday, last week, or next month can be interpreted by a reader based on context; however, when this context is removed, the meaning of these terms is often unclear. All the articles in this collection contain datelines, which often make it possible to automatically resolve such terms for the user. For these terms we used the dateline as a base for when the article was written, and then used a small set of heuristics to determine a complete description of a date term. Additional terms introduced by the more complete description of the relative date term were also considered to be in that sentence. This was especially helpful when these terms were in the query and would not have matched the sentence without such processing. This processing was also helpful when presenting sentences to the user: when sentences contained relative date terms, the parts of the description which were not present in the relative date term were inserted after it, to improve the user's understanding of the text without its original context.
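The dateline heuristics might look like the sketch below, which covers a few of the relative terms mentioned above; the specific rules are illustrative assumptions, as the paper does not enumerate its heuristics.

    import datetime

    WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"]

    def resolve(term, dateline):
        """Expand a relative date term to an absolute date, using the
        article's dateline as the reference point (Lesson 4). Only a
        few illustrative rules are shown."""
        t = term.lower()
        if t == "today":
            return dateline
        if t == "last week":
            return dateline - datetime.timedelta(weeks=1)
        if t == "next month":
            m = dateline.month % 12 + 1
            y = dateline.year + (1 if m == 1 else 0)
            return dateline.replace(year=y, month=m, day=1)
        if t in WEEKDAYS:
            # most recent such weekday, on or before the dateline
            delta = (dateline.weekday() - WEEKDAYS.index(t)) % 7
            return dateline - datetime.timedelta(days=delta)
        return None

    # "Tuesday" in an article datelined Friday 1999-04-23 -> 1999-04-20
    print(resolve("Tuesday", datetime.date(1999, 4, 23)))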

3.5 Lesson 5

While linguistic processing is helpful in determining answers to a variety of questions, some information needs can be satisfied with much simpler means. For questions asking "Where is X" or "What is the capital of X", a good online dictionary will usually provide the answer within a few keystrokes. For these two types of questions, we automatically extracted a set of probable answers and added these to the query. This improved system performance and did a better job of addressing the user's intentions than the system without this information. Specifically, the dictionary provided answers with a better level of generalization than the system did without these additional query terms.
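A toy sketch of this expansion; the gazetteer below stands in for the online dictionary, which the paper does not identify:

    # Toy stand-in for the unnamed online dictionary.
    GAZETTEER = {
        "capital of ireland": ["Dublin"],
        "where is belize": ["Central", "America"],
    }

    def expand_query(question):
        """For "Where is X" / "What is the capital of X" questions,
        add probable answers to the query terms so that sentences
        containing the answer rank higher (Lesson 5)."""
        key = question.lower().rstrip("?")
        terms = question.rstrip("?").split()
        for pattern, answers in GAZETTEER.items():
            if pattern in key:
                terms += answers   # probable answers become query terms
        return terms

    # expand_query("What is the capital of Ireland?")
    # -> ['What', 'is', 'the', 'capital', 'of', 'Ireland', 'Dublin']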

4 Final System

Our final system examined the query and added terms from an online dictionary when applicable. This expanded query was then passed to the search engine, and the top 20 documents returned by it were collected for further processing. The baseline system was run on these documents to find a central passage, and a 20K window around this passage was kept for further processing. Preprocessing was performed on these segments, and coreference relations between entities and dates were automatically annotated. Finally, sentences which weren't excluded by the semantic-category filtering were ranked using the simple idf weighting described above. The top-ranked sentences were augmented to include complete descriptions of coreferential terms, such as definite noun phrases, proper nouns, pronouns, and dates, which were not already present in the sentence. These augmented sentences were then presented to the user.

5 Results

The part of the TREC-8 Question Answering Track evaluation in which we participated allowed 5 answers to be submitted, each of which could be at most 250 bytes long. For the 198 questions in the evaluation, our system was able to answer 126 of them, or 63.6%. If answers are weighted by rank, our mean reciprocal rank was 0.510. This compared favorably with other systems; of the 20 participants, our system ranked 4th overall.

6 Discussion and Future Work

Attending TREC-8 provided us with additional insights for future work. The most significant of these is that in the future more attention needs to be paid to indexing. Specifically, we discovered that the search engine we used, PRISE 2.0, was significantly below the state of the art in performance at the ad-hoc task. To compare its performance at the Question Answering task, we considered all the documents in which some participant had found a correct answer. This is likely not the complete set of documents which contain the answer, but it serves as a reasonable approximation. We then compared the number of these documents that PRISE found to the number found by AT&T's search engine. The result is that, over all queries, the AT&T search engine returned 146 more documents in which some system found the answer than the PRISE search engine did. For 38 queries, the PRISE system returned no documents in which an answer was found, while the AT&T system did this for only 35 queries. We should also explore the possibility of examining more than 20 documents. This is evidenced by the fact that if all 200 documents returned by the AT&T system are considered, then a document containing an answer was provided for at least 187 of the 198 queries. In a similar vein, we also hope to look at alternate indexing schemes, such as the paragraph indexing used in Southern Methodist University's system.

7 Conclusion

Here we present a system for performing question answering on a large collection of text. This system uses a simple ranking method, which is aided by determining coreference relations to add terms to a sentence, and by determining the semantic category of the answer to exclude some sentences from consideration. We believe coreference plays an important role in question answering, as it allows a system to extract answers from text which refers to, but doesn't explicitly mention, an entity. It also provides a means to make text presented to the user without its original context easier to understand. Determining the semantic category that the answer will be in, and the entities which fall into that category, is also useful: it allows sentences which do not contain a possible answer to be excluded from consideration. This system performed well at the evaluation, and we look forward to improving its performance for future evaluations.

References

D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1994. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Part of Speech Tagger. In Eric Brill and Kenneth Church, editors, Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, May 17-18.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19, Washington, D.C., April.

Jeff Reynar. 1998. Topic Segmentation: Algorithms and Applications. Ph.D. thesis, University of Pennsylvania.

Gerard Salton. 1989. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Publishing Company, Inc.

Ellen M. Voorhees and Donna Harman. 1997. Overview of the Fifth Text REtrieval Conference (TREC-5). In Proceedings of the Fifth Text REtrieval Conference (TREC-5), pages 1-28. NIST 500-238.