Using Coreference in Question Answering
Thomas S. Morton
Department of Computer and Information Science
University of Pennsylvania
1 Introduction

In this paper we present an overview of the system used by the GE/Penn team for the Question Answering Track of TREC-8. Our system uses a simple sentence ranking method which is enhanced by the addition of coreference-annotated data as its input. We will present an overview of our initial system and its components. Since this was the first time this track has been run, we made numerous additions to our initial system. We will describe these additions and what motivated them as a series of lessons learned, after which the final system used for our submission will be described. Finally, we will discuss directions for future research.

2 Initial System Overview

The input to our system is a small set of candidate documents and a query. To get a set of candidate documents, we employed a search engine over the TREC-8 document collection and further processed the top 20 documents returned by it for each query. In order for our system to annotate coreference relations, a variety of linguistic annotation is required. This includes accounting for SGML tags in the original documents, performing sentence detection, tokenization, noun phrase detection, and named-entity categorization. With this annotation complete, the coreference system annotates coreference relations between noun phrases. The coreference-annotated document is then passed to a sentence ranker which ranks each of the sentences, merging these ranked sentences with the sentences from previously processed documents. Finally, the top 5 sentences are presented to the user.

2.1 Search Engine

In order to get a small collection of candidate documents, we installed and indexed the TREC-8 data set with the PRISE 2.0 search engine, developed by NIST. Indexing and retrieval were done using the default configuration with no attempts made to tune the ranking to the Question Answering task. From this we took the top 20 ranked documents and performed further processing on each of them.

2.2 Preprocessing

Determining coreference between noun phrases requires that the noun phrases in the text have been identified. This processing begins by preprocessing the SGML to determine likely boundaries between segments of text, sentence-detecting these segments using a sentence detector described in Reynar and Ratnaparkhi (1997), and tokenizing those sentences using a tokenizer described in Reynar (1998). The text can then be part-of-speech-tagged using the tagger described in Ratnaparkhi (1996), and finally noun phrases are determined using a maximum entropy model trained on the Penn Treebank (Marcus et al., 1994). The output of Nymble (Bikel et al., 1997), a named-entity recognizer which determines which words are people's names, organizations, locations, etc., is also used to aid in determining coreference relationships.

2.3 Coreference

Once preprocessing is completed, the system iterates through each of the noun phrases to determine if it refers to a noun phrase which has occurred previously. Only proper noun phrases, definite noun phrases, and non-possessive third person pronouns are considered. Proper noun phrases are determined by the part of speech assigned to the last word in the noun phrase. A proper noun phrase is considered coreferent with a previously occurring noun phrase if it is a substring of that noun phrase, excluding abbreviations and words which are not proper nouns. A noun phrase is considered definite if it begins with the determiner "the" or begins with a possessive pronoun or a past-participle verb. A definite noun phrase is considered coreferent with another noun phrase if the last word in the noun phrase matches the last word in a previously occurring noun phrase.
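The two string-matching heuristics above can be sketched as follows. This is an illustrative reconstruction, not the original code: the (word, part-of-speech) representation, the NNP tag, and the abbreviation test are assumptions.

```python
# Illustrative sketch of the proper-noun and definite-noun-phrase
# matching heuristics. Noun phrases are lists of (word, pos) pairs.

def proper_np_corefers(candidate, antecedent):
    """A proper noun phrase corefers with an earlier one if its words,
    ignoring abbreviations and non-proper-noun words, form a
    contiguous substring of the antecedent's words."""
    cand_words = [w for w, pos in candidate
                  if pos == "NNP" and not w.endswith(".")]
    ante_words = [w for w, _ in antecedent]
    n = len(cand_words)
    if n == 0:
        return False
    return any(ante_words[i:i + n] == cand_words
               for i in range(len(ante_words) - n + 1))

def definite_np_corefers(candidate, antecedent):
    """A definite noun phrase corefers with an earlier noun phrase
    when their last (head) words match."""
    return candidate[-1][0].lower() == antecedent[-1][0].lower()

mr_smith = [("Mr.", "NNP"), ("John", "NNP"), ("Smith", "NNP")]
smith = [("Smith", "NNP")]
the_company = [("the", "DT"), ("company", "NN")]
a_company = [("a", "DT"), ("software", "NN"), ("company", "NN")]

print(proper_np_corefers(smith, mr_smith))           # True
print(definite_np_corefers(the_company, a_company))  # True
```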
The mechanism for resolving pronouns consists of a maximum entropy model which examines two noun phrases and produces a probability that they co-refer. The 20 previously occurring noun phrases are considered as possible referents. The possibility that the pronoun refers to none of these noun phrases is also examined. The pair with the highest probability is considered coreferent, or the pronoun is left unresolved when the model predicts this as the most likely outcome. The model considers the following features:

1. The category of the noun phrase being considered, as determined by the named-entity recognizer.

2. The number of noun phrases that occur between the candidate noun phrase and the pronoun.

3. The number of sentences that occur between the candidate noun phrase and the pronoun.

4. Which noun phrase in a sentence is being referred to (first, second, ...).

5. In which noun phrase in a sentence the pronoun occurred (first, second, ...).

6. The pronoun being considered.

7. Whether the pronoun and the noun phrase are compatible in number.

8. If the candidate noun phrase is another pronoun, whether it is compatible with the referring pronoun.

9. If the candidate noun phrase is another pronoun, whether it is the same as the referring pronoun.

The model is trained on nearly 1200 annotated examples of pronouns which refer to or fail to refer to previously occurring noun phrases.
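The nine feature types might be computed along these lines. The dictionary representation, feature names, and encodings are hypothetical, and the maximum entropy training itself is omitted.

```python
# Hypothetical sketch of feature extraction for the pronoun model;
# all names and encodings here are illustrative assumptions.

def pronoun_features(candidate, pronoun):
    """Map a (candidate noun phrase, pronoun) pair to string-valued
    features mirroring the nine feature types listed above."""
    feats = {
        "category": candidate["ne_category"],                              # 1
        "np_distance": str(pronoun["np_index"] - candidate["np_index"]),   # 2
        "sent_distance": str(pronoun["sentence"] - candidate["sentence"]), # 3
        "cand_position": str(candidate["pos_in_sentence"]),                # 4
        "pron_position": str(pronoun["pos_in_sentence"]),                  # 5
        "pronoun": pronoun["text"].lower(),                                # 6
        "number_match": str(candidate["number"] == pronoun["number"]),     # 7
    }
    if candidate["is_pronoun"]:                                            # 8, 9
        feats["pron_compatible"] = str(candidate["number"] == pronoun["number"])
        feats["same_pronoun"] = str(candidate["text"].lower()
                                    == pronoun["text"].lower())
    return feats

candidate_np = {"text": "Smith", "ne_category": "person", "np_index": 3,
                "sentence": 1, "pos_in_sentence": 2, "number": "singular",
                "is_pronoun": False}
pron = {"text": "he", "np_index": 5, "sentence": 2, "pos_in_sentence": 1,
        "number": "singular", "is_pronoun": True}
print(pronoun_features(candidate_np, pron)["np_distance"])  # 2
```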
2.4 Sentence Ranking

Sentences are ranked based on the sum of the idf weights (Salton, 1989) for each unique term which occurs in the sentence and also occurs in the query. The idf weights are computed based on the documents found on TREC discs 4 and 5 (Voorhees and Harman, 1997). No additional score is given for tokens occurring more than once in a sentence. If a sentence contains a coreferential noun phrase, then the terms contained in any of the noun phrases with which it is coreferent are also considered to be contained in the sentence.

A secondary weight was also used to resolve ties in the first weight ranking and to determine how sentences longer than 250 bytes should be truncated. The secondary weight was computed for each noun phrase based on the sum of the idf weights for each unique term, where words occurring farther away from the noun phrase were discounted. This was done by adding the product of the idf weight for a word and the reciprocal of the distance, in words, between the noun phrase and the word. For example, a word three tokens to the left of a noun phrase would only receive a third of its idf weight with respect to that noun phrase. This weight was used to select a "most central" noun phrase, and the weight of this noun phrase was used to resolve ties between sentences equally ranked by the first score. In cases where a sentence was longer than 250 bytes, this noun phrase was used to determine where the sentence would be truncated.
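The distance-discounted secondary weight can be sketched as follows; the tokenization, idf values, and the convention of measuring distance from a single noun-phrase position are simplifying assumptions.

```python
# Minimal sketch of the secondary, distance-discounted weight: each
# unique query word in the sentence contributes idf(word) divided by
# its distance, in words, from the noun phrase.

def centrality_weight(tokens, np_index, query_terms, idf):
    """Sum idf(word) / distance for each unique query word in the
    sentence, measuring word distance from position np_index."""
    total = 0.0
    seen = set()
    for i, word in enumerate(tokens):
        if word in query_terms and word not in seen and i != np_index:
            seen.add(word)
            total += idf.get(word, 0.0) / abs(i - np_index)
    return total

idf = {"volcano": 4.0, "erupted": 3.0}
tokens = ["the", "volcano", "near", "the", "city", "erupted"]
# From position 4 ("city"): "volcano" is 3 words away, "erupted" 1.
w = centrality_weight(tokens, 4, {"volcano", "erupted"}, idf)
print(w)  # 4.0/3 + 3.0/1 ≈ 4.333
```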
3 Lessons Learned

3.1 Lesson 1

Our first goal was to develop a baseline with which we could compare our system's output. The simplest baseline we could imagine would be to simply rank segments of text based on the common tf·idf measure. Since these segments were small, having a maximum of 250 bytes, we ignored the term frequency component, and the query and segment were treated as a set of terms rather than a bag. Each segment was ranked based on the sum of the idf weights for the words in that segment which, once stemmed, matched those found in the query. Each segment was a 250-byte window centered on a term which was also found in the query. On the development set provided, this produced an answer in the top five sentences for nearly half (17/38) of the questions provided for development. This allowed us to better assess the added value various types of linguistic annotation would provide.
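The baseline scoring can be sketched as follows; `crude_stem` is a hypothetical stand-in for whatever stemmer the original system used.

```python
# Sketch of the idf-only baseline: score a segment by summing idf
# weights of unique stemmed terms it shares with the query.

def crude_stem(word):
    """Toy suffix-stripping stemmer (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def baseline_score(segment, query, idf):
    seg_terms = {crude_stem(w.lower()) for w in segment.split()}
    query_terms = {crude_stem(w.lower()) for w in query.split()}
    # Sets, not bags: each matching term counts once.
    return sum(idf.get(t, 0.0) for t in seg_terms & query_terms)

idf = {"erupt": 3.0, "volcano": 4.0, "the": 0.1}
score = baseline_score("The volcano erupted violently",
                       "When did the volcano erupt", idf)
print(score)  # 0.1 + 4.0 + 3.0 = 7.1
```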
3.2 Lesson 2

Performing linguistic annotation of the documents in the collection is computationally expensive. While we only examined the top 20 returned documents, some of these documents were very long, often exceeding 2MB. To combat this, each document was reduced to a 20K segment using the 250-byte segment ranked first by the baseline as the center of the 20K segment. This sped up processing considerably but had no noticeable effect on system output. This may be because some question generation was based in part on reading the documents and creating questions which were answered by that document. This may have led to a bias for shorter documents.

3.3 Lesson 3

Many questions indicate a semantic category that the answer should fall in based on the Wh-word the sentence uses. For Wh-words such as "Who", "Where", and "When", a fairly specific category is specified, while the category for "What" and "Which" is usually specified by the noun phrase following it. "How" can be used to specify a variety of types; however, when it is followed by words such as "many", "long", or "fast", the answer will likely contain a number of some sort. When the semantic type of the answer could be determined, and this could be mapped to a category that was determined by the named-entity recognizer or some other recognizable pattern, then only sentences which evoked an entity of the same category were considered. This included sentences which contained pronouns which referred to entities of the correct type in preceding sentences. Sentences which contained the correct entity type, but in which all entities of this type were present in the query, were also ignored. This processing helped exclude sentences which only used terms in the query and would be highly ranked even though they did not contain a possible answer.
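The Wh-word mapping might look like this as a sketch; the category labels are illustrative rather than the recognizer's actual tags, and the "What"/"Which" cases, which depend on the following noun phrase, are left unhandled.

```python
# Sketch of mapping a question's Wh-word to an expected answer
# category (category names are illustrative assumptions).

def expected_category(question):
    tokens = question.lower().split()
    wh = tokens[0]
    if wh == "who":
        return "person"
    if wh == "where":
        return "location"
    if wh == "when":
        return "date"
    if wh == "how" and len(tokens) > 1 and tokens[1] in ("many", "long", "fast"):
        return "number"
    # "What"/"Which" need the following noun phrase; not handled here.
    return None

print(expected_category("Who wrote Hamlet?"))               # person
print(expected_category("How many moons does Mars have?"))  # number
print(expected_category("What is a caldera?"))              # None
```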
3.4 Lesson 4

The semantic category for questions which ask for a date can usually be determined. These are also categories that the named-entity recognizer identified, and so the system was quite effective at finding candidate answers to these sorts of questions. However, the form of the answer often did not meet the needs of the user. Within the context of a newspaper article, relative date terms such as today, Tuesday, last week, or next month can be interpreted by a reader based on context; however, when this context is removed, the meaning of these terms is often unclear. All the articles in this collection contain datelines which often make it possible to automatically resolve such terms for the user. For these terms we used the dateline as a base reference for when the article was written and then used a small set of heuristics to determine a complete description of a date term. Additional terms introduced by the more complete description of the relative date term were also considered to be in that sentence. This was especially helpful when these terms were in the query and would not have matched this sentence without such processing. This processing was also helpful when presenting sentences to the user. When sentences contained relative date terms, the parts of the description which were not present in the relative date term were inserted after it to improve the user's understanding of the text without its context.
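A much-simplified sketch of dateline-based resolution, assuming the dateline has already been parsed into a date; the heuristics shown are illustrative and cover only a few of the term types discussed.

```python
# Sketch of resolving relative date terms against an article's
# dateline; the rules here are illustrative assumptions, far simpler
# than the original heuristic set.
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_date_term(term, dateline):
    """Map a term like 'today' or 'Tuesday' to a concrete date,
    using the dateline as the reference point."""
    term = term.lower()
    if term == "today":
        return dateline
    if term == "yesterday":
        return dateline - timedelta(days=1)
    if term in WEEKDAYS:
        # The most recent such weekday on or before the dateline.
        delta = (dateline.weekday() - WEEKDAYS.index(term)) % 7
        return dateline - timedelta(days=delta)
    return None  # term not covered by these heuristics

dateline = date(1998, 7, 17)  # a Friday
print(resolve_date_term("Tuesday", dateline))  # 1998-07-14
```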
3.5 Lesson 5

While linguistic processing is helpful in determining answers to a variety of questions, some information needs can be satisfied with much simpler means. For questions asking "Where is X" or "What is the capital of X", a good online dictionary will usually provide the answer within a few keystrokes. For these two types of questions, we automatically extracted a set of probable answers and added these to the query. This improved system performance and did a better job of addressing the user's intentions than the system without this information. Specifically, the dictionary provided answers with a better level of generalization than the system did without these additional query terms.

4 Final System

Our final system examined the query and added terms from an online dictionary when applicable. This expanded query was then passed to the search engine, and the top 20 documents returned by it were collected for further processing. The baseline system was run on these documents to find a central passage, and a 20K window around this passage was kept for further processing. Preprocessing was performed on these segments, and coreference relations between entities and dates were automatically annotated. Finally, sentences which weren't excluded by the semantic-category filtering were ranked using the simple idf weighting described above. The top-ranked sentences were augmented to include complete descriptions of coreferential terms, such as definite noun phrases, proper nouns, pronouns, and dates, which were not already present in the sentence. These augmented sentences were then presented to the user.
5 Results

The part of the TREC-8 Question Answering Track evaluation in which we participated allowed 5 answers to be submitted, each of which could be at most 250 bytes long. For the 198 questions in the evaluation, our system was able to answer 126 of them, or 63.6%. If answers are weighted by rank, our mean reciprocal rank was 0.510. This compared favorably with other systems; of the 20 participants, our system ranked 4th overall.
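The mean reciprocal rank measure can be sketched as follows, assuming each question is scored by the rank of its first correct answer among the five submitted.

```python
# Sketch of mean reciprocal rank (MRR): each question contributes
# 1/rank of its first correct answer, or 0 if no answer is correct.

def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks: one entry per question, the rank (1-5) of
    the first correct answer, or None if no answer was correct."""
    scores = [1.0 / r if r is not None else 0.0
              for r in first_correct_ranks]
    return sum(scores) / len(scores)

# Four questions: answered at ranks 1, 3, and 2; one unanswered.
print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4
```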
6 Discussion and Future Work

Attending TREC-8 provided us with additional insights for future work. The most significant of these is that in the future more attention needs to be paid to indexing. Specifically, we discovered that the search engine we used, PRISE 2.0, was significantly below the state of the art in performance at the ad-hoc task. To compare its performance at the Question Answering task, we considered all the documents in which some participant had found a correct answer. This is likely not the complete set of documents which contain the answer, but it serves as a reasonable approximation. We then compared the number of these documents that PRISE found to the number found by AT&T's search engine. The result is that the AT&T search engine returned 146 more documents, over all queries, in which some system found the answer than the PRISE search engine did. For 38 queries, the PRISE system returned no documents in which an answer was found, while the AT&T system did this for only 35 queries. We should also explore the possibility of examining more than 20 documents. This is evidenced by the fact that if all 200 documents returned by the AT&T system are considered, then a document containing an answer was provided for at least 187 of the 198 queries. In a similar vein, we also hope to look at alternate indexing schemes, such as paragraph indexing, which was used in Southern Methodist University's system.
7 Conclusion
Here we present a system for performing question answering on a large collection of text. This system uses a simple ranking method, which is aided by determining coreference relations to add terms to a sentence and by determining the semantic category of the answer to exclude some sentences from consideration. We believe coreference plays an important role in question answering, as it allows a system to extract answers from text which refers to but doesn't explicitly mention an entity. It also provides a means to make text presented to the user without its original context easier to understand. Determining the semantic category that the answer will be in, and the entities which fall into that category, is also useful: it allows sentences which do not contain a possible answer to be excluded from consideration. This system performed well at the evaluation and we look forward to improving its performance for future evaluations.
References
D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1994. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330.

Adwait Ratnaparkhi. 1996. A Maximum Entropy Part of Speech Tagger. In Eric Brill and Kenneth Church, editors, Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, May 17-18.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19, Washington, D.C., April.

Jeff Reynar. 1998. Topic Segmentation: Algorithms and Applications. Ph.D. thesis, University of Pennsylvania.

Gerald Salton. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, Inc.

Ellen M. Voorhees and Donna Harman. 1997. Overview of the Fifth Text REtrieval Conference (TREC-5). In Proceedings of the Fifth Text REtrieval Conference (TREC-5), pages 1-28. NIST Special Publication 500-238.