Methods for Using Textual Entailment in Open-Domain Question Answering

Sanda Harabagiu and Andrew Hickl
Language Computer Corporation
1701 North Collins Boulevard
Richardson, Texas 75080 USA
[email protected]

Abstract

Work on the semantics of questions has argued that the relation between a question and its answer(s) can be cast in terms of logical entailment. In this paper, we demonstrate how computational systems designed to recognize textual entailment can be used to enhance the accuracy of current open-domain automatic question answering (Q/A) systems. In our experiments, we show that when textual entailment information is used to either filter or rank answers returned by a Q/A system, accuracy can be increased by as much as 20% overall.

1 Introduction

Open-Domain Question Answering (Q/A) systems return a textual expression, identified from a vast document collection, as a response to a question asked in natural language. In the quest for producing accurate answers, the open-domain Q/A problem has been cast as: (1) a pipeline of linguistic processes pertaining to the processing of questions, relevant passages, and candidate answers, interconnected by several types of lexico-semantic feedback (cf. (Harabagiu et al., 2001; Moldovan et al., 2002)); (2) a combination of language processes that transform questions and candidate answers into logic representations such that reasoning systems can select the correct answer based on their proofs (cf. (Moldovan et al., 2003)); (3) a noisy-channel model which selects the most likely answer to a question (cf. (Echihabi and Marcu, 2003)); or (4) a constraint satisfaction problem, where sets of auxiliary questions are used to provide more information and better constrain the answers to individual questions (cf. (Prager et al., 2004)). While different in their approach, each of these frameworks seeks to approximate the forms of semantic inference that will allow them to identify valid textual answers to natural language questions.

Recently, the task of automatically recognizing one form of semantic inference – textual entailment – has received much attention from groups participating in the 2005 and 2006 PASCAL Recognizing Textual Entailment (RTE) Challenges (Dagan et al., 2005).¹

As currently defined, the RTE task requires systems to determine whether, given two text fragments, the meaning of one text could be reasonably inferred, or textually entailed, from the meaning of the other text. We believe that systems developed specifically for this task can provide current question-answering systems with valuable semantic information that can be leveraged to identify exact answers from ranked lists of candidate answers. By replacing the pairs of texts evaluated in the RTE Challenge with combinations of questions and candidate answers, we expect that textual entailment could provide yet another mechanism for approximating the types of inference needed in order to answer questions accurately.

In this paper, we present three different methods for incorporating systems for textual entailment into the traditional Q/A architecture employed by many current systems. Our experimental results indicate that (even at their current level of performance) textual entailment systems can substantially improve the accuracy of Q/A, even when no other form of semantic inference is employed.

The remainder of the paper is organized as follows. Section 2 describes the three methods of using textual entailment in open-domain question answering that we have identified, while Section 3 presents the textual entailment system we have used. Section 4 details our experimental methods and our evaluation results. Finally, Section 5 provides a discussion of our findings, and Section 6 summarizes our conclusions.

¹ http://www.pascal-network.org/Challenges/RTE
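As a concrete illustration of this recasting, the following minimal Python sketch (ours, not part of any system described in this paper; the names are hypothetical) packages a question and a candidate answer as the kind of text-hypothesis pair that an RTE system judges:

```python
from dataclasses import dataclass

@dataclass
class EntailmentPair:
    """A text-hypothesis pair in the style of the PASCAL RTE task."""
    text: str        # the "text" side: a candidate answer passage
    hypothesis: str  # the "hypothesis" side: the user question (or a question generated from the answer)

def qa_to_rte_pair(question: str, candidate_answer: str) -> EntailmentPair:
    """Recast a question/candidate-answer combination as an RTE-style pair,
    so that an entailment system can judge whether the answer supports the question."""
    return EntailmentPair(text=candidate_answer, hypothesis=question)

# Example drawn from the paper's running example (Section 2).
pair = qa_to_rte_pair(
    "What did Peter Minuit buy for the equivalent of $24.00?",
    "Everyone knows that, back in 1626, Peter Minuit bought Manhattan "
    "from the Indians for $24 worth of trinkets.",
)
print(pair.hypothesis, "=>", pair.text[:40], "...")
```

Treating the question itself as the hypothesis is a simplification; the methods in Section 2 also pair the question with questions automatically generated from candidate answers.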

[Figure 1: Integrating Textual Entailment in Q/A. The figure shows the question answering system (Question Processing, Passage Retrieval, and Answer Processing modules), the AutoQUAB question-generation module, and the Textual Entailment system applied in three ways: Method 1 re-ranks the ranked list of answers, Method 2 ranks the retrieved paragraphs, and Method 3 operates on entailed questions generated from candidate passages.]

2 Integrating Textual Entailment in Question Answering

In this section, we describe three different methods for integrating a textual entailment (TE) system into the architecture of an open-domain Q/A system.

Work on the semantics of questions (Groenendijk, 1999; Lewis, 1988) has argued that the formal answerhood relation found between a question and a set of (correct) answers can be cast in terms of logical entailment. Under these approaches (referred to as licensing by (Groenendijk, 1999) and aboutness by (Lewis, 1988)), p is considered to be an answer to a question ?q iff ?q logically entails the set of worlds in which p is true (i.e., ?p). While the notion of textual entailment has been defined far less rigorously than logical entailment, we believe that the recognition of textual entailment between a question and a set of candidate answers – or between a question and questions generated from answers – can enable Q/A systems to identify correct answers with greater precision than current keyword- or pattern-based techniques.

As illustrated in Figure 1, most open-domain Q/A systems consist of a sequence of three modules: (1) a question processing (QP) module; (2) a passage retrieval (PR) module; and (3) an answer processing (AP) module. Questions are first submitted to a QP module, which extracts a set of relevant keywords from the text of the question and identifies the question's expected answer type (EAT). Keywords – along with the question's EAT – are then used by a PR module to retrieve a ranked list of paragraphs which may contain answers to the question. These paragraphs are then sent to an AP module, which extracts an exact candidate answer from each passage and then ranks each candidate answer according to the likelihood that it is a correct answer to the original question.

Method 1. In Method 1, answers in the ranked list that do not meet the minimum conditions for TE are removed from consideration; the remaining answers are then re-ranked based on the entailment confidence (a real-valued number ranging from 0 to 1) assigned to each of them by the TE system. The system then outputs a new set of ranked answers which does not contain any answers that are not entailed by the user's question.
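Method 1 can be sketched as follows, assuming a hypothetical te_judge interface that returns the TE system's yes/no judgment and confidence for a question-answer pair (the paper does not specify a programmatic interface):

```python
from typing import Callable, List, Tuple

# Hypothetical TE interface (not from the paper): maps (question, answer)
# to a (judgment, confidence) pair, e.g. (True, 0.89).
TEJudge = Callable[[str, str], Tuple[bool, float]]

def method1_rerank(question: str,
                   ranked_answers: List[str],
                   te_judge: TEJudge) -> List[str]:
    """Method 1 (sketch): drop answers that are not entailed by the question,
    then re-rank the survivors by the TE system's entailment confidence."""
    entailed = []
    for answer in ranked_answers:
        judgment, confidence = te_judge(question, answer)
        if judgment:                      # keep only positively entailed answers
            entailed.append((confidence, answer))
    entailed.sort(key=lambda pair: pair[0], reverse=True)
    return [answer for _, answer in entailed]
```

An answer ranked sixth by the baseline system (like A1 in Table 1 below) can thus surface to the top once a high-confidence positive entailment is found, while a top-ranked but non-entailed answer (like A2) is discarded.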

Table 1 provides an example where Method 1 could be used to make the right prediction for a set of answers. Even though A1 was ranked in sixth position, the identification of a high-confidence positive entailment enabled it to be returned as the top answer. In contrast, the recognition of a negative entailment for A2 caused this answer to be dropped from consideration altogether.

Q1: "What did Peter Minuit buy for the equivalent of $24.00?"
A1 | Initial rank: 6th | TE judgment: YES (0.89) | Rank after Method 1: 1st
   | "Everyone knows that, back in 1626, Peter Minuit bought Manhattan from the Indians for $24 worth of trinkets."
A2 | Initial rank: 1st | TE judgment: NO (0.81) | Rank after Method 1: –
   | "In 1626, an enterprising Peter Minuit flagged down some passing locals, plied them with beads, cloth and trinkets worth an estimated $24, and walked away with the whole island."
Table 1: Re-ranking of answers by Method 1.

Method 2. Since AP is often a resource-intensive process for most Q/A systems, we expect that TE information can be used to limit the number of passages considered during AP. As illustrated in Method 2 in Figure 1, lists of passages retrieved by a PR module can be ranked or filtered using TE information. Once ranking is complete, answer extraction takes place only on the set of entailed passages that the system considers likely to contain a correct answer to the user's question.

Method 3. In previous work (Harabagiu et al., 2005b), we have described techniques that can be used to automatically generate well-formed natural language questions from the text of paragraphs retrieved by a PR module. In our current system, sets of automatically-generated questions (AGQs) are created using a stand-alone AutoQUAB generation module, which assembles question-answer pairs (known as QUABs) from the top-ranked passages returned in response to a question. Table 2 lists some of the questions that this module has produced for the question Q2: "How hot does the inside of an active volcano get?".

Q2: "How hot does the inside of an active volcano get?"
A2: "Tamagawa University volcano expert Takeyo Kosaka said lava fragments belched out of the mountain on January 31 were as hot as 300 degrees Fahrenheit. The intense heat from a second eruption on Tuesday forced rescue operations to stop after 90 minutes. Because of the high temperatures, the bodies of only five of the volcano's initial victims were retrieved."
Positive Entailment
  AGQ1: What temperature were the lava fragments belched out of the mountain on January 31?
  AGQ2: How many degrees Fahrenheit were the lava fragments belched out of the mountain on January 31?
Negative Entailment
  AGQ3: When did rescue operations have to stop?
  AGQ4: How many bodies of the volcano's initial victims were retrieved?
Table 2: TE between AGQs and user question.

Following (Groenendijk, 1999), we expect that if a question ?q logically entails another question ?q′, then some subset of the answers entailed by ?q′ should also be interpreted as valid answers to ?q. By establishing TE between a question and AGQs derived from passages identified by the Q/A system for that question, we expect we can identify a set of answer passages that contain correct answers to the original question. For example, in Table 2, we find that entailment between questions indicates the correctness of a candidate answer: here, establishing that Q2 entails AGQ1 and AGQ2 (but not AGQ3 or AGQ4) enables the system to select A2 as the correct answer.

When at least one of the AGQs generated by the AutoQUAB module is entailed by the original question, all AGQs that do not reach TE are filtered from consideration; the remaining passages are assigned an entailment confidence score and are sent to the AP module in order to provide an exact answer to the question. Following this process, candidate answers extracted by the AP module were re-associated with their AGQs and resubmitted to the TE system (as in Method 1). Question-answer pairs deemed to be positive instances of entailment were then stored in a database and used as additional training data for the AutoQUAB module. When no AGQs were found to be entailed by the original question, however, passages were ranked according to their entailment confidence and sent to AP for further processing and validation.
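The passage-selection logic of Method 3 can be sketched as follows, again with a hypothetical question-to-question entailment interface standing in for the TE system:

```python
from typing import Callable, List, Tuple

# Hypothetical question-to-question TE interface: (user question, AGQ) -> (judgment, confidence).
QQJudge = Callable[[str, str], Tuple[bool, float]]

def method3_select_passages(question: str,
                            agq_passages: List[Tuple[str, str]],
                            qq_judge: QQJudge) -> List[Tuple[float, str]]:
    """Method 3 (sketch): keep the passages whose auto-generated questions (AGQs)
    are entailed by the user's question; if no AGQ is entailed, fall back to
    ranking all passages by entailment confidence."""
    scored = [(qq_judge(question, agq), passage) for agq, passage in agq_passages]
    if any(judgment for (judgment, _), _ in scored):
        kept = [(conf, passage) for (judgment, conf), passage in scored if judgment]
    else:
        kept = [(conf, passage) for (_, conf), passage in scored]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)   # highest confidence first
```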

3 The Textual Entailment System

Processing textual entailment, or recognizing whether the information expressed in a text can be inferred from the information expressed in another text, can be performed in four ways. We can try to (1) derive linguistic information from the pair of texts and cast the inference recognition as a classification problem; (2) evaluate the probability that an entailment can exist between the two texts; (3) represent the knowledge from the pair of texts in some representation language that can be associated with an inferential mechanism; or (4) use the classical AI definition of entailment and build models of the world in which the two texts are respectively true, and then check whether the models associated with one text are included in the models associated with the other text. Although we believe that each of these methods should be investigated fully, we decided to focus only on the first method, which allowed us to build the TE system illustrated in Figure 2.

Our TE system consists of (1) a Preprocessing Module, which derives linguistic knowledge from the text pair; (2) an Alignment Module, which takes advantage of the notions of lexical alignment and textual paraphrases; and (3) a Classification Module, which uses a machine learning classifier (based on decision trees) to make an entailment judgment for each pair of texts.

[Figure 2: Textual Entailment Architecture. The diagram shows the Preprocessing module (PoS tagging, named entity recognition, syntactic and semantic dependency parsing, coreference resolution, NE aliasing, temporal normalization, and semantic/pragmatic annotation such as modality, speech act, factivity, and belief detection), the Alignment Module (lexical alignment and paraphrase acquisition from the WWW), and the Classification Module (feature extraction and a classifier trained on the training corpora), which outputs a YES/NO entailment judgment.]
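The figure's three-stage organization can be summarized with the following Python skeleton (our sketch, not code from the paper; all module internals are stubbed):

```python
from typing import Tuple

class TextualEntailmentSystem:
    """Sketch of the three-module architecture in Figure 2 (details stubbed)."""

    def preprocess(self, text: str) -> dict:
        # Parse, label named entities, resolve coreference, normalize temporal
        # expressions, and annotate polarity/tense/modality (Section 3).
        return {"text": text}

    def align(self, annotated1: dict, annotated2: dict) -> dict:
        # Estimate which constituents of the two texts carry corresponding
        # information (lexical alignment plus paraphrase acquisition).
        return {"alignments": [], "paraphrases": []}

    def classify(self, features: dict) -> Tuple[bool, float]:
        # Decision-tree classification over alignment, dependency, paraphrase,
        # and semantic/pragmatic features; returns (judgment, confidence).
        return True, 0.5

    def judge(self, text1: str, text2: str) -> Tuple[bool, float]:
        a1, a2 = self.preprocess(text1), self.preprocess(text2)
        return self.classify(self.align(a1, a2))
```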

As described in (Hickl et al., 2006), the Preprocessing module is used to syntactically parse texts, identify the semantic dependencies of predicates, label named entities, normalize temporal and spatial expressions, resolve instances of coreference, and annotate predicates with polarity, tense, and modality information.

Following preprocessing, texts are sent to an Alignment Module which uses a Maximum Entropy-based classifier in order to estimate the probability that pairs of constituents selected from the texts encode corresponding information that could be used to inform an entailment judgment. This module assumes that since sets of entailing texts necessarily predicate about the same set of individuals or events, systems should be able to identify elements from each text that convey similar types of presuppositions. Examples of predicates and arguments aligned by this module are presented in Figure 3.

[Figure 3: Alignment Graph. The graph aligns constituents of the AutoQUAB question (Pred: be temperature; answer type: What temperature; Arg1: the lava fragments; ArgM-LOC: the mountain) with those of the original question (Pred: get hot; answer type: How hot; Arg1: the inside of an active volcano; ArgM-LOC: an active volcano).]

Aligned constituents are then used to extract sets of phrase-level alternations (or "paraphrases") from the WWW that could be used to capture correspondences between texts longer than individual constituents. The top 8 candidate paraphrases for two of the aligned elements from Figure 3 are presented in Table 3.

Judgment | Paraphrase
YES | lava fragments in pyroclastic flows can reach 400 degrees
YES | an active volcano can get up to 2000 degrees
NO  | an active volcano above you are slopes of 30 degrees
YES | the active volcano with steam reaching 80 degrees
YES | lava fragments such as cinders may still be as hot as 300 degrees
NO  | lava is a liquid at high temperature: typically from 700 degrees
Table 3: Phrase-Level Alternations

Finally, the Classification Module employs a decision tree classifier in order to determine whether an entailment relationship exists for each pair of texts. This classifier is learned using features extracted from the previous modules, including features derived from (1) the (lexical) alignment of the texts, (2) syntactic and semantic dependencies discovered in each text passage, (3) paraphrases derived from web documents, and (4) semantic and pragmatic annotations. (A complete list of features can be found in Figure 4.) Based on these features, the classifier outputs both an entailment judgment (either yes or no) and a confidence value, which is used to rank answers or paragraphs in the architecture illustrated in Figure 1.

3.1 Lexical Alignment

Several approaches to the RTE task have argued that the recognition of textual entailment can be enhanced when systems are able to identify – or align – corresponding entities, predicates, or phrases found in a pair of texts. In this section, we show that by using a machine learning-based classifier which combines lexico-semantic information from a wide range of sources, we are able to accurately identify aligned constituents in pairs of texts with over 90% accuracy.

We believe the alignment of corresponding entities can be cast as a classification problem which uses lexico-semantic features in order to compute an alignment probability p(a), which corresponds to the likelihood that a term selected from one text entails a term from another text. We used constituency information from a chunk parser to decompose the pair of texts into a set of disjoint segments known as "alignable chunks". Alignable chunks from one text (C_t) and the other text (C_h) are then assembled into an alignment matrix (C_t × C_h). Each pair of chunks (p ∈ C_t × C_h) is then submitted to a Maximum Entropy-based classifier which determines whether or not the pair of chunks represents a case of lexical entailment.
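A minimal sketch of this alignment step follows, assuming a hypothetical align_prob function standing in for the trained Maximum Entropy model and a fixed decision threshold (our simplification):

```python
from itertools import product
from typing import Callable, List, Tuple

# Hypothetical trained alignment model: returns p(a) for a chunk pair.
AlignProb = Callable[[str, str], float]

def align_chunks(chunks_t: List[str],
                 chunks_h: List[str],
                 align_prob: AlignProb,
                 threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Form the alignment matrix C_t x C_h and keep the chunk pairs whose
    estimated alignment probability exceeds a threshold (a simplification of
    the Maximum Entropy classification step described above)."""
    aligned = []
    for chunk_t, chunk_h in product(chunks_t, chunks_h):
        p = align_prob(chunk_t, chunk_h)
        if p >= threshold:
            aligned.append((chunk_t, chunk_h, p))
    return sorted(aligned, key=lambda triple: triple[2], reverse=True)
```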

Three classes of features were used in the Alignment Classifier: (1) a set of statistical features (e.g., cosine similarity), (2) a set of lexico-semantic features (including WordNet Similarity (Pedersen et al., 2004), named entity class equality, and part-of-speech equality), and (3) a set of string-based features (such as Levenshtein edit distance and morphological stem equality).

As in (Hickl et al., 2006), we used a two-step approach to obtain sufficient training data for the Alignment Classifier. First, humans were tasked with annotating a total of 10,000 alignment pairs (extracted from the 2006 PASCAL Development Set) as either positive or negative instances of alignment. These annotations were then used to train a hillclimber that was used to annotate a larger set of 450,000 alignment pairs selected at random from the training corpora described in Section 3.3. These machine-annotated examples were then used to train the Maximum Entropy-based classifier that was used in our TE system. Table 4 presents results from TE's linear- and Maximum Entropy-based Alignment Classifiers on a sample of 1000 alignment pairs selected at random from the 2006 PASCAL Test Set.

Classifier      | Training Set | Precision | Recall | F-Measure
Linear          | 10K pairs    | 0.837     | 0.774  | 0.804
Maximum Entropy | 10K pairs    | 0.881     | 0.851  | 0.866
Maximum Entropy | 450K pairs   | 0.902     | 0.944  | 0.922
Table 4: Performance of Alignment Classifier

[Figure 4: Features Used in Classifying Entailment.]
ALIGNMENT FEATURES (derived from the results of the lexical alignment classification):
  1. LONGEST COMMON STRING: the longest contiguous string common to both texts.
  2. UNALIGNED CHUNK: the number of chunks in one text that are not aligned with a chunk from the other.
  3. LEXICAL ENTAILMENT PROBABILITY: as defined in (Glickman and Dagan, 2005).
DEPENDENCY FEATURES (computed from the PropBank-style annotations assigned by the semantic parser):
  1. ENTITY-ARG MATCH: a boolean feature which fires when aligned entities were assigned the same argument role label.
  2. ENTITY-NEAR-ARG MATCH: collapses the arguments Arg1 and Arg2 (as well as the ArgM subtypes) into single categories for the purpose of counting matches.
  3. PREDICATE-ARG MATCH: a boolean feature flagged when at least two aligned arguments have the same role.
  4. PREDICATE-NEAR-ARG MATCH: collapses the arguments Arg1 and Arg2 (as well as the ArgM subtypes) into single categories for the purpose of counting matches.
PARAPHRASE FEATURES (derived from the paraphrases acquired for each pair):
  1. SINGLE PATTERN MATCH: a boolean feature which fired when a paraphrase matched either of the texts.
  2. BOTH PATTERN MATCH: a boolean feature which fired when paraphrases matched both texts.
  3. CATEGORY MATCH: a boolean feature which fired when paraphrases from the same paraphrase cluster matched both texts.
SEMANTIC/PRAGMATIC FEATURES (extracted by the preprocessing module):
  1. NAMED ENTITY CLASS: has a different value for each of the 150 named entity classes.
  2. TEMPORAL NORMALIZATION: a boolean feature flagged when the temporal expressions are normalized to the same ISO 9000 equivalents.
  3. MODALITY MARKER: a boolean feature flagged when the two texts use the same modal verbs.
  4. SPEECH-ACT: a boolean feature flagged when the lexicons indicate the same speech act in both texts.
  5. FACTIVITY MARKER: a boolean feature flagged when the factivity markers indicate either TRUE or FALSE in both texts simultaneously.
  6. BELIEF MARKER: a boolean feature set when the belief markers indicate either TRUE or FALSE in both texts simultaneously.
CONTRAST FEATURES (derived from the opposing information provided by antonymy relations or chains):
  1. NUMBER OF LEXICAL ANTONYMY RELATIONS: counts the number of antonyms from WordNet discovered between the two texts.
  2. NUMBER OF ANTONYMY CHAINS: counts the number of antonymy chains discovered between the two texts.
  3. CHAIN LENGTH: a vector with the lengths of the antonymy chains discovered between the two texts.
  4. NUMBER OF GLOSSES: a vector representing the number of Gloss relations used in each antonymy chain.
  5. NUMBER OF MORPHOLOGICAL CHANGES: a vector representing the number of Morphological-Derivation relations found in each antonymy chain.
  6. NUMBER OF NODES WITH DEPENDENCIES: a vector indexing the number of nodes in each antonymy chain that contain dependency relations.
  7. TRUTH-VALUE MISMATCH: a boolean feature which fired when two aligned predicates differed in any truth value.
  8. POLARITY MISMATCH: a boolean feature which fired when predicates were assigned opposite polarity values.

3.2 Paraphrase Acquisition

Much recent work on automatic paraphrasing (Barzilay and Lee, 2003) has used relatively simple statistical techniques to identify text passages that contain the same information from parallel corpora. Since sentence-level paraphrases are generally assumed to contain information about the same event, these approaches have generally assumed that all of the available paraphrases for a given sentence will include at least one pair of entities which can be used to extract sets of paraphrases from text.

The TE system uses a similar approach to gather phrase-level alternations for each entailment pair. In our system, the two highest-confidence entity alignments returned by the Lexical Alignment module were used to construct a query which was used to retrieve the top 500 documents from Google, as well as all matching instances from our training corpora described in Section 3.3. This method did not always extract true paraphrases of either text. In order to increase the likelihood that only true paraphrases were considered as phrase-level alternations for an example, extracted sentences were clustered using the complete-link clustering technique proposed in (Barzilay and Lee, 2003).
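The clustering step can be sketched as follows; this substitutes a generic TF-IDF/cosine complete-link clustering (via scipy and scikit-learn) for the specific technique of (Barzilay and Lee, 2003), and the distance threshold is illustrative rather than taken from the paper:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_candidate_paraphrases(sentences, max_cosine_distance=0.6):
    """Complete-link clustering of the sentences retrieved for a pair of aligned
    anchors, so that only tightly-knit clusters are kept as phrase-level
    alternations.  Requires at least two sentences."""
    vectors = TfidfVectorizer().fit_transform(sentences).toarray()
    tree = linkage(vectors, method="complete", metric="cosine")
    labels = fcluster(tree, t=max_cosine_distance, criterion="distance")
    clusters = {}
    for label, sentence in zip(labels, sentences):
        clusters.setdefault(label, []).append(sentence)
    return list(clusters.values())
```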

3.3 Creating New Sources of Training Data

In order to obtain more training data for our TE system, we extracted more than 200,000 examples of textual entailment from large newswire corpora.

Positive Examples. Following an idea proposed in (Burger and Ferro, 2005), we created a corpus of approximately 101,000 textual entailment examples by pairing the headline and first sentence from newswire documents. In order to increase the likelihood of including only positive examples, pairs that did not share an entity (or an NP) in common between the headline and the first sentence were filtered out.

Judgment | Example
YES | Text-1: Sydney newspapers made a secret deal not to report on the fawning and spending during the city's successful bid for the 2000 Olympics, former Olympics Minister Bruce Baird said today.
    | Text-2: Papers Said To Protect Sydney Bid
YES | Text-1: An IOC member expelled in the Olympic bribery scandal was consistently drunk as he checked out Stockholm's bid for the 2004 Games and got so offensive that he was thrown out of a dinner party, Swedish officials said.
    | Text-2: Officials Say IOC Member Was Drunk
Table 5: Positive Examples

Negative Examples. Two approaches were used to gather negative examples for our training set. First, we extracted 98,000 pairs of sequential sentences that included mentions of the same named entity from a large newswire corpus. We also extracted 21,000 pairs of sentences linked by connectives such as even though, in contrast, and but.

Judgment | Example
NO | Text-1: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year.
   | Text-2: Irabu said he would take Wells out to dinner when the Yankees visit Toronto.
NO | Text-1: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient.
   | Text-2: In contrast, he stressed, Clean Mag has a 100 percent pollution retrieval rate, is low cost and can be recycled.
Table 6: Negative Examples
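The two harvesting heuristics can be sketched as below; the entity extractor is hypothetical (the paper does not specify one), and treating a sentence-initial connective as the link between adjacent sentences is our simplification:

```python
from typing import Callable, List, Set, Tuple

# Hypothetical entity/NP extractor; any NER or chunker could fill this role.
EntityExtractor = Callable[[str], Set[str]]

def headline_pairs(documents: List[Tuple[str, str]],
                   extract_entities: EntityExtractor) -> List[Tuple[str, str, bool]]:
    """Positive examples (sketch): pair each headline with the document's first
    sentence, keeping only pairs that share at least one entity or NP."""
    pairs = []
    for headline, first_sentence in documents:
        if extract_entities(headline) & extract_entities(first_sentence):
            pairs.append((first_sentence, headline, True))   # (text, hypothesis, label)
    return pairs

CONTRAST_CONNECTIVES = ("even though", "in contrast", "but")

def contrastive_pairs(sentences: List[str]) -> List[Tuple[str, str, bool]]:
    """Negative examples (sketch): adjacent sentences where the second opens with
    a contrastive connective (a simplification of 'linked by connectives')."""
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        if second.lower().startswith(CONTRAST_CONNECTIVES):
            pairs.append((first, second, False))
    return pairs
```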
4 Experimental Results

In this section, we describe results from four sets of experiments designed to explore how textual entailment information can be used to enhance the quality of automatic Q/A systems. We show that by incorporating features from TE into a Q/A system which employs no other form of textual inference, we can improve accuracy by more than 20% over a baseline.

We conducted our evaluations on a set of 500 factoid questions selected randomly from questions previously evaluated during the annual TREC Q/A evaluations.² Of these 500 questions, 335 (67.0%) were automatically assigned an answer type from our system's answer type hierarchy; the remaining 165 (33.0%) questions were classified as having an unknown answer type. In order to provide a baseline for our experiments, we ran a version of our Q/A system, known as FERRET (Harabagiu et al., 2005a), that does not make use of textual entailment information when identifying answers to questions. Results from this baseline are presented in Table 7.

Question Set         | Questions | Correct | Accuracy | MRR
Known Answer Types   | 335       | 107     | 32.0%    | 0.3001
Unknown Answer Types | 265       | 81      | 30.6%    | 0.2987
Table 7: Q/A Accuracy without TE

The performance of the TE system described in Section 3 was first evaluated in the 2006 PASCAL RTE Challenge. In this task, systems were tasked with determining whether the meaning of a sentence (referred to as a hypothesis) could be reasonably inferred from the meaning of another sentence (known as a text). Four types of sentence pairs were evaluated in the 2006 RTE Challenge, including pairs derived from the output of (1) automatic question answering (QA) systems, (2) information extraction (IE) systems, (3) information retrieval (IR) systems, and (4) multi-document summarization (SUM) systems. The accuracy of our TE system across these four tasks is presented in Table 8.

Task             | Development Set (800 examples) | Additional Corpora (201,000 examples)
QA-test          | 0.5750                         | 0.6950
IE-test          | 0.6450                         | 0.7300
IR-test          | 0.6200                         | 0.7450
SUM-test         | 0.7700                         | 0.8450
Overall Accuracy | 0.6525                         | 0.7538
Table 8: Accuracy on the 2006 RTE Test Set (columns correspond to the training data used)

In previous work (Hickl et al., 2006), we have found that the type and amount of training data available to our TE system significantly (p < 0.05) impacted its performance on the 2006 RTE Test Set. When our system was trained on the training corpora described in Section 3.3, the overall accuracy of the system increased by more than 10%, from 65.25% to 75.38%.

² Text Retrieval Conference (http://trec.nist.gov)

In order to provide training data that replicated the task of recognizing entailment between a question and an answer, we assembled a corpus of 5000 question-answer pairs selected from answers that our baseline Q/A system returned in response to a new set of 1000 questions selected from the TREC test sets. 2500 positive training examples were created from answers identified by human annotators to be correct answers to a question, while 2500 negative examples were created by pairing questions with incorrect answers returned by the Q/A system.

After training our TE system on this corpus, we performed the following four experiments:

Method 1. In the first experiment, the ranked lists of answers produced by the Q/A system were submitted to the TE system for validation. Under this method, answers that were not entailed by the question were removed from consideration; the top-ranked entailed answer was then returned as the system's answer to the question. Results from this method are presented in Table 9.

Method 2. In this experiment, entailment information was used to rank passages returned by the PR module. After an initial relevance ranking was determined by the PR engine, the top 50 passages were paired with the original question and submitted to the TE system. Passages were re-ranked using the entailment judgment and the entailment confidence computed for each pair and then submitted to the AP module. Features derived from the entailment confidence were then combined with the keyword- and relation-based features described in (Harabagiu et al., 2005a) in order to produce a final ranking of candidate answers. Results from this method are presented in Table 9.

Method 3. In the third experiment, TE was used to select AGQs that were entailed by the question submitted to the Q/A system. Here, AutoQUAB was used to generate questions for the top 50 candidate answers identified by the system. When at least one of the top 50 AGQs was entailed by the original question, the answer passage associated with the top-ranked entailed question was returned as the answer. When none of the top 50 AGQs were entailed by the question, question-answer pairs were re-ranked based on the entailment confidence, and the top-ranked answer was returned. Results for both of these conditions are presented in Table 9.

Hybrid Method. Finally, we found that the best results could be obtained by combining aspects of each of these three strategies. Under this approach, candidate answers were initially ranked using features derived from entailment classifications performed between (1) the original question and each candidate answer and (2) the original question and the AGQ generated from each candidate answer. Once a ranking was established, answers that were not judged to be entailed by the question were also removed from the final ranking. Results from this hybrid method are provided in Table 9.

Method    | Acc (Known EAT) | MRR (Known EAT) | Acc (Unknown EAT) | MRR (Unknown EAT)
Baseline  | 32.0%           | 0.3001          | 30.6%             | 0.2978
Method 1  | 44.1%           | 0.4114          | 39.5%             | 0.3833
Method 2  | 52.4%           | 0.5558          | 42.7%             | 0.4135
Method 3  | 41.5%           | 0.4257          | 37.5%             | 0.3575
Hybrid    | 53.9%           | 0.5640          | 41.9%             | 0.4010
Table 9: Q/A Performance with TE
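A sketch of the hybrid strategy, assuming the same hypothetical entailment interfaces as in the earlier sketches; averaging the two confidence scores stands in for the unspecified feature combination:

```python
from typing import Callable, List, Tuple

# Hypothetical TE interface: (question, text) -> (judgment, confidence).
TEJudge = Callable[[str, str], Tuple[bool, float]]

def hybrid_rank(question: str,
                candidates: List[Tuple[str, str]],
                qa_judge: TEJudge,
                qq_judge: TEJudge) -> List[str]:
    """Hybrid method (sketch): rank candidates with scores derived from both
    question-answer and question-AGQ entailment, then drop answers that are
    not entailed by the question.  Averaging the two confidences is our
    assumption; the paper does not specify how the features are combined."""
    ranked = []
    for answer, agq in candidates:              # candidates: (answer, AGQ built from it)
        qa_yes, qa_conf = qa_judge(question, answer)
        _, qq_conf = qq_judge(question, agq)
        ranked.append(((qa_conf + qq_conf) / 2.0, qa_yes, answer))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [answer for _, entailed, answer in ranked if entailed]
```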

5 Discussion

The experiments reported in this paper suggest that current TE systems may be able to provide open-domain Q/A systems with the forms of semantic inference needed to perform accurate answer validation. While probabilistic or web-based methods for answer validation have been previously explored in the literature (Magnini et al., 2002), these approaches have modeled the relationship between a question and a (correct) answer in terms of relevance and have not tried to approximate the deeper semantic phenomena that are involved in determining answerhood.

Our work suggests that considerable gains in performance can be obtained by incorporating TE during both answer processing and passage retrieval. While the best results were obtained using the Hybrid Method (which boosted performance by nearly 28% for questions with known EATs), each of the individual methods managed to boost the overall accuracy of the Q/A system by at least 7%. When TE was used to filter non-entailed answers from consideration (Method 1), the overall accuracy of the Q/A system increased by 12% over the baseline (when an EAT could be identified) and by nearly 9% (when no EAT could be identified). In contrast, when entailment information was used to rank passages and candidate answers, performance increased by 22% and 10% respectively. Somewhat smaller performance gains were achieved when TE was used to select amongst AGQs generated by our Q/A system's AutoQUAB module (Method 3). We expect that by adding features to the TE system specifically designed to account for the semantic contributions of a question's EAT, we may be able to boost the performance of this method.

6 Conclusions

In this paper, we discussed three different ways that a state-of-the-art textual entailment system could be used to enhance the performance of an open-domain Q/A system. We have shown that when textual entailment information is used to either filter or rank candidate answers returned by a Q/A system, Q/A accuracy can be improved from 32% to 52% (when an answer type can be detected) and from 30% to 40% (when no answer type can be detected). We believe that these results suggest that current supervised machine learning approaches to the recognition of textual entailment may provide open-domain Q/A systems with the inferential information needed to develop viable answer validation systems.

7 Acknowledgments

This material is based upon work funded in whole or in part by the U.S. Government, and any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government.

References

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In HLT-NAACL.
John Burger and Lisa Ferro. 2005. Generating an Entailment Corpus from News Headlines. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 49–54.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL Recognizing Textual Entailment Challenge. In Proceedings of the PASCAL Challenges Workshop.
Abdessamad Echihabi and Daniel Marcu. 2003. A noisy-channel approach to question answering. In Proceedings of the 41st Meeting of the Association for Computational Linguistics.
Oren Glickman and Ido Dagan. 2005. A Probabilistic Setting and Lexical Co-occurrence Model for Textual Entailment. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, USA.
Jeroen Groenendijk. 1999. The logic of interrogation: Classical version. In Proceedings of the Ninth Semantics and Linguistics Theory Conference (SALT IX), Ithaca, NY.
Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. 2001. The Role of Lexico-Semantic Feedback in Open-Domain Textual Question-Answering. In Proceedings of the 39th Meeting of the Association for Computational Linguistics.
S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, A. Hickl, and P. Wang. 2005a. Employing Two Question Answering Systems in TREC 2005. In Proceedings of the Fourteenth Text REtrieval Conference.
Sanda Harabagiu, Andrew Hickl, John Lehmann, and Dan Moldovan. 2005b. Experiments with Interactive Question-Answering. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).
Andrew Hickl, John Williams, Jeremy Bensley, Kirk Roberts, Bryan Rink, and Ying Shi. 2006. Recognizing Textual Entailment with LCC's Groundhog System. In Proceedings of the Second PASCAL Challenges Workshop.
David Lewis. 1988. Relevant Implication. Theoria, 54(3):161–174.
Bernardo Magnini, Matteo Negri, Roberto Prevete, and Hristo Tanev. 2002. Is it the right answer? Exploiting web redundancy for answer validation. In Proceedings of the Fortieth Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA.
Dan Moldovan, Marius Pasca, Sanda Harabagiu, and Mihai Surdeanu. 2002. Performance Issues and Error Analysis in an Open-Domain Question Answering System. In Proceedings of the 40th Meeting of the Association for Computational Linguistics.
Dan Moldovan, Christine Clark, Sanda Harabagiu, and Steve Maiorano. 2003. COGEX: A Logic Prover for Question Answering. In Proceedings of HLT/NAACL-2003.
T. Pedersen, S. Patwardhan, and J. Michelizzi. 2004. WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), San Jose, CA.
John Prager, Jennifer Chu-Carroll, and Krzysztof Czuba. 2004. Question answering using constraint satisfaction: QA-by-dossier-with-constraints. In Proceedings of ACL-2004, pages 574–581, Barcelona, Spain, July.
