
Structured Retrieval for Question Answering

Matthew W. Bilotti, Paul Ogilvie, Jamie Callan and Eric Nyberg
Language Technologies Institute, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213
{mbilotti, pto, callan, ehn}@cs.cmu.edu

ABSTRACT

Bag-of-words retrieval is popular among Question Answering (QA) system developers, but it does not support constraint checking and ranking on the linguistic and semantic information of interest to the QA system. We present an approach to retrieval for QA, applying structured retrieval techniques to the types of text annotations that QA systems use. We demonstrate that the structured approach can retrieve more relevant results, more highly ranked, compared with bag-of-words, on a sentence retrieval task. We also characterize the extent to which structured retrieval effectiveness depends on the quality of the annotations.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms

Keywords

Structured retrieval, question answering

1. INTRODUCTION

Modern Question Answering (QA) systems often represent the information required to answer a question using linguistic and semantic constraints. For example, Lin and Hovy [8] and Harabagiu et al. [6] have developed topic representations (or topic signatures) which encode not only question keywords, but also relations among them, including the syntactic relations such as subject and object that hold between verbs and their arguments. The use of these relations can help a system to recognize that a passage does not answer a given question, even if it contains the correct keywords in a minimal window (e.g., Ruby killed Oswald is not an answer to the question Who did Oswald kill?).

Bag-of-words retrieval does not support checking such relational constraints at retrieval time, so many QA systems simply query using the key terms from the question, on the assumption that any document relevant to the question must necessarily contain them. Slightly more advanced systems also index and provide query operators that match named entities, such as Person and Organization [14]. When the corpus includes many documents containing the keywords, however, the ranked list of results can contain many irrelevant documents. For example, question 1496 from TREC 2002 asks, What country is Berlin in? A typical query for this question might be just the keyword Berlin, or the keyword Berlin in proximity to a Country named entity. A query on the keyword alone may retrieve many documents, most of which may not contain the answer. A query that includes the named entity constraint is more accurate, but still matches sentences such as Berlin is near Poland.

Many QA systems adopt the approach of retrieving large sets of documents using bag-of-words retrieval with question keywords, then exhaustively post-processing to check the higher-level constraints that indicate a likely answer. One criticism of this approach is that it is difficult to know a priori how many documents to retrieve; even a very large set of results may not contain an answer if the key terms are very common. A second criticism is that it can be slow; though they aim for a 'large enough' set of documents, systems may retrieve and process more text than is really required to answer a question. Interactive systems are particularly vulnerable, as they must balance the amount of processing against the number of results processed while the user waits.

In situations where complex corpus analysis can be done off-line, another potential bottleneck arises. In a QA system, downstream processing after the retrieval stage can include answer extraction, answer merging and justification, and presentation. The processing cost scales with the number of retrieved results. Bag-of-words retrieval may retrieve a large number of documents that do not contain answers. It may therefore be necessary to process many results before a satisfactory answer is found. If documents could be ranked based on the linguistic and semantic constraints of interest to the QA system, fewer results would need to be processed downstream, potentially improving system efficiency.
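Returning to the Berlin example above, the two query variants might be written as follows (a minimal illustrative sketch in Indri-style syntax, which is introduced in Section 4.2.3; the #any:country field name is our own assumption, not one used in the experiments):

    # Illustrative only: two ways a system might query for
    # "What country is Berlin in?" (#any:country is a hypothetical
    # named-entity field; the operators are described in Section 4.2.3).

    keyword_only = "#combine[sentence]( berlin )"

    # Adds a named-entity constraint: some Country entity must occur in the
    # same sentence, but nothing relates it to Berlin structurally, so a
    # sentence like "Berlin is near Poland" still matches.
    keyword_plus_ne = "#combine[sentence]( berlin #any:country )"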


This paper explores the application of structured retrieval to the problem of retrieval for QA. Structured retrieval techniques found success in systems using Boolean logic or document structure, or some combination of the two. Examples include web document retrieval [12] and XML element retrieval [3]. This work is a logical extension of these ideas. In our approach, linguistic and semantic content of interest to the QA system is modeled as text annotations, which are represented as fields in the index. With a retrieval model that supports constraint-checking and ranking with respect to document structure and annotations as well as keywords, a QA system can directly express higher-level constraints on answer structures that can be checked at retrieval time.

We hypothesize that structured retrieval can improve retrieval performance for QA, compared to the bag-of-words approach, by retrieving more relevant documents that are more highly ranked, subject to certain environmental conditions: a) for particular types of questions; b) for particular corpora; and c) for particular sets of annotations. Structured retrieval may also boost overall system efficiency by retrieving fewer irrelevant results that must be processed.

In this paper, we present an approach to retrieval for QA based on an application of structured retrieval techniques, along with an empirical evaluation that supports our hypotheses. Section 2 gives a brief overview of related work. Section 3 discusses the retrieval needs of QA systems in more detail, and poses interesting questions about the nature of the problem of retrieval for QA. Section 4 describes a series of experiments designed to yield a better understanding of the issues raised by these questions. The results are presented and discussed in Sections 5 and 6. A summary of the contributions of the work thus far can be found in Section 7.

2. RELATED WORK

Considerable past research investigated whether concept-based, controlled vocabulary, or logical-form document representations provide better retrieval accuracy than full-text representations. Although some communities continue to study this issue, our interpretation of past research is that, in general, a combination of representations is more effective than any single representation alone [12, 15].

For many QA applications, concept-based and controlled vocabulary representations are not available. Most QA systems seek to gain some of the advantages of concept-based and logical-form representations by extending a full-text representation with text annotations that represent higher-level analysis, e.g. named entities, as in [14], or syntactic structure. Such a representation provides somewhat stronger semantics than a representation based on stemmed words.

An alternate approach is the two-step retrieval process favored by many QA systems, which involves using a sliding window passage scoring algorithm to choose the best passages from the top-n documents matching a bag-of-words query. Recent work has moved from density-based scoring, in which scores are boosted for proximate keyword occurrences [19], to scoring based on term dependency relations [4]. This work, however, still relies on an initial retrieval step; the dependency relations are used only to re-rank the top 200 documents. In contrast, the structured retrieval approach scores the entire index at query time.

3. RETRIEVAL APPROACHES FOR QUESTION ANSWERING

In many QA systems, retrieval is used as a coarse, first-pass filter to narrow the search space for answers to the question. It has been argued in the QA literature that retrieval for QA and ad-hoc retrieval are different tasks with different requirements; while many ad-hoc retrieval evaluations place roughly equal emphasis on both precision and recall, retrieval for QA should be optimized for maximum recall of relevant documents [2, 9]. This gives imperfect answer extraction technology and redundancy-based answer validation the best chance of selecting the correct answer.

In this work, we consider two QA systems based on different retrieval approaches. Each system is built on a pipelined architecture that begins with a question analysis module that analyzes input questions and formulates queries, which are later executed by the retrieval system. Retrieved text goes through downstream processing, which includes constraint checking and answer extraction. The unit of retrieval for the remainder of this paper will be individual sentences.

3.1 System A: Bag-of-Words Approach

System A uses a bag-of-words retrieval approach, similar to [14], in which the queries contain question keywords and an operator that matches named entities corresponding to the expected answer type. This approach assumes only that the corpus has been annotated with named entity occurrences, which have been indexed along with the keywords for use at query time. System A must process each retrieved sentence with a semantic parser so that it has the information required for constraint checking and answer extraction. Any sentence for which the semantic parser fails is discarded, as the system can not extract answers from it. A more interactive variant of the System A architecture uses a pre-annotated corpus to avoid on-the-fly semantic parsing, but these analyses are not used at retrieval time; constraints are still checked as a part of the downstream processing.

3.2 System B: Structured Approach

The structured retrieval approach used by System B requires that the corpus has been pre-processed with a semantic parser and a named entity tagger. The semantic analyses are represented as text annotations, which are indexed along with the named entities and the keywords. System B's question analysis module analyzes the question using the semantic parser, mapping the resulting structure into one or more posited answer-bearing structures. These structures are expressed as sets of constraints on semantic annotations and keywords, and translate directly into structured queries. System B executes separate queries for each structure and merges the lists of results. Constraint checking is performed using the retrieved sentences' pre-cached semantic analyses.

3.3 Research Questions

When thinking about how to compare Systems A and B, several interesting questions come to mind. How does the effectiveness of the structured approach compare to that of bag-of-words? Does structured retrieval effectiveness vary with question complexity? To what degree is the effectiveness of structured retrieval dependent on the quality of the annotations? The following sections present a series of experiments designed to help address these questions.

4. EXPERIMENTAL METHODOLOGY

This section describes the retrieval system, and the methodology behind the two experiments discussed in this paper.


4.1 Retrieval System

Both experiments use the Indri [18] retrieval system, a part of the open-source Lemur toolkit (see: http://www.lemurproject.org). Indri uses a variant of the Inference Network Model [20] to rank extents of text, in which belief estimates are based on statistical language models [22] instead of the Okapi ranking function. The index structures were extended to support query-time constraint-checking of hierarchical relationships among arbitrary overlapping fields, but the retrieval model is the same as in the release version.

4.2 Answer-Bearing Sentence Retrieval

We hypothesize that structured retrieval is capable of retrieving more relevant sentences at higher ranks, compared with bag-of-words. To test this hypothesis, we set up a comparison between Systems A and B, running the same set of questions over the same corpus and measuring recall of relevant sentences after the retrieval step. Note that, because the bag-of-words approach is ranking sentences, it will behave very similarly to a proximity-based retrieval function, which is a strong baseline for QA applications.

The relevant sentences in this evaluation are known as answer-bearing sentences, which are defined as completely containing the answer without requiring inference over the document. For example, consider the question, Who killed Kennedy? Even in a document describing the assassination, the sentence Oswald killed him can not be considered answer-bearing, because it requires anaphora resolution to understand that him refers to Kennedy. A sentence such as Oswald assassinated Kennedy is answer-bearing, because we do allow for synonymy. Similarly, Everest is the tallest mountain would be considered an answer-bearing sentence for the question, What is the highest point on Earth?

For a given QA system, it is reasonable to assume that there are some answer-bearing sentences that the answer extraction component can not use, perhaps because the structure is too complex. As a result, this experiment examines two extreme conditions. The single structure case holds when only one type of answer-bearing sentence is usable. In the every structure case, all answer-bearing sentences are usable. We expect that, in practice, a given system will handle some but not all types of answer-bearing sentences, and so recall will fall somewhere between the extremes reported.

4.2.1 Corpus Preparation

This experiment uses the AQUAINT corpus, which has been used in several past TREC QA track evaluations. The AQUAINT corpus contains 375 million words of newswire text in English published by the Associated Press, the New York Times and the Xinhua News Agency from 1996 through 2000 [5]. The corpus was annotated with MXTerminator [16] for sentence segmentation, BBN Identifinder [1] for named entity recognition and ASSERT [13] version 0.11c, a shallow semantic parser. ASSERT identifies verb predicate-argument structures and labels the arguments with semantic roles in the style of PropBank [7]. When indexing verb predicate-argument structures in Indri, the verb, labeled target by ASSERT, is indexed as the parent of the arguments.

4.2.2 Topics and Judgments

For this experiment, we use a set of 109 factoid questions from TREC 2002 for which exhaustive, human relevance judgments over the AQUAINT corpus are available [2, 9]. From these document-level judgments, we derived sentence-level judgments by manually identifying the answer-bearing sentences within the relevant documents. For each question, every sentence in a relevant document that matched the question's TREC-provided answer pattern (see: http://trec.nist.gov/data/qa/2002_qadata/main_task_QAdata/patterns.txt) was manually judged answer-bearing or not according to the definition. The 109 questions were randomly split into roughly equal sized training (55) and testing (54) sets, keeping the distribution of answer type constant across the two sets.

4.2.3 Query Formulation

System A's question analysis module constructs bag-of-words queries by filtering stopwords, stemming the remaining terms and adding them to a #combine[sentence] query clause. The clause also contains an #any:type operator that matches any occurrence of a named entity representing the expected answer type, when this can be determined. Figure 1 shows an example of bag-of-words query formulation.

[Figure 1: Bag-of-words query formulation example, for Question 1402: What year did Wilt Chamberlain score 100 points?
    #combine[sentence]( #any:date wilt chamberlain score 100 point ) ]

System B's question analysis module maps the input question into a set of structured queries likely to retrieve answer-bearing sentences for that question. This mapping is a three-step operation. First, the verb predicate-argument structure of the question is analyzed. The question structure is then mapped into a set of posited answer-bearing structures. This mapping can be trained using hand-labeled answer-bearing structures, as shown in Figure 2, step 1. Finally, a query is formulated from each of these posited structures.

The process for creating a structured query from a posited structure is illustrated in Figure 2, steps 2 and 3. Named entities corresponding to question keywords are converted into clauses of the form #max( #combine[type]( keywords )). The named entity expected answer type is converted into an #any:type operator. The query components are assembled recursively from the bottom-up. Arguments become clauses of the form #max( #combine[./arg]( contained clauses )), containing keywords, named entity clauses, and nested targets, if any. The ./arg extension to the query language checks that the arg annotation is a child of the annotation in the #combine that contains this query clause. Targets become clauses of the form #max( #combine[target]( keywords, clauses of child arguments )).

[Figure 2: Identification of relevant structures (step 1), and the construction of a structured query (steps 2 and 3) from a relevant structure. Example is for Question 1402: What year did Wilt Chamberlain score 100 points? The answer-bearing sentence "On March 2, 1962, Chamberlain scored a record 100 points in a game against the New York Knicks." is annotated with a TARGET (scored) having arguments ARGM-TMP (a DATE), ARG0 (a PERSON), ARG1 and ARGM-LOC (an ORGANIZATION). Step 1 locates the keywords and the answer string, step 2 converts named entities, and step 3 converts arguments, yielding the final structured query:
    #combine[sentence]( #max( #combine[target]( score #max( #combine[./argm-tmp]( #any:date )) #max( #combine[./arg0]( #max( #combine[person]( chamberlain )))) #max( #combine[./arg1]( 100 point ))))) ]
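As an illustration of the clause forms just described, the following sketch assembles the structured query of Figure 2 for Question 1402 from its posited answer-bearing structure (a minimal sketch: the helper functions and the flat representation of the structure are our own, and stopword filtering and stemming are omitted):

    # Sketch of the query-formulation rules of Section 4.2.3 (illustrative only).

    def ne_clause(ne_type, keywords):
        # Named entities tied to question keywords:
        #   #max( #combine[type]( keywords ))
        return "#max( #combine[%s]( %s ))" % (ne_type, " ".join(keywords))

    def arg_clause(role, contents):
        # Arguments wrap their contained keywords, entity clauses and nested targets:
        #   #max( #combine[./arg]( ... ))
        return "#max( #combine[./%s]( %s ))" % (role, " ".join(contents))

    def target_clause(verb_keywords, arg_clauses):
        # Targets wrap the verb keywords plus the clauses of their child arguments.
        return "#max( #combine[target]( %s %s ))" % (
            " ".join(verb_keywords), " ".join(arg_clauses))

    # Posited structure for Question 1402 (cf. Figure 2):
    #   TARGET score; ARGM-TMP = the expected answer (a DATE named entity);
    #   ARG0 = chamberlain (a PERSON named entity); ARG1 = 100 points.
    args = [
        arg_clause("argm-tmp", ["#any:date"]),
        arg_clause("arg0", [ne_clause("person", ["chamberlain"])]),
        arg_clause("arg1", ["100", "point"]),
    ]
    print("#combine[sentence]( %s )" % target_clause(["score"], args))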


4.2.4 Procedure

In the single structure case, the QA system is searching for a specific answer-bearing structure known to be usable for answer extraction, and formulates an explicit query to retrieve it. To model this situation, each answer-bearing structure posited by the question analysis module is considered a unique information need, where only sentences that match that structure are treated as relevant. There are 369 such structures in the training set, and 250 in the test set.

In the every structure case, the QA system considers every answer-bearing structure useful for a particular question. In this case, either the results or the structured queries must be merged. We tried several standard meta-search techniques including combMNZ and combSUM [17], Reciprocal Rank [12] and Round Robin [21], as well as methods of query combination using an outer #combine[sentence] or a Boolean OR operator. Round Robin was found to be the best, even when compared with combSUM and combMNZ using common score normalization approaches [11]. The cause may be that scores of results retrieved by very different queries are not always directly comparable. The experiments described in Section 5.1 will use only the Round Robin meta-search technique. To be fair, the merging technique applied to the structured queries is also applied to the keyword queries. This preserves any advantages the bag-of-words approach might receive even though, in practice, it would likely only use one query to retrieve the results.

While it is less realistic than the single structure case, the every structure experiment is valuable, as it helps characterize the performance of the structured approach in the limiting case in which question analysis will produce every possible structure prior to retrieval. These two cases form the end-points of a space of possibilities. Most QA systems will likely fall somewhere between the two extremes in terms of the number of structures generated by question analysis.

4.3 The Effect of Annotation Quality

To measure the effect of annotation quality on structured retrieval performance, we ran two separate comparisons between Systems A and B running the same task on the same corpus, with the second run using degraded, or less accurate, annotations. We measure the difference between bag-of-words and structured retrieval, with and without the degraded annotations, in terms of recall of relevant sentences.

4.3.1 Corpus Preparation

This experiment uses a portion of the Penn Treebank containing one million words of Wall Street Journal text published in 1989 [10]. The text is hand-labeled for sentence boundaries, parts-of-speech and syntactic structure. We prepared two different versions of this corpus with varying quality of the annotations. The version we refer to as WSJ GOLD contains gold-standard verb predicate-argument structures and semantic role labels for verb arguments from PropBank [7]. The other version is referred to as WSJ DEGRADED and has semantic role annotations provided by ASSERT. Although ASSERT was trained on the WSJ GOLD corpus, we have found that it is only 88.8% accurate in terms of the number of arguments correctly identified and labeled. Gold-standard syntactic analysis was not made available to ASSERT as it annotated the WSJ DEGRADED corpus.

4.3.2 Topics and Judgments

There are no sets of questions with associated relevance judgments readily available for use with the WSJ corpus, but the corpus is small enough that we were able to use the automatic procedure described in this section to generate an exhaustive list of questions answerable over the corpus, with their associated sentence-level relevance judgments.

For each sentence in the WSJ corpus, PropBank [7] contains zero or more hand-annotated predicate-argument structures, each of which contains zero or more arguments. These structures were grouped first by the predicate verb, and then by combinations of arguments shared in common. Each group is defined by a particular verb, and a particular set of arguments. A sentence containing s predicate-argument structures, where the i-th structure has a_i arguments, will appear in g groups:

    g = \sum_{i=1}^{s} \sum_{j=1}^{a_i} \binom{a_i}{j}
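As a small illustration of this count (our own sketch; the representation and helper names are invented for exposition), each structure contributes one group for every non-empty subset of its arguments:

    from itertools import combinations
    from math import comb

    def num_groups(arg_counts):
        # g = sum_i sum_{j=1..a_i} C(a_i, j), for a sentence whose i-th
        # structure has arg_counts[i] arguments.
        return sum(comb(a, j) for a in arg_counts for j in range(1, a + 1))

    def group_keys(verb, args):
        # Group keys for one structure: the verb plus every non-empty
        # combination of its (role, filler) arguments.
        items = sorted(args.items())
        for j in range(1, len(items) + 1):
            for subset in combinations(items, j):
                yield (verb, subset)

    # The two-argument structure from WSJ 0427 discussed next falls into
    # 2 + 1 = 3 groups, one per question it can answer.
    print(num_groups([2]))  # -> 3
    for key in group_keys("publish", {"Arg0": "Dow Jones",
                                      "Arg1": "The Wall Street Journal, ..."}):
        print(key)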


Consider the following predicate-argument structure from document WSJ 0427, expressed in ASSERT's notation:

[Arg0 Dow Jones] [Target publishes] [Arg1 The Wall Street Journal, Barron's magazine, other periodicals and community newspapers]

This sentence's two-argument structure puts it into three groups corresponding to the questions it can answer:

1. [Arg1 What] does [Arg0 Dow Jones] [Target publish]?
2. [Arg0 Who] [Target publishes] [Arg1 The Wall Street Journal, Barron's magazine, other periodicals and community newspapers]?
3. Does [Arg0 Dow Jones] [Target publish] [Arg1 The Wall Street Journal, Barron's magazine, other periodicals and community newspapers]?

These groups also contain other sentences that have similar structures. For example, Group 1 contains two structures also having the Target publish and the Arg0 Dow Jones. Once the grouping is complete, each group contains only and all of the sentences in the corpus containing predicate-argument structures that answer a particular question. Each group, then, constitutes exhaustive sentence-level relevance judgments for its corresponding question. Of the questions generated by this method, 96.3% had only one relevant sentence. These questions were discarded, as they would have skewed the averages, leaving 10,690 questions with associated judgments. This set of topics and judgments is known as the original judgments.

The experiment also uses a reduced judgment set, created by removing relevant sentences for which the degraded annotations do not match the gold-standard annotations from PropBank. This occurs because ASSERT can omit, mis-label or incorrectly identify the boundaries of an annotation. As an example, consider the following gold-standard PropBank structure from WSJ 0003:

[Arg1 A form of asbestos] [Argm-Tmp once] [Target used] [Arg2-Pnc to make Kent cigarette filters]

Using the degraded annotations, the structure is:

[Arg0 A form of asbestos] [Argm-Tmp once] [Target used] to make Kent cigarette filters

ASSERT has fundamentally altered the meaning of the annotation by considering a form of asbestos to be the Arg0, or agent, of the verb used as opposed to the Arg1, or patient. ASSERT has also missed the Arg2-Pnc (purpose, not cause). When ASSERT omits or mis-labels an argument required by a particular question, the sentence that contains the error is removed from the reduced set of judgments. Errors in bounding are tolerated if the gold-standard and degraded versions of the annotation overlap. Otherwise, it is considered an omission. In the reduced judgments, 802 of the 10,690 topics have no relevant sentences at all.

The reduced judgments model the reality of a QA system, in which the relevance of a sentence can not be determined if the annotations are missing or incorrect. The QA system can not use a sentence for which the analysis fails. Even if it contains an answer, the system can not find it, because constraint checking and answer extraction both depend on the analysis.

4.3.3 Query Formulation

Query formulation for both bag-of-words and structured retrieval follows the procedure described in Section 4.2.3.

4.3.4 Procedure

Systems A and B are each run twice on the set of 10,690 questions. The first run uses the WSJ GOLD corpus and the original judgments, while the second run uses the WSJ DEGRADED corpus and the reduced judgments. The accuracy of the bag-of-words retrieval approach is not affected by degrading the annotations, since it completely ignores them when ranking results. System A, however, has no use for sentences that can not be analyzed with ASSERT, as answers can not be extracted from them. The reduced judgments model this situation by not assigning relevance to sentences for which the ASSERT annotations are missing or malformed in the WSJ DEGRADED corpus.

5. EXPERIMENTAL RESULTS

In this section, the experimental results comparing bag-of-words and structured retrieval, and measuring the effect of degraded annotations on structured retrieval, are presented.

5.1 Answer-Bearing Sentence Retrieval

[Figure 3: Recall of answer-bearing sentences for both bag-of-words and structured retrieval approaches on the AQUAINT corpus. Training topics are above and test topics are below. Four panels plot Sentence Recall against Rank (0-1000) for the Bag-of-words, Structured and Difference curves: (a) training / single structure extreme, (b) training / every structure extreme, (c) test / single structure extreme, (d) test / every structure extreme.]

This section describes the experimental results comparing the bag-of-words and structured retrieval approaches, in terms of recall of answer-bearing sentences, for both the single structure and every structure experimental conditions. All of the experimental runs described in this section use the Jelinek-Mercer smoothing method [22] in Indri, where the weights on the collection, document and sentence language models are 0.2, 0.2 and 0.6, respectively. These weights were chosen to optimize recall at rank 1000 for both approaches on the training topics. Smoothing using Dirichlet priors [22] was considered, and found to be sub-optimal.

Figure 3 shows the recall of the two approaches on the AQUAINT corpus. We see from Figures 3a and 3c that, for the single structure case, the structured retrieval approach has much higher recall than bag-of-words. For example, at rank 200, structured retrieval provides a 96.9% and 46.6% improvement over the bag-of-words approach on the training and test topics, respectively. Figures 3b and 3d compare the two approaches for the every structure case. The advantage that structured retrieval enjoys over bag-of-words is less pronounced, but the improvement at rank 200 is still 12.8% and 11.4% for the training and test topics, respectively.
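The Jelinek-Mercer configuration above (weights 0.2, 0.2 and 0.6) corresponds to estimating each smoothed sentence language model as a three-way linear interpolation of maximum-likelihood models at the sentence (S), document (D) and collection (C) levels (a sketch of the standard Jelinek-Mercer mixture with these weights; Indri's internal parameterization may differ in detail):

    P(w \mid S) \;=\; 0.6\,P_{\mathrm{ML}}(w \mid S) \;+\; 0.2\,P_{\mathrm{ML}}(w \mid D) \;+\; 0.2\,P_{\mathrm{ML}}(w \mid C)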


[Figure 4: Recall at rank 200, varying structural complexity of answer-bearing sentences. Results are shown for the single structure case, where a separate query is formulated for each answer-bearing structure for each question. Structural complexity is estimated by counting the number of #combine operators in the structured query. Points show estimated average recall, with error bars covering a 95% confidence interval. The number of queries used for an estimate are listed along the top of the plot. Panel (a) plots training queries (query counts 21, 49, 50, 46, 66, 40, 28, 27, 20, 9, 11 for 0-10 structural constraints) and panel (b) plots test queries (counts 11, 33, 54, 40, 44, 21, 19, 10, 8, 3, 1); both plot Sentence Recall at 200 for the Keyword+NE and Structured runs against the number of structural constraints.]

Figure 3 shows that structured retrieval has superior recall of answer-bearing sentences on average, compared to the bag-of-words approach, but tells little about the types of queries for which structured retrieval is most helpful. In an attempt to isolate the conditions where structure is important, Figure 4 shows the average recall at rank 200 for queries used in the single structure experiment having between zero and ten structural constraints. The number of structural constraints is defined as the number of #combine operators in the query, not counting the outer #combine[sentence] operator. The confidence intervals were constructed by simulating the distribution of average recall by averaging samples from a beta distribution fitted to the observed recall values using the method of moments estimator.

For both the training and testing sets, it appears that the more complex the structure sought, the more useful knowledge of that structure is in ranking answer-bearing sentences. Figure 4a also suggests that the performance of the bag-of-words approach is largely unaffected by the complexity of the structure containing the keywords in the answer-bearing sentences. The results are less clear on the test topics, shown in Figure 4b. One factor is that there are fewer queries in the test set, especially for queries where the structure of the answer-bearing sentences is quite complex. This necessarily widens the confidence intervals of our estimates on the right side of the plot. Despite the difficulties of the test set, the structured run consistently performs better than the bag-of-words run; in many cases there is little or no overlap of the confidence intervals on our estimates.

5.2 The Effect of Annotation Quality

This section describes the results of the annotation quality experiment, comparing the structured and bag-of-words retrieval approaches, with and without degraded annotations. All experimental runs described in this section used the optimized smoothing parameters detailed in Section 5.1.

Figures 5a and 5b demonstrate that the structured queries are quite adept at retrieving most of the relevant items at early ranks, and that the bag-of-words queries take much longer to achieve a similar level of recall, at ranks beyond the horizon of the plot. For a QA system that needs to achieve a recall level of, say, 0.90, perhaps for an answer validation scheme based on redundancy, the system could process the top few results of a structured query, or process all the way down to rank 50 for a bag-of-words query. For situations in which post-processing of results is expensive, it may be more advantageous to choose structured retrieval.

The difference between Figures 5a and 5b is subtle and lies in the scale of the vertical axes. The curves in Figure 5b are similar to those in Figure 5a, but are shifted down slightly. Despite this shift, the relative difference between the structured and bag-of-words approaches remains almost the same. This trend is most easily seen in Figure 5c, which plots the difference curves from both Figures 5a and 5b on a common set of axes. This shows that the performance advantage that structured retrieval holds over bag-of-words is stable when the annotations are degraded, within the context of a QA system that considers a sentence relevant if and only if it is both answer-bearing and analyzable.

6. DISCUSSION

In the previous sections, we performed experiments that compare System A's bag-of-words retrieval approach and the structured retrieval approach used by System B. We wanted to understand if retrieval accuracy can be improved by using structured retrieval and, if so, how the benefit depends on the complexity of the query structures and the quality of the annotations. Structured retrieval clearly performs well at its intended purpose: leveraging document annotations, where available, to retrieve sentences satisfying certain constraints, and backing off gracefully to a keyword match where no annotations are available. Results in Section 5.1 have demonstrated that structured retrieval yields an improvement in recall of answer-bearing sentences, compared with bag-of-words retrieval. Structured retrieval performs best when query structures anticipate answer-bearing structures, and when these structures are complex.


[Figure 5 panels: (a) WSJ GOLD with Original Judgments and (b) WSJ DEGRADED with Reduced Judgments plot Sentence Recall against Rank (0-100) for the Bag-of-words, Structured and Difference curves; (c) plots the Difference in Recall against Rank for the GOLD/Original and DEGRADED/Reduced runs.]

Figure 5: Average recall through rank 100, showing the difference between the structured and bag-of-words retrieval approaches on (a) WSJ GOLD with original judgments and (b) WSJ DEGRADED with reduced judgments. Both difference curves are shown in (c).

One might have expected structured retrieval to be much better than bag-of-words for certain questions such as, What country is Berlin in? or Who did Oswald kill?. Questions such as these contain very common keywords that co-occur frequently. As a result, the keyword query should retrieve large numbers of documents that contain the correct terms, but not the structure that answers the question. We did not observe this in our experiments. When we examined the reasons why, we discovered a hidden bias in our question set; it contains none of this type of question. To make the human assessment task tractable, questions such as What country is Berlin in? were intentionally excluded from consideration. This unfortunately biases the results in a way that [...] as bag-of-words. For a question with the average number of 6.22 structures, structured retrieval is 12.28 times as slow as bag-of-words. The structured approach, however, can offer an improvement in the sum total of retrieval time and answer extraction time. Extraction time is reduced because question analysis only posits answer-bearing structures that are easy to extract answers from. For the structures described in Section 4.2.3, answer extraction is a constant-time operation that involves plucking the value of the appropriate argument or named entity field from the structure.

The developer of a QA system could analyze the costs of the two retrieval approaches to decide which would achieve better run-time performance. The following example analysis uses estimates from the retrieval approaches, queries, and system used for the experiments in Section 5.1.

In this analysis we assume that the experimenter sets a desired average recall level l to be achieved by the retrieval system. Let C_s(l) be the cost of retrieval and answer extraction of the structured retrieval approach and C_b(l) be the cost for the bag-of-words approach. We would then prefer to use the structured retrieval approach if

    C_s(l) = R_s + E \cdot n_s(l) \;<\; R_b + E \cdot n_b(l) = C_b(l),

where R_s and R_b are the average query execution times, n_s(l) and n_b(l) are the numbers of sentences that must be retrieved to reach recall l, and E is the average cost of answer extraction per retrieved sentence. The variables R_s, n_b(l), and n_s(l) are dependent on the nature of the question answering system, the corpus, and the questions. In this section, we assume that R_b does not depend on the number of candidate answer structures, even though we allowed multiple query variants in Section 5.1. This more accurately models the costs of bag-of-words retrieval, which would likely use only one query. We now consider specific observed experimental values for a desired recall level of l = 0.75.

Every structure case: On the test topics, assuming all answer-bearing sentences for a question are relevant, n_b(l) = 393, n_s(l) = 209, R_s = 174 seconds and R_b = 15 seconds, on average. It would therefore be more efficient to use the structured retrieval approach when the cost of extraction E exceeds 0.87 seconds per retrieved sentence, on average.

Single structure case: On the test topics, assuming only sentences matching a specific answer-bearing structure are relevant, n_b(l) = 787, n_s(l) = 107, R_s = 30 seconds and R_b = 15 seconds, on average. As a result, it would be more efficient to use the structured retrieval approach when E > 0.022 seconds per retrieved sentence, on average.
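The break-even extraction cost implied by this model can be computed directly (a small sketch; it reproduces the reported thresholds up to rounding of the published averages):

    # Structured retrieval is cheaper when
    #   R_s + E * n_s(l) < R_b + E * n_b(l),
    # i.e. when E > (R_s - R_b) / (n_b(l) - n_s(l)).

    def break_even_extraction_cost(R_s, R_b, n_s, n_b):
        return (R_s - R_b) / (n_b - n_s)

    # Every structure case (test topics, l = 0.75):
    print(break_even_extraction_cost(R_s=174, R_b=15, n_s=209, n_b=393))  # ~0.86 s (reported as 0.87)
    # Single structure case (test topics, l = 0.75):
    print(break_even_extraction_cost(R_s=30, R_b=15, n_s=107, n_b=787))   # ~0.022 s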


This analysis is simplified for the sake of this paper, and other researchers would have to consider the costs using their own systems before making a decision. Factors affecting E include any processing the QA system may do downstream of retrieval. If the corpus has not been pre-annotated, the time required to analyze results on-the-fly will be added to E. Additional downstream processing that may affect E includes answer merging, validation and presentation.

7. CONTRIBUTIONS

This paper presents an approach to retrieval for Question Answering that directly supports indexing and retrieval on the kind of linguistic and semantic constraints that a QA system needs to determine relevance of a retrieved document to a particular natural language input question. Our approach to structured retrieval for QA works by encoding this linguistic and semantic content as annotations on text, and by using a retrieval model that directly supports constraint-checking and ranking with respect to document structure and annotations in addition to keywords.

We present an evaluation that shows that, when a QA system is able to posit the likely answer-bearing structures it is looking for within the text collection, structured queries retrieve more relevant results, more highly ranked, when compared to traditional bag-of-words retrieval. In addition to providing a higher quality ranked list of results to pass to downstream modules, structured retrieval can also reduce the sum total of retrieval and extraction time. Although a structured query is slower to execute than a bag-of-words query, the system does not have to parse the retrieved results or process as far down the ranked list as it would have to if it were working with less relevant results obtained from a bag-of-words query. Interactive systems can respond more quickly using structured retrieval, as they would not have to parse texts at retrieval time while the user waits.

We observe that structured retrieval can be more effective than bag-of-words, but we have not spent much effort optimizing data structures and algorithms. With current tools, a benefit is realized at 200 sentences, which is reasonable for today's QA systems. Current QA systems do not process thousands of results, preferring instead to find a "sweet spot" of enough results to achieve sufficient recall, but not so much that they can not be post-processed. With more efficient retrieval algorithms and indexing structures, we may see a benefit for a broader range of conditions than currently observed in this paper.

We characterize the extent to which structured retrieval performance depends on the quality of the predicate-argument annotations, leaving named entity annotations for future work. Although accuracy degrades when the annotation quality degrades, the relative performance edge that structured retrieval enjoys over bag-of-words retrieval is maintained. The reason for this is that the QA system can neither determine relevance nor extract answers from text that can not be properly annotated by the available tools. This is true whether the text is annotated ahead of time, as in the structured retrieval approach, or on-the-fly, after having been retrieved by a bag-of-words query.

If the structure of a sentence impacts the relevance of that sentence, it is crucial to index and query on that structure. Within the context of a QA system, it is reasonable to imagine that the process of question analysis could yield a small set of candidate answer-bearing structures that would match common or likely ways of expressing the answer in the corpus. By choosing the structured retrieval approach instead of bag-of-words, a QA system can improve recall of relevant sentences, which can translate to improved end-to-end QA system accuracy and efficiency.

8. REFERENCES

[1] D. Bikel, et al. An algorithm that learns what's in a name. ML, 34(1-3):211–231, 1999.
[2] M. Bilotti, et al. What works better for question answering: Stemming or morphological query expansion? In Proc. of IR4QA at SIGIR, 2004.
[3] D. Carmel, et al. Searching XML documents via XML fragments. In Proc. of SIGIR, 2003.
[4] H. Cui, et al. Question answering passage retrieval using dependency relations. In Proc. of SIGIR, 2005.
[5] D. Graff. The AQUAINT Corpus of English News Text. LDC, 2002. Cat. No. LDC2002T31.
[6] S. Harabagiu, et al. Employing two question answering systems in TREC-2005. In Proc. of TREC-14, 2005.
[7] P. Kingsbury, et al. Adding semantic annotation to the Penn Treebank. In Proc. of HLT, 2002.
[8] C. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In Proc. of COLING, 2000.
[9] J. Lin and B. Katz. Building a reusable test collection for question answering. JASIST, 57(7):851–861, 2006.
[10] M. Marcus, et al. Building a large annotated corpus of English: the Penn Treebank. CL, 19(2):313–330, 1993.
[11] M. Montague and J. Aslam. Relevance score normalization for metasearch. In Proc. of CIKM, 2001.
[12] P. Ogilvie and J. Callan. Combining document representations for known-item search. In Proc. of SIGIR, 2003.
[13] S. Pradhan, et al. Shallow semantic parsing using support vector machines. In Proc. of HLT, 2004.
[14] J. Prager, et al. Question-answering by predictive annotation. In Proc. of SIGIR, 2000.
[15] T. Rajashekar and W. B. Croft. Combining automatic and manual index representations in probabilistic retrieval. JASIS, 46(4):272–283, 1995.
[16] J. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proc. of ANLP, 1997.
[17] J. Shaw and E. Fox. Combination of multiple searches. In Proc. of TREC-3, 1994.
[18] T. Strohman, et al. Indri: A language model-based search engine for complex queries. In Proc. of ICIA, 2005.
[19] S. Tellex, et al. Quantitative evaluation of passage retrieval algorithms for question answering. In Proc. of SIGIR, 2003.
[20] H. Turtle and W. Croft. Evaluation of an inference network-based retrieval model. ACM TOIS, 9(3):187–222, 1991.
[21] E. Voorhees, et al. The collection fusion problem. In Proc. of TREC-3, 1994.
[22] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proc. of SIGIR, 2001.
