Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 257-264. Association for Computational Linguistics.

An Analysis of the AskMSR Question-Answering System

Eric Brill, Susan Dumais and Michele Banko
Microsoft Research
One Microsoft Way
Redmond, WA 98052
{brill,sdumais,mbanko}@microsoft.com

Abstract

We describe the architecture of the AskMSR question answering system and systematically evaluate contributions of different system components to accuracy. The system differs from most question answering systems in its dependency on data redundancy rather than sophisticated linguistic analyses of either questions or candidate answers. Because a wrong answer is often worse than no answer, we also explore strategies for predicting when the question answering system is likely to give an incorrect answer.

1 Introduction

Question answering has recently received attention from the information retrieval, information extraction, machine learning, and natural language processing communities (AAAI, 2002; ACL-ECL, 2002; Voorhees and Harman, 2000, 2001). The goal of a question answering system is to retrieve answers to questions rather than full documents or best-matching passages, as most information retrieval systems currently do. The TREC Question Answering Track, which has motivated much of the recent work in the field, focuses on fact-based, short-answer questions such as "Who killed Abraham Lincoln?" or "How tall is Mount Everest?" In this paper we describe our approach to short answer tasks like these, although the techniques we propose are more broadly applicable.

Most question answering systems use a variety of linguistic resources to help in understanding the user's query and matching sections in documents. The most common linguistic resources include: part-of-speech tagging, parsing, named entity extraction, semantic relations, dictionaries, WordNet, etc. (e.g., Abney et al., 2000; Chen et al., 2000; Harabagiu et al., 2000; Hovy et al., 2000; Pasca et al., 2001; Prager et al., 2000). We chose instead to focus on the Web as a gigantic data repository with tremendous redundancy that can be exploited for question answering. We view our approach as complementary to more linguistic approaches, but have chosen to see how far we can get initially by focusing on data per se as a key resource available to drive our system design. Recently, other researchers have also looked to the web as a resource for question answering (Buchholtz, 2001; Clarke et al., 2001; Kwok et al., 2001). These systems typically perform complex parsing and entity extraction for both queries and best matching Web pages, and maintain local caches of pages or term weights. Our approach is distinguished from these in its simplicity and efficiency in the use of the Web as a large data resource.

Automatic QA from a single, small information source is extremely challenging, since there is likely to be only one answer in the source to any user's question. Given a source, such as the TREC corpus, that contains only a relatively small number of formulations of answers to a query, we may be faced with the difficult task of mapping questions to answers by way of uncovering complex lexical, syntactic, or semantic relationships between question string and answer string. The need for anaphor resolution and synonymy, the presence of alternate syntactic formulations, and indirect answers all make answer finding a potentially challenging task.
However, the greater the answer redundancy in the source data collection, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore the less likely it is that we will need to solve the aforementioned difficulties facing natural language processing systems.

In this paper, we describe the architecture of the AskMSR Question Answering System and evaluate contributions of different system components to accuracy. Because a wrong answer is often worse than no answer, we also explore strategies for predicting when the question answering system is likely to give an incorrect answer.

2 System Architecture

As shown in Figure 1, the architecture of our system can be described by four main steps: query reformulation, n-gram mining, filtering, and n-gram tiling. In the remainder of this section, we will briefly describe these components. A more detailed description can be found in [Brill et al., 2001].

Figure 1. System Architecture. (The figure traces an example: the question "Where is the Louvre Museum located?" is rewritten as "+the Louvre Museum +is located", "+the Louvre Museum +is +in", "+the Louvre Museum +is near", "+the Louvre Museum +is", and the backoff query Louvre AND Museum AND near; summaries are collected, n-grams are mined, filtered, and tiled, yielding n-best answers such as "in Paris France" 59%, "museums" 12%, "hostels" 10%.)

2.1 Query Reformulation

Given a question, the system generates a number of weighted rewrite strings which are likely substrings of declarative answers to the question. For example, "When was the paper clip invented?" is rewritten as "The paper clip was invented". We then look through the collection of documents in search of such patterns. Since many of these string rewrites will result in no matching documents, we also produce less precise rewrites that have a much greater chance of finding matches. For each query, we generate a rewrite which is a backoff to a simple ANDing of all of the non-stop words in the query.

The rewrites generated by our system are simple string-based manipulations. We do not use a parser or part-of-speech tagger for query reformulation, but do use a lexicon for a small percentage of rewrites, in order to determine the possible parts-of-speech of a word as well as its morphological variants. Although we created the rewrite rules and associated weights manually for the current system, it may be possible to learn query-to-answer reformulations and their weights (e.g., Agichtein et al., 2001; Radev et al., 2001).

2.2 N-Gram Mining

Once the set of query reformulations has been generated, each rewrite is formulated as a search engine query and sent to a search engine from which page summaries are collected and analyzed. From the page summaries returned by the search engine, n-grams are collected as possible answers to the question. For reasons of efficiency, we use only the page summaries returned by the engine and not the full text of the corresponding web page.

The returned summaries contain the query terms, usually with a few words of surrounding context. The summary text is processed in accordance with the patterns specified by the rewrites. Unigrams, bigrams and trigrams are extracted and subsequently scored according to the weight of the query rewrite that retrieved them. These scores are summed across all summaries containing the n-gram (which is the opposite of the usual inverse document frequency component of document/passage ranking schemes). We do not count frequency of occurrence within a summary (the usual tf component in ranking schemes). Thus, the final score for an n-gram is based on the weights associated with the rewrite rules that generated it and the number of unique summaries in which it occurred.
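The rewrite and mining steps can be pictured with a small sketch. The code below is an illustrative simplification rather than the AskMSR implementation: the single "When was X invented?" template, its weights, the stop-word list, and the search_summaries helper (standing in for the search-engine backend) are all hypothetical.

import re
from collections import defaultdict

# Hypothetical stop-word list; the system's actual list is not given here.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "in", "to",
              "when", "where", "who", "what", "how"}

def generate_rewrites(question):
    """Return (query string, weight) pairs for a question."""
    rewrites = []
    m = re.match(r"when was (?P<np>.+) invented\?$", question.lower())
    if m:
        np = m.group("np")
        # Precise rewrites: likely substrings of declarative answers.
        rewrites.append(('"%s was invented"' % np, 5.0))
        rewrites.append(('"%s was invented in"' % np, 5.0))
    # Backoff rewrite: a low-weight AND of the non-stop words in the query.
    content = [w for w in re.findall(r"\w+", question.lower())
               if w not in STOP_WORDS]
    rewrites.append((" AND ".join(content), 1.0))
    return rewrites

def mine_ngrams(rewrites, search_summaries, max_n=3):
    """Collect unigrams, bigrams and trigrams from page summaries.

    Each n-gram is credited, once per summary that contains it, with the
    weight of the rewrite that retrieved that summary; frequency within a
    single summary is deliberately ignored.  search_summaries(query) is an
    assumed helper that returns a list of summary strings for a query.
    """
    scores = defaultdict(float)
    for query, weight in rewrites:
        for summary in search_summaries(query):
            tokens = re.findall(r"\w+", summary.lower())
            seen = set()
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    seen.add(" ".join(tokens[i:i + n]))
            for ngram in seen:  # one credit per summary
                scores[ngram] += weight
    return scores

Sorting the resulting score table and keeping the top candidates yields the input to the filtering and tiling stages described next.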
2.3 N-Gram Filtering

Next, the n-grams are filtered and reweighted according to how well each candidate matches the expected answer type, as specified by a handful of handwritten filters. The system uses filtering in the following manner. First, the query is analyzed and assigned one of seven question types, such as who-question, what-question, or how-many-question. Based on the query type that has been assigned, the system determines what collection of filters to apply to the set of potential answers found during the collection of n-grams. The candidate n-grams are analyzed for features relevant to the filters, and then rescored according to the presence of such information.

A collection of 15 simple filters was developed based on human knowledge about question types and the domain from which their answers can be drawn. These filters used surface string features, such as capitalization or the presence of digits, and consisted of handcrafted regular expression patterns.

2.4 N-Gram Tiling

Finally, we applied an answer tiling algorithm, which both merges similar answers and assembles longer answers out of answer fragments.

3 Experiments

For experimental evaluations we used the first 500 TREC-9 queries (201-700) (Voorhees and Harman, 2000). We used the patterns provided by NIST for automatic scoring. A few patterns were slightly modified to accommodate the fact that some of the answer strings returned using the Web were not available for judging in TREC-9. We did this in a very conservative manner, allowing for more specific correct answers (e.g., Edward J. Smith vs. Edward Smith) but not more general ones (e.g., Smith vs. Edward Smith), and also allowing for simple substitutions (e.g., 9 months vs. nine months). There are also substantial time differences between the Web and TREC databases (e.g., the correct answer to "Who is the president of Bolivia?" changes over time), but we did not modify the answer key to accommodate these time differences, because doing so would make comparison with earlier TREC results impossible. These changes influence the absolute scores somewhat but do not change relative performance, which is our focus here.

All runs are completely automatic, starting with queries and generating a ranked list of 5 candidate answers. For the experiments reported in this paper we used Google as a backend because it provides query-relevant summaries that make our n-gram mining efficient. Candidate answers are a maximum of 50 bytes long, and typically much shorter than that. We report the Mean Reciprocal Rank (MRR) of the first correct answer, the Number of Questions Correctly Answered (NAns), and the proportion of Questions Correctly Answered (%Ans).
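To make the reported measures concrete, the sketch below (an illustration, not part of the evaluation tooling used in the paper) computes MRR, NAns and %Ans from ranked candidate lists and NIST-style answer patterns, both of which are assumed inputs here:

import re

def score_run(runs):
    """Compute MRR, NAns and %Ans over a set of questions.

    runs is a list of (candidates, answer_pattern) pairs: candidates is the
    ranked list of (at most five) answer strings for one question, and
    answer_pattern is a regular expression from the answer key.
    """
    reciprocal_ranks = []
    for candidates, pattern in runs:
        rr = 0.0
        for rank, candidate in enumerate(candidates, start=1):
            if re.search(pattern, candidate, re.IGNORECASE):
                rr = 1.0 / rank  # reciprocal rank of the first correct answer
                break
        reciprocal_ranks.append(rr)
    n_answered = sum(1 for rr in reciprocal_ranks if rr > 0)
    return {"MRR": sum(reciprocal_ranks) / len(reciprocal_ranks),
            "NAns": n_answered,
            "%Ans": n_answered / len(reciprocal_ranks)}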