Mining Web Snippets to Answer List Questions
Total Page:16
File Type:pdf, Size:1020Kb
Mining Web Snippets to Answer List Questions Alejandro Figueroa G¨unter Neumann Deutsches Forschungszentrum f¨urK¨unstliche Intelligenz - DFKI, Stuhlsatzenhausweg 3, D - 66123, Saarbr¨ucken, Germany Email: figueroa neumann @dfki.de f j g Abstract ally, QAS tackle list questions by making use of pre- compiled, often manually checked, lists (i. e. famous This paper presents ListWebQA, a question answer- persons and countries) and online encyclopedias, like ing system that is aimed specifically at extracting an- Wikipedia and Encarta, but with moderate success. swers to list questions exclusively from web snippets. Research has been hence conducted towards exploit- Answers are identified in web snippets by means of ing full web documents, especially their lists and ta- their semantic and syntactic similarities. Initial re- bles. sults show that they are a promising source of answers This paper presents our research in progress to list questions. (\Greenhouse work") into list question answering on the web. Specifically, it presents ListWebQA, our list Keywords: Web Mining, Question Answering, List question answering system that is aimed at extract- Questions, Distinct Answers. ing answers to list questions directly from the brief descriptions of web-sites returned by search engines, 1 Introduction called web snippets. ListWebQA is an extension of our current web question answering system1, which is In recent years, search engines have markedly im- aimed essentially at mining web snippets for discover- proved their power of indexing, provoked by the sharp ing answers to natural language questions, including increase in the number of documents published on factoid and definition questions (Figueroa and Atkin- the Internet, in particular, HTML pages. The great son 2006, Figueroa and Neumann 2006, 2007). success of search engines in linking users to nearly The motivation behind the use of web snippets as all the sources that satisfy their information needs a source of answers is three-fold: (a) to avoid, when- has caused an explosive growth in their number, and ever possible, the costly retrieval and processing of analogously, in their demands for smarter ways of full documents, (b) to the user, web snippets are the searching and presenting the requested information. first view of the response, thus highlighting answers Nowadays, one of these increasing demands is find- would make them more informative, and (c) answers ing answers to natural language questions. Most of taken from snippets can be useful for determining the the research into this area has been carried out under most promising documents, that is, where most of an- the umbrella of Question Answering Systems (QAS), swers are likely to be. An additional strong motiva- especially in the context of the Question Answering tion is, the absence of answers across retrieved web track of the Text REtrieval Conference (TREC). snippets can force QAS a change in its search strat- In TREC, QAS are encouraged to answer several egy or a request for additional feedback from the user. kinds of questions, whose difficulty has been system- On the whole, exploiting snippets for list question an- atically increasing during the years. In 2001, TREC swering is a key research topic of QAS. incorporated list questions, such as \What are 9 nov- The roadmap of this paper is as follows: section els written by John Updike?" and \Name 8 Chuck 2 deals at greater length with the related work. Sec- Berry songs", into the question answering track. Sim- tion 3 describes ListWebQA in detail, section 4 shows ply stated, answering this sort of question consists in current results, and section 5 draws preliminary con- discovering a set of different answers in only one or clusions. across several documents. QAS must therefore, effi- ciently process a wealth of documents, and identify as 2 Related Work well as remove redundant responses in order to satis- factorily answer the question. In the context of TREC, many methods have been Modest results obtained by QAS in TREC show explored by QAS in order to discover answers to list that dealing with this kind of question is particu- questions across the target collection of documents larly difficult (Voorhees 2001, 2002, 2003, 2004), mak- 2 ing the research in this area very challenging. Usu- (the AQUAINT corpus). QAS usually start by dis- tinguishing the \focus" of the query, the most de- scriptive noun phrase of the expected answer type The work presented here was partially supported by a research (Katz et al. 2003). The focus associates the question grant from the German Federal Ministry of Education, Science, with its answer type, and hence answering depends Research and Technology (BMBF) to the DFKI project HyLaP (FKZ: 01 IW F02) and the EC-funded project QALL-ME. largely upon its correct identification. To illustrate, the focus of the query \Name 6 comets" is the plural noun \comets", and QAS will then only pay atten- Copyright c 2007, Australian Computer Society, Inc. This pa- tion to names of comets during the search. For the per appeared at the Second Workshop on Integrating AI and Data Mining (AIDM 2007), Gold Coast, Australia. Confer- purpose of finding right answers, some QAS take into ences in Research and Practice in Information Technology (CR- 1ListWebQA is part of our sustained efforts to implement a public PIT), Vol. 84, Kok-Leong Ong, Junbin Gao and Wenyuan Li, TREC-oriented QAS on web snippets. Our system is available at Ed. Reproduction for academic, not-for profit purposes per- http://experimental-quetal.dfki.de/. mitted provided this text is included. 2http://www.ldc.upenn.edu/Catalog/byType.jsp account pre-defined lists of instances of several foci. based on co-occurrence across a set of downloaded For example, (Katz et al. 2004) accounted for a list documents. They showed that finding the precise of 7800 famous people extracted from biography.com. correspondence between lists elements and the right They additionally increased their 150 pre-defined and hypernym is a difficult task. In addition, many hy- manually compiled lists used in TREC 2003, to 3300 ponyms or answers to list questions cannot be found in TREC 2004 (Katz et al. 2003). These lists were in lists or tables, which are not necessarily complete, semi-automatically extracted from WorldBook Ency- specially in online encyclopedias. QAS are, therefore clopedia articles by searching for hyponomyns. In forced to search along the whole text or across several TREC 2005, (Katz et al. 2005) generated these lists documents in order to discover all answers. To illus- off-line by means of subtitles and link structures pro- trate, two good examples in Wikipedia, at the time of vided by Wikipedia. This strategy involved process- writing, are the TREC questions \Who were 6 actors ing a whole document and its related documents. The who have played Tevye in Fiddler on the Roof?" and manual annotation consisted in adding synonymous \What are 12 types of clams?". noun phrases that could be used to ask about the list. (Yang and Chua 2004c) also exploited lists and Finding answers, consequently, consists in matching tables as sources of answers to list questions. They elements of these pre-defined lists with a set of re- fetched more than 1000 promising web pages by trieved passages. As a result, they found that online means of a query rewriting strategy that increased resources, such as Wikipedia, slightly improved the the probability of retrieving documents containing an- recall for the TREC 2003 and 2004 list questions sets, swers. This rewriting was based upon the identifi- but not for TREC 2005, despite the wide coverage cation of part-of-speech (POS), Name Entities(NEs) provided by Wikipedia. (Katz et al. 2005) eventu- and a subject-object representation of the prompted ally selected the best answer candidates according to question. Documents are thereafter downloaded and a given threshold. clustered. They also noticed that there is usually a Another common method used by QAS is inter- list or table in the web page containing several po- preting a list question as a traditional factoid query tential answers. Further, they observed that the title and finding its best answers afterwards. In this strat- of pages, where answers are, is likely to contain the egy, low-ranked answers are also cut-off according subject of the relation established by the submitted to a given threshold (Schone et al. 2005). Indeed, query. They extracted then answers and projected widespread techniques for discovering answers to fac- them on the AQUAINT corpus afterwards. In this toid questions based upon redundancy and frequency method, the corpus acts like a filter for misleading counting, tend not to work satisfactorily on list ques- and spurious answers. As a result, they improved the tions, because systems must return all different an- F1 score of the best TREC 2003 system. swers, and thus the less frequent answers also count. (Cederberg and Windows 2003) distinguished pu- Some systems are, for this reason, assisted by sev- tative pairs hyponomy-hypernym on the British Na- eral deep processing tools such as co-reference reso- tional Corpus, by means of the patterns suggested by lution. This way, they handle complex noun phrase (Hearst 1992). They filtered out some spurious rela- constructions and relative clauses (Katz et al. 2005). tions found by these patterns, by inspecting their de- All things considered, QAS are keen on exploiting the gree of relatedness in the semantic space provided by massive redundancy of the web, in order to mitigate Latent Semantic Analysis (LSA) (Deerwester 1990). the lack of redundancy of the AQUAINT corpus, thus They built this semantic space by taking advantage of increasing the chance of detecting answers, while at the representation proposed by (Sch¨utze1997), and as the same time, lessening the need for deep processing.