Experiments Adapting an Open-Domain Question Answering System to the Geographical Domain Using Scope-Based Resources
Daniel Ferrés and Horacio Rodríguez
TALP Research Center, Software Department
Universitat Politècnica de Catalunya
{dferres, horacio}@lsi.upc.edu
Abstract

This paper describes an approach to adapt an existing multilingual Open-Domain Question Answering (ODQA) system for factoid questions to a Restricted Domain, the Geographical Domain. The adaptation of this ODQA system involved the modification of some components of our system, such as Question Processing, Passage Retrieval and Answer Extraction. The new system uses external resources like the GNS gazetteer for Named Entity (NE) classification, and Wikipedia or Google in order to obtain relevant documents for this domain. The system focuses on a Geographical Scope: given a region or country and a language, we can semi-automatically obtain multilingual geographical resources (e.g. gazetteers, trigger words, groups of place names, etc.) of this scope. The system has been trained and evaluated for Spanish in the scope of the Spanish Geography. The evaluation reveals that the use of scope-based geographical resources is a good approach to deal with multilingual Geographical Domain Question Answering.

1 Introduction

Question Answering (QA) is the task of, given a query expressed in Natural Language (NL), retrieving its correct answer (a single item, a text snippet, ...). QA has become a popular task in the NL Processing (NLP) research community in the framework of different international ODQA evaluation contests such as the Text Retrieval Conference (TREC) for English, the Cross-Lingual Evaluation Forum (CLEF) for European languages, and NTCIR for Asian languages.

In this paper we describe our experiments in the adaptation and evaluation of an ODQA system to a Restricted Domain, the Geographical Domain. GeoTALP-QA is a multilingual Geographical Domain Question Answering (GDQA) system. This Restricted Domain Question Answering (RDQA) system has been built over an existing ODQA system, TALP-QA, a multilingual ODQA system that processes both factoid and definition questions (see (Ferrés et al., 2005) and (Ferrés et al., 2004)). The system was evaluated for Spanish and English in the context of our participation in the TREC and CLEF conferences in 2005, and has been adapted into a multilingual GDQA system for factoid questions.

As pointed out in (Benamara, 2004), the Geographical Domain (GD) can be considered a middle way between real Restricted Domains and open ones, because many open-domain texts contain a high density of geographical terms.

Although the basic architecture of TALP-QA has remained unchanged, a set of QA components were redesigned and modified, and we had to add some components specific to the GD to our QA system. The basic approach in TALP-QA consists of applying language-dependent processes to both question and passages to obtain a language-independent semantic representation, and then extracting a set of Semantic Constraints (SC) for each question. Then, an answer extraction algorithm extracts and ranks sentences that satisfy the SCs of the question. Finally, an answer selection module chooses the most appropriate answer.

We outline below the organization of the paper. In the next section we present some characteristics of RDQA systems. In Section 3, we present the overall architecture of GeoTALP-QA and briefly describe its main components, focusing on those that have been adapted from an ODQA to a GDQA. The Scope-Based Resources needed for the experimentation and the experiments themselves are presented in Sections 4 and 5. In Section 6 we present the results obtained over a GD corpus. Finally, in Sections 7 and 8 we describe our conclusions and future work.

2 Restricted Domain QA Systems

RDQAs present some characteristics that prevent a direct use of ODQA systems. The most important differences are:

• Usually, for RDQA, the answers are searched in relatively small domain-specific collections, so methods based on exploiting the redundancy of answers across several documents are not useful. Furthermore, a highly accurate Passage Retrieval module is required, because frequently the answer occurs in a very small set of passages.

• RDQAs are frequently task-based, so the repertory of question patterns is limited, allowing good accuracy in Question Processing with limited effort.

• User requirements regarding the quality of the answer tend to be higher in RDQA. As (Chung et al., 2004) pointed out, no answer is preferred to a wrong answer.

• In RDQA not only NEs but also domain-specific terminology plays a central role. This fact usually implies that domain-specific lexicons and gazetteers have to be used.

• In some cases, as in the GD, many documents included in the collections are far from being standard NL texts; they contain tables, lists, ill-formed sentences, etc., sometimes following a more or less defined structure. Thus, extrac- […]

3 System Description

GeoTALP-QA has been developed within the framework of the ALIADO project. The system architecture uses a common schema with three phases that are performed sequentially without feedback: Question Processing (QP), Passage Retrieval (PR) and Answer Extraction (AE). More details about this architecture can be found in (Ferrés et al., 2005) and (Ferrés et al., 2004).

Before describing these subsystems, we introduce some additional knowledge sources that have been added to our system for dealing with the geographic domain, and some language-dependent NLP tools for English and Spanish. Our aim is to develop a language-independent system (at least able to work with English and Spanish). Language-dependent components are only included in the Question Pre-processing and Passage Pre-processing components, and can be easily substituted by components for other languages.

3.1 Additional Knowledge Sources

One of the most important tasks in GDQA is to detect and classify NEs with their correct geographical subclass (see the classes in Section 3.3). We use geographical scope-based Knowledge Bases (KB) to solve this problem. These KBs can be built using the following resources:

• GEOnet Names Server (GNS). A worldwide gazetteer, excluding the USA and Antarctica, with 5.3 million entries.

• Geographic Names Information System (GNIS). A gazetteer with 2.0 million entries about geographic features of the USA.

• Grammars for creating NE aliases. Geographic NEs tend to occur in a great variety of forms. It is important to take this into account to avoid losing occurrences. A set of patterns for expanding them has been created (e.g. […])
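To make the alias-grammar idea concrete, pattern-based expansion can be sketched as follows. This is a minimal illustrative sketch, not the actual GeoTALP-QA grammar: the pattern set, the example name, and the class label are invented for illustration.

```python
# Illustrative sketch of "grammars for creating NE aliases" (Section 3.1).
# The patterns, the example name, and the class label below are invented
# for illustration; they are NOT the actual GeoTALP-QA patterns.

SUMMIT_PATTERNS = [
    "{name}",          # bare name: "Aneto"
    "Pico {name}",     # "Pico Aneto"
    "Pico de {name}",  # "Pico de Aneto"
    "Monte {name}",    # "Monte Aneto"
]

def expand_aliases(name, patterns=SUMMIT_PATTERNS):
    """Generate surface-form variants of a place name so that
    occurrences in text are not missed."""
    return {p.format(name=name) for p in patterns}

# Every alias maps back to one canonical entry and geographical subclass,
# so NE classification becomes a dictionary lookup over the KB:
kb = {alias: ("Aneto", "summit") for alias in expand_aliases("Aneto")}
```

Expanding the patterns over a gazetteer such as Albayzin, as the paper describes for the summit and state classes, would simply apply `expand_aliases` to every entry of those classes.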
4 Resources for Scope-Based Experiments

In this section we describe how we obtained the resources needed to carry out experiments in the Spanish Geography domain using Spanish. These resources were: the question corpus (validation and test), the document collection required by the off-line ODQA Passage Retrieval, and the geographical scope-based resources. Finally, we describe the experiments performed.

4.1 Language and Scope-Based Geographical Question Corpus

We obtained a corpus of geographical questions from Albayzin, a speech corpus (Diaz et al., 1998) that contains a geographical subcorpus with utterances of questions about the geography of Spain in Spanish. We obtained from Albayzin a set of 6887 question patterns. We analyzed this corpus and extracted the following types of questions: Partial Direct, Partial Indirect, and Imperative Interrogative factoid questions with a simple level of difficulty (i.e. questions without nested questions). We selected a set of 2287 question patterns.

As a question corpus we randomly selected a set of 177 question patterns from the previous selection (see Table 3). These patterns have been randomly instantiated with geographical NEs of the Albayzin corpus. Then, we searched for the answers in the Web and the Spanish Wikipedia (SW). The results of this process were: 123 questions with an answer in the SW and the Web, 33 questions without an answer in the SW but with an answer using the Web, and, finally, 21 questions without an answer (due to the fact that some questions, once instantiated, cannot be answered; e.g. "which sea bathes the coast of Madrid?"). We divided the 123 questions with an answer in the SW into two sets: 61 questions for development (setting thresholds and other parameters) and 62 for test.

Table 3: Some question patterns from Albayzin.

4.2 Document Collection for ODQA Passage Retrieval

In order to test our ODQA Passage Retrieval system we need a document collection with enough geographical information to resolve the questions of the Albayzin corpus. We used the filtered Spanish Wikipedia (http://es.wikipedia.org). First, we obtained the original set of documents (26235 files). Then, we selected two sets of 120 documents each, one about the Spanish geography domain and one about the non-Spanish geography domain. Using these sets we obtained a set of Topic Signatures (TS) (Lin and Hovy, 2000) for the Spanish geography domain and another set of TS for the non-Spanish geography domain. Then, we used these TS to filter the documents from Wikipedia, and we obtained a set of 8851 documents pertaining to the Spanish geography domain. These documents were pre-processed and indexed.

4.3 Geographical Scope-Based Resources

A Knowledge Base (KB) of Spanish Geography has been built using four resources:

• GNS: we obtained a set of 32222 non-ambiguous place names of Spain.

• Albayzin Gazetteer: a set of 758 places.

• A grammar for creating NE aliases: we created patterns for the summit and state classes (the ones with the greatest variety of forms), and we expanded these patterns using the entries of Albayzin.

• A lexicon of 462 trigger words.

We obtained a set of 7632 groups of place names using the grouping process over GNS. These groups contain a total of 17617 place names, with an average of 2.51 place names per group. See in Figure 1 an example of a group, where the canonical term appears underlined.

{Cordillera Pirenaica, Pireneus, Pirineos, Pyrenaei Montes, Pyrénées, Pyrene, Pyrenees}

Figure 1: Example of a group obtained from GNS.

In addition, a set of the most common trigger phrases in the domain has been obtained from the GNS gazetteer (see Table 4).

                        Geographical Scope
                 Spain                 UK
  Top-ranked     TRIGGER de NE         NE TRIGGER
  Trigger        TRIGGER NE            TRIGGER NE
  Phrases        TRIGGER del NE        TRIGGER of NE
                 TRIGGER de la NE      TRIGGER a' NE
                 TRIGGER de las NE     TRIGGER na NE

Table 4: Sample of the top-ranked trigger phrases automatically obtained from the GNS gazetteer for the geography of Spain and the UK.

5 Experiments

We have designed some experiments in order to evaluate the accuracy of the GDQA system and its subsystems (QP, PR, and AE). For PR, we evaluated the web-based snippet retrieval using Google with several variants of query expansion, versus our ODQA Passage Retrieval over the SW corpus. Then, the passages (or snippets) retrieved by the best PR approach were used by the two different Answer Extraction algorithms. The ODQA Answer Extractor has been evaluated taking into account the answers that have a supported context in the set of passages (or snippets). Finally, we evaluated the global results of the complete QA process with the different Answer Extractors: ODQA and Frequency-Based.

6 Results

This section evaluates the behavior of our GDQA system over a test corpus of 62 questions and reports the errors detected in the best run. We evaluated the three main components of our system and the global results.

• Question Processing. The Question Classification task has been manually evaluated. This subsystem has an accuracy of 96.77%.

• Passage Retrieval. The evaluation of this subsystem was performed using a set of correct answers (see Table 5). We computed the answer accuracy: it takes into account the number of questions that have a correct answer in their set of passages.

                Retrieval Accuracy at N passages/snippets
  Mode           N=10     N=20     N=50     N=100
  Google         0.6612   0.6935   0.7903   0.8225
  +TWJ           0.6612   0.6774   0.7419   0.7580
  +TWJ+TWE       0.6612   0.6774   0.7419   0.7580
  +CE            0.6612   0.6774   0.7741   0.8064
  +QBE           0.8064   0.8387   0.9032   0.9354
  +TWJ+QB+CE     0.7903   0.8064   0.8548   0.8870
  Google+All     0.7903   0.8064   0.8548   0.8870
  ODQA+Wiki      0.4354   0.4516   0.4677   0.5000

Table 5: Passage Retrieval results (refer to Section 3.4.2 for detailed information on the different query expansion acronyms).

• Answer Extraction. The evaluation of the ODQA Answer Extractor subsystem is shown in Table 6. We evaluated the accuracy taking into account the number of answers that are correct and supported by the passages, divided by the total number of questions that have a supported answer in their set of passages. This evaluation has been done using the results of the top-ranked retrieval configuration over the development set: the Google+TWJ+QB+CE configuration of the snippet retriever.

          Accuracy at N Snippets
  N=10            N=20            N=50
  0.2439 (10/41)  0.3255 (14/43)  0.3333 (16/48)

Table 6: Results of the ODQA Answer Extraction subsystem (accuracy).

Table 7 shows the global results of the two QA Answer Extractors used (ODQA and Frequency-Based). The passages retrieved by the Google+TWJ+QB+CE configuration of the snippet retriever were used.

                      Accuracy
  Num. Snippets  ODQA            Freq-based
  10             0.1774 (11/62)  0.5645 (35/62)
  20             0.2580 (16/62)  0.5967 (37/62)
  50             0.3387 (21/62)  0.6290 (39/62)

Table 7: QA results over the test set.

We analyzed the 23 questions that failed in our best run. The analysis detected that 10 questions had no answer in their set of passages. For 5 of these questions this is due to an uncommon question or location; the other 5 questions have problems with ambiguous trigger words (e.g. capital) that confuse the web-search engine. On the other hand, 13 questions had the answer in their set of passages but were incorrectly answered. The reasons are mainly the lack of passages with the answer (8), answer validation and spatial reasoning (3), multilabel geographical NERC (1), and the need for more context in the snippets (1).

7 Evaluation and Conclusions

This paper summarizes our experiments adapting an ODQA system to the GD and its evaluation in Spanish in the scope of the Spanish Geography. Out of 62 questions, our system provided the correct answer to 39 questions in the experiment with the best results.

Our Passage Retrieval for ODQA offers less attractive results when using the SW corpus. The problem of using the SW to extract the answers is that it yields few documents with the correct answer, and it is difficult to extract the answer because the documents contain tables, lists, ill-formed sentences, etc. Our ODQA AE needs grammatically well-structured text to extract the answers correctly. The QA system offers a low performance (33% accuracy) when using this AE over the web-based retrieved passages. In some cases the snippets are cut, and we could expect a better performance retrieving the whole documents from Google.

On the other hand, web-based snippet retrieval, with only one query per question, gives good results in Passage Retrieval. The QA system with the Frequency-Based AE obtained better results than with the ODQA AE (62.9% accuracy).

Finally, we conclude that our approach using geographical scope-based resources is notably helpful to deal with multilingual Geographical Domain Question Answering.

8 Future Work

As future work we plan to improve the AE module using a semantic analysis with extended contexts (i.e. more than one sentence) and adding some spatial reasoning. We also want to improve the retrieval by crawling relevant documents from web search engines instead of using snippets; this could be a good method to find more sentences with supported answers. Finally, we expect to run tests with English in another scope.

Acknowledgements

This work has been partially supported by the European Commission (CHIL, IST-2004-506909) and the Spanish Research Dept. (ALIADO, TIC2002-04447-C02). Daniel Ferrés is supported by a UPC-Recerca grant from Universitat Politècnica de Catalunya (UPC). TALP Research Center is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government.

References

F. Benamara. 2004. Cooperative Question Answering in Restricted Domains: the WEBCOOP Experiment. In Proceedings of the Workshop on Question Answering in Restricted Domains, within ACL-2004.

H. Chung, Y. Song, K. Han, D. Yoon, J. Lee, H. Rim, and S. Kim. 2004. A Practical QA System in Restricted Domains. In Proceedings of the Workshop on Question Answering in Restricted Domains, within ACL-2004.

J. Diaz, A. Rubio, A. Peinado, E. Segarra, N. Prieto, and F. Casacuberta. 1998. Development of Task-Oriented Spanish Speech Corpora. In Proceedings of the First International Conference on Language Resources and Evaluation, pages 497–501, Granada, Spain, May. ELDA.

D. Ferrés, S. Kanaan, A. Ageno, E. González, H. Rodríguez, M. Surdeanu, and J. Turmo. 2004. The TALP-QA System for Spanish at CLEF 2004: Structural and Hierarchical Relaxing of Semantic Constraints. In C. Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, and B. Magnini, editors, CLEF, volume 3491 of Lecture Notes in Computer Science, pages 557–568. Springer.

D. Ferrés, S. Kanaan, E. González, A. Ageno, H. Rodríguez, M. Surdeanu, and J. Turmo. 2005. TALP-QA System at TREC 2004: Structural and Hierarchical Relaxation Over Semantic Constraints. In Proceedings of the Text Retrieval Conference (TREC-2004).

C.-Y. Lin and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In COLING, pages 495–501. Morgan Kaufmann.

D. Mollá and J.L. Vicedo. 2005. AAAI-05 Workshop on Question Answering in Restricted Domains. AAAI Press. To appear.
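As a closing illustration, the topic-signature document filtering described in Section 4.2 can be sketched as follows. This is a minimal sketch under stated assumptions: the frequency-ratio score below is a crude stand-in for the likelihood-ratio weights of (Lin and Hovy, 2000), and the function names and threshold are ours, not the paper's.

```python
from collections import Counter

def topic_signature(domain_docs, background_docs, top_k=100):
    """Rank terms by how much more frequent they are in the domain
    documents than in the background collection. The ratio used here
    is a crude stand-in for the likelihood-ratio weights of Lin and
    Hovy (2000); it only illustrates the shape of the computation."""
    dom = Counter(w for d in domain_docs for w in d.lower().split())
    bg = Counter(w for d in background_docs for w in d.lower().split())
    score = {w: c / (1 + bg[w]) for w, c in dom.items()}
    return set(sorted(score, key=score.get, reverse=True)[:top_k])

def filter_docs(docs, signature, threshold=2):
    """Keep only documents containing at least `threshold` signature terms."""
    return [d for d in docs
            if sum(w in signature for w in set(d.lower().split())) >= threshold]
```

In the paper's setting, two seed sets of 120 Wikipedia documents each would play the roles of `domain_docs` and `background_docs`, and `filter_docs` would select the in-domain subset of the full collection.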