Experiments Adapting an Open-Domain Question Answering System to the Geographical Domain Using Scope-Based Resources
Daniel Ferrés and Horacio Rodríguez
TALP Research Center, Software Department
Universitat Politècnica de Catalunya
{dferres, horacio}@lsi.upc.edu
Abstract

This paper describes an approach to adapt an existing multilingual Open-Domain Question Answering (ODQA) system for factoid questions to a Restricted Domain, the Geographical Domain. The adaptation of this ODQA system involved the modification of some components of our system, such as Question Processing, Passage Retrieval and Answer Extraction. The new system uses external resources like the GNS gazetteer for Named Entity (NE) classification, and Wikipedia or Google in order to obtain relevant documents for this domain. The system focuses on a Geographical Scope: given a region or country and a language, we can semi-automatically obtain multilingual geographical resources (e.g. gazetteers, trigger words, groups of place names, etc.) of this scope. The system has been trained and evaluated for Spanish in the scope of the Spanish Geography. The evaluation reveals that the use of scope-based geographical resources is a good approach to deal with multilingual Geographical Domain Question Answering.

1 Introduction

Question Answering (QA) is the task of, given a query expressed in Natural Language (NL), retrieving its correct answer (a single item, a text snippet, ...). QA has become a popular task in the NL Processing (NLP) research community in the framework of different international ODQA evaluation contests such as the Text Retrieval Conference (TREC) for English, the Cross-Lingual Evaluation Forum (CLEF) for European languages, and NTCIR for Asian languages.

In this paper we describe our experiments in the adaptation and evaluation of an ODQA system to a Restricted Domain, the Geographical Domain. GeoTALP-QA is a multilingual Geographical Domain Question Answering (GDQA) system. This Restricted Domain Question Answering (RDQA) system has been built over an existing ODQA system, TALP-QA, a multilingual ODQA system that processes both factoid and definition questions (see (Ferrés et al., 2005) and (Ferrés et al., 2004)). The system was evaluated for Spanish and English in the context of our participation in the TREC and CLEF conferences in 2005, and has been adapted into a multilingual GDQA system for factoid questions.

As pointed out in (Benamara, 2004), the Geographical Domain (GD) can be considered a middle way between real Restricted Domains and open ones, because many open-domain texts contain a high density of geographical terms.

Although the basic architecture of TALP-QA has remained unchanged, a set of QA components were redesigned and modified, and we had to add some components specific to the GD to our QA system. The basic approach in TALP-QA consists of applying language-dependent processes to both question and passages to obtain a language-independent semantic representation, and then extracting a set of Semantic Constraints (SC) for each question. Then, an answer extraction algorithm extracts and ranks sentences that satisfy the SCs of the question. Finally, an answer selection module chooses the most appropriate answer.

We outline below the organization of the paper. In the next section we present some characteristics of RDQA systems. In Section 3, we present the overall architecture of GeoTALP-QA and briefly describe its main components, focusing on those that have been adapted from an ODQA to a GDQA. The Scope-Based Resources needed for the experimentation and the experiments themselves are presented in Sections 4 and 5. In Section 6 we present the results obtained over a GD corpus. Finally, in Sections 7 and 8 we describe our conclusions and future work.

2 Restricted Domain QA Systems

RDQAs present some characteristics that prevent a direct use of ODQA systems. The most important differences are:

• Usually, for RDQA, the answers are searched in relatively small domain-specific collections, so methods based on exploiting the redundancy of answers across several documents are not useful. Furthermore, a highly accurate Passage Retrieval module is required, because frequently the answer occurs in a very small set of passages.

• RDQAs are frequently task-based, so the repertory of question patterns is limited, allowing good accuracy in Question Processing with limited effort.

• User requirements regarding the quality of the answer tend to be higher in RDQA. As (Chung et al., 2004) pointed out, no answer is preferred to a wrong answer.

• In RDQA not only NEs but also domain-specific terminology plays a central role. This fact usually implies that domain-specific lexicons and gazetteers have to be used.

• In some cases, as in the GD, many documents included in the collections are far from being standard NL texts; they contain tables, lists, ill-formed sentences, etc., sometimes following a more or less defined structure. Thus, extrac- […]

3 System Description

GeoTALP-QA has been developed within the framework of the ALIADO project. The system architecture uses a common schema with three phases that are performed sequentially without feedback: Question Processing (QP), Passage Retrieval (PR) and Answer Extraction (AE). More details about this architecture can be found in (Ferrés et al., 2005) and (Ferrés et al., 2004).

Before describing these subsystems, we introduce some additional knowledge sources that have been added to our system for dealing with the geographic domain, and some language-dependent NLP tools for English and Spanish. Our aim is to develop a language-independent system (at least able to work with English and Spanish). Language-dependent components are only included in the Question Pre-processing and Passage Pre-processing components, and can be easily substituted by components for other languages.

3.1 Additional Knowledge Sources

One of the most important tasks in GDQA is to detect and classify NEs with their correct geographical subclass (see the classes in Section 3.3). We use geographical scope-based Knowledge Bases (KB) to solve this problem. These KBs can be built using the following resources:

• GEOnet Names Server (GNS). A worldwide gazetteer, excluding the USA and Antarctica, with 5.3 million entries.

• Geographic Names Information System (GNIS). A gazetteer with 2.0 million entries about geographic features of the USA.

• Grammars for creating NE aliases. Geographic NEs tend to occur in a great variety of forms. It is important to take this into account to avoid losing occurrences. A set of patterns for expanding them has been created (e.g. […])
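To make the alias-grammar idea concrete, pattern-based expansion can be sketched as follows. This is a minimal illustrative sketch, not the actual GeoTALP-QA grammar: the pattern set, the example name, and the class label are invented for illustration.

```python
# Illustrative sketch of "grammars for creating NE aliases" (Section 3.1).
# The patterns, the example name, and the class label below are invented
# for illustration; they are NOT the actual GeoTALP-QA patterns.

SUMMIT_PATTERNS = [
    "{name}",          # bare name: "Aneto"
    "Pico {name}",     # "Pico Aneto"
    "Pico de {name}",  # "Pico de Aneto"
    "Monte {name}",    # "Monte Aneto"
]

def expand_aliases(name, patterns=SUMMIT_PATTERNS):
    """Generate surface-form variants of a place name so that
    occurrences in text are not missed."""
    return {p.format(name=name) for p in patterns}

# Every alias maps back to one canonical entry and geographical subclass,
# so NE classification becomes a dictionary lookup over the KB:
kb = {alias: ("Aneto", "summit") for alias in expand_aliases("Aneto")}
```

Expanding the patterns over a gazetteer such as Albayzin, as the paper describes for the summit and state classes, would simply apply `expand_aliases` to every entry of those classes.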
4 Resources for Scope-Based Experiments

In this section we describe how we obtained the resources needed to carry out experiments in the Spanish Geography domain using Spanish. These resources were: the question corpus (validation and test), the document collection required by the off-line ODQA Passage Retrieval, and the geographical scope-based resources. Finally, we describe the experiments performed.

4.1 Language and Scope-Based Geographical Question Corpus

We obtained a corpus of geographical questions from Albayzin, a speech corpus (Diaz et al., 1998) that contains a geographical subcorpus with utterances of questions about the geography of Spain in Spanish. We obtained from Albayzin a set of 6887 question patterns. We analyzed this corpus and extracted the following types of questions: Partial Direct, Partial Indirect, and Imperative Interrogative factoid questions with a simple level of difficulty (i.e. questions without nested questions). We selected a set of 2287 question patterns.

As a question corpus we randomly selected a set of 177 question patterns from the previous selection (see Table 3). These patterns have been randomly instantiated with geographical NEs of the Albayzin corpus. Then, we searched for the answers in the Web and the Spanish Wikipedia (SW). The results of this process were: 123 questions with an answer in the SW and the Web, 33 questions without an answer in the SW but with an answer using the Web, and, finally, 21 questions without an answer (due to the fact that some questions, once instantiated, cannot be answered; e.g. "which sea bathes the coast of Madrid?"). We divided the 123 questions with an answer in the SW into two sets: 61 questions for development (setting thresholds and other parameters) and 62 for test.

Table 3: Some question patterns from Albayzin.

4.2 Document Collection for ODQA Passage Retrieval

In order to test our ODQA Passage Retrieval system we need a document collection with enough geographical information to resolve the questions of the Albayzin corpus. We used the filtered Spanish Wikipedia (http://es.wikipedia.org). First, we obtained the original set of documents (26235 files). Then, we selected two sets of 120 documents each, one about the Spanish geography domain and one about the non-Spanish geography domain. Using these sets we obtained a set of Topic Signatures (TS) (Lin and Hovy, 2000) for the Spanish geography domain and another set of TS for the non-Spanish geography domain. Then, we used these TS to filter the documents from Wikipedia, and we obtained a set of 8851 documents pertaining to the Spanish geography domain. These documents were pre-processed and indexed.

4.3 Geographical Scope-Based Resources

A Knowledge Base (KB) of Spanish Geography has been built using four resources:

• GNS: we obtained a set of 32222 non-ambiguous place names of Spain.

• Albayzin Gazetteer: a set of 758 places.

• A grammar for creating NE aliases: we created patterns for the summit and state classes (the ones with the greatest variety of forms), and we expanded these patterns using the entries of Albayzin.

• A lexicon of 462 trigger words.

We obtained a set of 7632 groups of place names using the grouping process over GNS. These groups contain a total of 17617 place names, with an average of 2.51 place names per group. See in Figure 1 an example of a group, where the canonical term appears underlined.

{Cordillera Pirenaica, Pireneus, Pirineos, Pyrenaei Montes, Pyrénées, Pyrene, Pyrenees}

Figure 1: Example of a group obtained from GNS.

In addition, a set of the most common trigger phrases in the domain has been obtained from the GNS gazetteer (see Table 4).

                        Geographical Scope
                 Spain                 UK
  Top-ranked     TRIGGER de NE         NE TRIGGER
  Trigger        TRIGGER NE            TRIGGER NE
  Phrases        TRIGGER del NE        TRIGGER of NE
                 TRIGGER de la NE      TRIGGER a' NE
                 TRIGGER de las NE     TRIGGER na NE

Table 4: Sample of the top-ranked trigger phrases automatically obtained from the GNS gazetteer for the geography of Spain and the UK.

5 Experiments

We have designed some experiments in order to evaluate the accuracy of the GDQA system and its subsystems (QP, PR, and AE). For PR, we evaluated the web-based snippet retrieval using Google with several variants of query expansion, versus our ODQA Passage Retrieval over the SW corpus. Then, the passages (or snippets) retrieved by the best PR approach were used by the two different Answer Extraction algorithms. The ODQA Answer Extractor has been evaluated taking into account the answers that have a supported context in the set of passages (or snippets). Finally, we evaluated the global results of the complete QA process with the different Answer Extractors: ODQA and Frequency-Based.

6 Results

This section evaluates the behavior of our GDQA system over a test corpus of 62 questions and reports the errors detected in the best run. We evaluated the three main components of our system and the global results.

• Question Processing. The Question Classification task has been manually evaluated. This subsystem has an accuracy of 96.77%.

• Passage Retrieval. The evaluation of this subsystem was performed using a set of correct answers (see Table 5). We computed the answer accuracy: it takes into account the number of questions that have a correct answer in their set of passages.

                Retrieval Accuracy at N passages/snippets
  Mode           N=10     N=20     N=50     N=100
  Google         0.6612   0.6935   0.7903   0.8225
  +TWJ           0.6612   0.6774   0.7419   0.7580
  +TWJ+TWE       0.6612   0.6774   0.7419   0.7580
  +CE            0.6612   0.6774   0.7741   0.8064
  +QBE           0.8064   0.8387   0.9032   0.9354
  +TWJ+QB+CE     0.7903   0.8064   0.8548   0.8870
  Google+All     0.7903   0.8064   0.8548   0.8870
  ODQA+Wiki      0.4354   0.4516   0.4677   0.5000

Table 5: Passage Retrieval results (refer to Section 3.4.2 for detailed information on the different query expansion acronyms).

• Answer Extraction. The evaluation of the ODQA Answer Extractor subsystem is shown in Table 6. We evaluated the accuracy taking into account the number of answers that are correct and supported by the passages, divided by the total number of questions that have a supported answer in their set of passages. This evaluation has been done using the results of the top-ranked retrieval configuration over the development set: the Google+TWJ+QB+CE configuration of the snippet retriever.

          Accuracy at N Snippets
  N=10            N=20            N=50
  0.2439 (10/41)  0.3255 (14/43)  0.3333 (16/48)

Table 6: Results of the ODQA Answer Extraction subsystem (accuracy).

Table 7 shows the global results of the two QA Answer Extractors used (ODQA and Frequency-Based). The passages retrieved by the Google+TWJ+QB+CE configuration of the snippet retriever were used.

                      Accuracy
  Num. Snippets  ODQA            Freq-based
  10             0.1774 (11/62)  0.5645 (35/62)
  20             0.2580 (16/62)  0.5967 (37/62)
  50             0.3387 (21/62)  0.6290 (39/62)

Table 7: QA results over the test set.

We analyzed the 23 questions that failed in our best run. The analysis detected that 10 questions had no answer in their set of passages. For 5 of these questions this is due to an uncommon question or location; the other 5 questions have problems with ambiguous trigger words (e.g. capital) that confuse the web-search engine. On the other hand, 13 questions had the answer in their set of passages but were incorrectly answered. The reasons are mainly the lack of passages with the answer (8), answer validation and spatial reasoning (3), multilabel geographical NERC (1), and the need for more context in the snippets (1).

7 Evaluation and Conclusions

This paper summarizes our experiments adapting an ODQA system to the GD and its evaluation in Spanish in the scope of the Spanish Geography. Out of 62 questions, our system provided the correct answer to 39 questions in the experiment with the best results.

Our Passage Retrieval for ODQA offers less attractive results when using the SW corpus. The problem of using the SW to extract the answers is that it yields few documents with the correct answer, and it is difficult to extract the answer because the documents contain tables, lists, ill-formed sentences, etc. Our ODQA AE needs grammatically well-structured text to extract the answers correctly. The QA system offers a low performance (33% accuracy) when using this AE over the web-based retrieved passages. In some cases the snippets are cut, and we could expect a better performance retrieving the whole documents from Google.

On the other hand, web-based snippet retrieval, with only one query per question, gives good results in Passage Retrieval. The QA system with the Frequency-Based AE obtained better results than with the ODQA AE (62.9% accuracy).

Finally, we conclude that our approach using geographical scope-based resources is notably helpful to deal with multilingual Geographical Domain Question Answering.

8 Future Work

As future work we plan to improve the AE module using a semantic analysis with extended contexts (i.e. more than one sentence) and adding some spatial reasoning. We also want to improve the retrieval by crawling relevant documents from web search engines instead of using snippets; this could be a good method to find more sentences with supported answers. Finally, we expect to run tests with English in another scope.

Acknowledgements

This work has been partially supported by the European Commission (CHIL, IST-2004-506909) and the Spanish Research Dept. (ALIADO, TIC2002-04447-C02). Daniel Ferrés is supported by a UPC-Recerca grant from Universitat Politècnica de Catalunya (UPC). TALP Research Center is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government.

References

F. Benamara. 2004. Cooperative Question Answering in Restricted Domains: the WEBCOOP Experiment. In Proceedings of the Workshop on Question Answering in Restricted Domains, within ACL-2004.

H. Chung, Y. Song, K. Han, D. Yoon, J. Lee, H. Rim, and S. Kim. 2004. A Practical QA System in Restricted Domains. In Proceedings of the Workshop on Question Answering in Restricted Domains, within ACL-2004.

J. Diaz, A. Rubio, A. Peinado, E. Segarra, N. Prieto, and F. Casacuberta. 1998. Development of Task-Oriented Spanish Speech Corpora. In Proceedings of the First International Conference on Language Resources and Evaluation, pages 497–501, Granada, Spain, May. ELDA.

D. Ferrés, S. Kanaan, A. Ageno, E. González, H. Rodríguez, M. Surdeanu, and J. Turmo. 2004. The TALP-QA System for Spanish at CLEF 2004: Structural and Hierarchical Relaxing of Semantic Constraints. In C. Peters, P. Clough, J. Gonzalo, G. J. F. Jones, M. Kluck, and B. Magnini, editors, CLEF, volume 3491 of Lecture Notes in Computer Science, pages 557–568. Springer.

D. Ferrés, S. Kanaan, E. González, A. Ageno, H. Rodríguez, M. Surdeanu, and J. Turmo. 2005. TALP-QA System at TREC 2004: Structural and Hierarchical Relaxation Over Semantic Constraints. In Proceedings of the Text Retrieval Conference (TREC-2004).

C.-Y. Lin and E. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In COLING, pages 495–501. Morgan Kaufmann.

D. Mollá and J.L. Vicedo. 2005. AAAI-05 Workshop on Question Answering in Restricted Domains. AAAI Press. To appear.
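As a closing illustration, the topic-signature document filtering described in Section 4.2 can be sketched as follows. This is a minimal sketch under stated assumptions: the frequency-ratio score below is a crude stand-in for the likelihood-ratio weights of (Lin and Hovy, 2000), and the function names and threshold are ours, not the paper's.

```python
from collections import Counter

def topic_signature(domain_docs, background_docs, top_k=100):
    """Rank terms by how much more frequent they are in the domain
    documents than in the background collection. The ratio used here
    is a crude stand-in for the likelihood-ratio weights of Lin and
    Hovy (2000); it only illustrates the shape of the computation."""
    dom = Counter(w for d in domain_docs for w in d.lower().split())
    bg = Counter(w for d in background_docs for w in d.lower().split())
    score = {w: c / (1 + bg[w]) for w, c in dom.items()}
    return set(sorted(score, key=score.get, reverse=True)[:top_k])

def filter_docs(docs, signature, threshold=2):
    """Keep only documents containing at least `threshold` signature terms."""
    return [d for d in docs
            if sum(w in signature for w in set(d.lower().split())) >= threshold]
```

In the paper's setting, two seed sets of 120 Wikipedia documents each would play the roles of `domain_docs` and `background_docs`, and `filter_docs` would select the in-domain subset of the full collection.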