
Using WordNet into UKB in a Question Answering System for Basque

Olatz Perez de Viñaspre, Maite Oronoz and Olatz Ansa
University of the Basque Country, Donostia
[email protected], {maite.oronoz, [email protected]}

Abstract

This paper presents the use of semantic information at chunk level in a Question Answering system called Ihardetsi. The semantic information has been added through a tool called UKB. For this experiment, UKB uses the Basque WordNet to compute the similarity between the chunks. We use this added information to help Ihardetsi choose the correct answer among all the extracted candidates. Along with the description of the system, we outline its performance by presenting an experiment and the results obtained.

1 Introduction

Question answering systems deal with the task of finding a precise and concrete answer to a natural language question in a document collection. These systems use Information Retrieval (IR) and Natural Language Processing (NLP) techniques to understand the question and to extract the answer.

Ihardetsi (Ansa et al., 2009), a Basque question answering system, takes questions written in Basque as input and retrieves the answers from a corpus also written in Basque. The stable version of Ihardetsi incorporates tools and resources developed in the IXA group, such as the morphosyntactic analyzer (Aduriz et al., 1998) and the named entity recognizer and classifier (Fernandez et al., 2011). Nevertheless, we can assume that the use of more syntactic and semantic information in Ihardetsi will probably improve the quality of the answers obtained. Consider, for instance, the following question in Basque:

"Nor izendatu zuten EEBBtako lehendakari 1944. urtean?" ("Who was appointed president of the US in the year 1944?")

This question belongs to the Gold Standard question bank defined for Basque for the CLEF2008 conference (Forner et al., 2008). In this bank, the answer given as correct for this question is the following:

"Harry Trumanek Franklin Roosevelt ordezkatu zuen EEBBetako lehendakaritzan 1944. urtean." ("Harry Truman replaced Franklin Roosevelt in the presidency of the US in 1944.")

Searching separately for the named entity "EEBB" ("US"), the common noun "lehendakari" ("president") and the date "1944" does not guarantee that the president found by the system will be the president of the US. For example, searching for these three elements in Google returns, among others, the sentence "The decision of President Edwin Barclay (1930-1944) to adopt the US dollar as the sole legal tender in Liberia...", in which the president is from Liberia. The use of chunks, that is, noun and verb phrases such as "EEBBtako lehendakari" ("president of the US"), in the question answering system would reduce its search space.

On the other hand, as the previous example shows, the terms used in the question and in the possible answers sometimes refer to the same concept even though they are not identical ("president of the US" in the question, "presidency of the US" in the best answer). That is one of the reasons why we decided to use the semantic similarity of the chunks to try to improve Ihardetsi. The similarity algorithm we use has the Basque WordNet (Pociello et al., 2010) as its base ontology. As this ontology lacks named entities, we have added some of them, with their corresponding synsets, to the dictionary used by the algorithm.

As seen in the previous example, the use of shallow syntactic information and semantic information seems to be helpful, so we have integrated more linguistic knowledge into Ihardetsi. We have added the Ixati chunker (Aduriz et al., 2004) to the analysis chain and we have used a similarity algorithm implemented in a tool called UKB (Agirre et al., 2009). The chunks are obtained both from the question and from all the candidate answer passages.

The remainder of the paper is organized as follows. Section 2 introduces the general architecture of the system. Section 3 describes the work done to compare semantically the chunks from the questions and from the candidate answers. Section 4 discusses evaluation issues. Finally, Section 5 contains the conclusions and suggestions for future research.

2 Ihardetsi - A QA System for Basque Language

The principles of versatility and adaptability have guided the development of Ihardetsi. It is based on web services integrated through the SOAP (Simple Object Access Protocol) communication protocol. The linguistic tools previously developed in the IXA group are reused as autonomous web services, and the QA system becomes a client that calls these services when needed. This distributed model makes it possible to parameterize the linguistic tools and to adjust the behavior of the system.

As is common in question answering systems, Ihardetsi is based on three main modules: the question analysis module, the passage retrieval module and the answer extraction module. These modules can be seen in Figure 1.

[Figure 1: General architecture of the system. The diagram shows the question analysis module (Morfeus, Eihera, Ixati), the passage retrieval module (query generation and retrieval over the Basque WordNet) and the answer extraction module (candidate extraction and answer selection with UKB).]

Question Analysis: the main goal of this module is to analyze the question and to generate the information needed for the subsequent tasks. Concretely, a set of search terms is extracted for the passage retrieval module, and the expected answer type, along with some lexical and syntactic information, is passed to the answer extraction module. Before our contributions, this module analyzed the questions at the morphological level with an analyzer called Morfeus (Aduriz et al., 1998) and a named entity recognizer called Eihera (Fernandez et al., 2011). After the changes described in this paper, the chunker called Ixati is added to this module, enriching the question analysis linguistic chain.

Passage Retrieval: basically, an information retrieval task is performed, but in this case the retrieved units are passages and not entire documents. This module receives as input the selected query terms and produces a set of queries that are passed to a search engine.

Answer Extraction: in this module two tasks are performed in sequence: candidate extraction and answer selection. Basically, candidate extraction consists of extracting all the candidate answers from the retrieved passages, and answer selection consists of choosing the best answers among the candidates. The chunker is applied to the candidate answer passages extracted by the stable version of Ihardetsi, which uses a kind of "bag of words" technique. For the work presented in this paper, a re-ranking of the candidate answers is performed using the semantic similarity algorithm from UKB. The number of candidates to be shown can be parameterized, but usually five answers are presented to the user.

3 Comparison at chunk level using WordNet

Having applied shallow syntax to the texts involved in the QA process, it is possible to compare syntactically the chunks from the question with the corresponding chunks from the candidate answer passages; the semantic similarity of the chunks can also be measured. Although we have used both syntactic and semantic information to re-rank the answers, in this paper we focus on the semantic side. The next section describes this work in depth.

3.1 Semantic similarity - UKB similarity

UKB is a collection of programs for performing graph-based Word Sense Disambiguation and for measuring lexical similarity/relatedness using a pre-existing knowledge base. It applies the so-called Personalized PageRank over a Lexical Knowledge Base (LKB) to rank the vertices of the LKB and thus performs disambiguation. The algorithm can also be used to calculate the lexical similarity/relatedness of words and sentences (Agirre et al., 2010a; Agirre and Soroa, 2009).

We decided to use UKB for different reasons: i) it is developed by our own research group, the IXA group; ii) it is language independent.

Similarity algorithms measure the semantic similarity and relatedness between terms or texts. The algorithm in UKB is able to estimate the similarity between two texts based on the relations of the LKB senses. The method has basically two steps: first, it computes the Personalized PageRank over WordNet separately for each text, producing a probability distribution over WordNet synsets. Then, it compares how similar these two discrete probability distributions are by encoding them as vectors and computing the cosine between the vectors.

When applying UKB and WordNet to the question answering area, we found some problems related to the semantic ambiguity of the chunks and to the lack of information in WordNet. These problems are explained extensively in Section 4.

3.2 Procedure to get a weight for each candidate answer

The re-ranking of the candidate answers in Ihardetsi is performed by normalizing the weights obtained after the analysis of several syntactic and semantic characteristics. Before explaining this process, in this section we describe some linguistic phenomena used for the definition of the weights, and then we show, by means of an example, which features are taken into account for the weight assignment.

We have defined some syntactic patterns at the shallow syntax level in order to describe the behavior of some interrogative pronouns in Basque, such as "Nor" ("Who"), "Non" ("Where"), "Noiz" ("When") and "Zein" ("Which").
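As a rough illustration of the two-step similarity computation described in Section 3.1, and of its use for re-ranking candidates, consider the following sketch. This is not the authors' implementation: the synset distributions below are invented stand-ins for the output of UKB's Personalized PageRank step, and all identifiers are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse probability distributions
    over WordNet synsets, given as dicts mapping synset ids to mass."""
    dot = sum(p[s] * q[s] for s in p if s in q)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

# Hypothetical Personalized-PageRank output for the question chunk
# "EEBBtako lehendakari" ("president of the US") and for two
# candidate-answer chunks; the probability values are invented.
question = {"president.n.01": 0.6, "us.n.01": 0.4}
candidates = {
    "EEBBetako lehendakaritzan": {"presidency.n.01": 0.3,
                                  "president.n.01": 0.3,
                                  "us.n.01": 0.4},
    "Liberia": {"liberia.n.01": 0.8, "country.n.01": 0.2},
}

# Re-rank candidates by semantic similarity to the question chunk.
ranking = sorted(candidates,
                 key=lambda c: cosine(question, candidates[c]),
                 reverse=True)
print(ranking[0])  # prints "EEBBetako lehendakaritzan"
```

The key point the sketch captures is that the two distributions need not share surface forms: the question and answer chunks overlap in the synsets activated by PageRank, so the US-presidency chunk outranks the unrelated "Liberia" chunk even though "lehendakari" and "lehendakaritzan" are different word forms.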