Geographical Information Resolution and Its Application to the Question Answering Systems
Total Page:16
File Type:pdf, Size:1020Kb
Geographical Information Resolution and its Application to the Question Answering Systems Daniel Ferres´ Domenech` [email protected] Director Horacio Rodr´ıguez Hontoria Memoria` del DEA i Projecte de Tesi Programa de Doctorat en Intel·ligencia` Artificial Departament de Llenguatges i Sistemes Informatics` Universitat Politecnica` de Catalunya 12 de Gener de 2007 2 Contents 1 Introduction 7 1.1 Structure of the Document . 8 2 Question Answering - State-of-the-art 9 2.1 History . 9 2.2 Classification of QA Systems . 11 2.2.1 Classification by Questioner Type . 11 2.2.2 Classification by Knowledge Used . 12 2.2.3 Classification by Question Types . 13 2.2.4 Classification by Domain . 14 2.2.5 Classification by Information Access . 15 2.3 Architecture of QA Systems . 15 2.3.1 Natural Language Analysis . 16 2.3.2 Question Classification . 23 2.3.3 Passage Retrieval . 25 2.3.4 Answer Extraction . 29 2.3.5 QA Architectures at TREC 2006 . 31 2.3.6 Cross-Lingual and non-English QA Systems . 33 2.4 Evaluation of QA systems . 34 2.4.1 QA Evaluation Frameworks . 34 2.5 Metrics of Factoid QA Systems . 35 2.5.1 QA Capabilities . 35 2.5.2 Question Processing . 36 2.5.3 Question Classification . 36 2.5.4 Passage Retrieval . 36 2.5.5 Answer Extraction . 37 2.5.6 Question Answering . 37 3 Geographical Information Retrieval - State-of-the-art 41 3.1 Geographical Information Retrieval . 41 3.2 GIR Issues . 42 3.3 GIR Approaches . 43 3 4 CONTENTS 3.3.1 Topic Processing and Collection Processing . 43 3.3.2 Document Retrieval . 46 3.3.3 Information Retrieval . 47 3.4 Geographical Information Retrieval Evaluations . 48 3.4.1 GeoCLEF . 48 3.4.2 GIR Systems at GeoCLEF Evaluations . 49 3.5 IR Evaluation Measures . 50 3.5.1 Precision . 50 3.5.2 Recall . 50 3.5.3 Fall-Out . 50 3.5.4 F1-measure . 51 3.5.5 Mean Average Precision . 51 4 Geographical Information Resolution - State-of-the-art 53 4.1 Geographical Gazetteers . 54 4.2 Toponym Resolution . 55 4.2.1 Toponym Resolution Architectures . 57 4.3 Evaluation Metrics . 58 4.3.1 Precision . 58 4.3.2 Coverage . 58 4.3.3 Toponym Score . 58 5 TALP-QA Question Answering Approach 59 5.1 TALP-QA Question Answering Approach . 59 5.2 TALP-QA Question Answering Architecture . 59 5.2.1 Collection Pre-processing . 59 5.2.2 Question Processing . 61 5.2.3 Passage Retrieval . 69 5.2.4 Factoid Answer Extraction . 69 5.3 Evaluation and Results at CLEF 2004 . 70 5.4 Evaluation and Results at CLEF 2005 . 73 5.5 Evaluation and Results at TREC 2004 . 77 5.6 Evaluation and Results at TREC 2005 . 79 6 GeoTALP-QA Geographical Question Answering Approach 83 6.1 System Description . 83 6.1.1 Additional Knowledge Sources . 84 6.1.2 Language-Dependent Processing Tools . 85 6.1.3 Question Processing . 85 6.1.4 Passage Retrieval . 87 6.1.5 Answer Extraction . 88 6.2 Resources for Scope-Based Experiments . 89 6.2.1 Language and Scope Based Geographical Question Corpus . 89 CONTENTS 5 6.2.2 Document Collection for ODQA Passage Retrieval . 90 6.2.3 Geographical Scope-Based Resources . 90 6.3 Experiments . 91 6.4 Results . 91 7 TALP-GeoIR Geographical IR Approach 95 7.1 GeoCLEF 2005 System Description . 96 7.1.1 Overview . 96 7.1.2 Collection Pre-processing . 96 7.1.3 Topic Analysis . 97 7.1.4 Document Retrieval . 99 7.1.5 Document Ranking . 100 7.2 GeoCLEF 2006 System Description . 101 7.2.1 Collection Processing . 101 7.2.2 Topic Analysis . 102 7.2.3 Geographical Document Retrieval with Lucene . 103 7.2.4 Document Retrieval using the JIRS Passage Retriever . 104 7.2.5 Document Ranking . 104 7.3 Experiments and Results at GeoCLEF 2005 . 104 7.4 Experiments and Results at GeoCLEF 2006 . 106 8 Geographical Named Entity Subclassification 109 8.1 Inductive Logic Programming Approach . 110 8.1.1 Knowledge Acquisition . 110 8.1.2 Learning Methodology . 110 8.1.3 Experiments . 112 8.1.4 Results . 113 8.2 SVM Approach for GNES . 116 8.2.1 Machine Learning . 119 8.2.2 Experiments . 119 8.2.3 Results . 121 9 Work Plan 125 9.1 Thesis Project Scheduling . 126 10 Related Publications 127 10.1 Geographical Question Answering . 127 10.2 Geographical Information Retrieval . 127 10.3 Open Domain Question Answering . 128 10.4 Multidocument Summarization . 128 10.5 Word Sense Disambiguation . 129 10.6 Named Entity Recognition and Classification . 129 Bibliography 131 6 CONTENTS Chapter 1 Introduction Information Retrieval and Question Answering techniques could be defined as algorithms that help the user to satisfy an information need. Question Answering (QA) is the task of, given a question expressed in Natural Language (NL), retrieving its correct answer (a single item, a text snippet,...) from closed collections or the Web. This task could be considered a step beyond Information Re- trieval (IR) which consists in searching information in documents, documents themselves, or meta- data which describe documents. IR systems retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible (Baeza-Yates & Ribeiro-Neto, 1999). Usually the QA task consists in solve factoid questions. Factoid questions are questions that seek short fact-based answers like entities, organizations, persons, dates,. (e.g. What is the capital of France?, Who is the President of the United States?, Which is the color of the sky?). Question Answering and Information Retrieval need a set of Natural Language Processing (NLP) algorithms to perform a comprehension of a user textual request and the textual documents involved in the search. NLP techniques process electronic texts and analyze them in order to provide lexical, syntactic, semantic, and/or discourse information about the text. Major Internet search engines (e.g. Google, Yahoo, Lycos, etc.) are mainly dealing with Open Domain Information Retrieval and Question Answering. Due to the difficulty to treat specific do- main queries with an open domain technology, in the last years has emerged a growing community of NLP researchers that explore the focus on Restricted-Domains such as genomics (in several work- shops and international evaluations (Hersh et al., 2006)), geography (in Geographical Information Retrieval (GIR) workshops and evaluations (Gey et al., 2005)), laws (e.g. Legal track in TREC.