
Finding Answers to Definition Questions on the Web

Alejandro G. Figueroa A.
Department of Computational Linguistics and Phonetics, University of Saarland
DFKI - Language Technology Lab

A dissertation submitted to the Philosophy Faculty of Saarland University in partial fulfilment of the requirements for the degree of Doctor of Philosophy

Dissertation zur Erlangung des akademischen Grades eines Doktors der Philosophie der Philosophischen Fakultäten der Universität des Saarlandes vorgelegt von Alejandro G. Figueroa A.

July, 2010 Doktorvater - Faculty Advisor: PD Dr. Günter Neumann Zweitbetreuer - Faculty Co-Advisor: Prof. Dr. Dietrich Klakow

Keywords: Question Answering, Definition Questions, Language Models, Multilinguality, Feature Analysis, Dependency Paths, Chunking, Semantic Classes, Context Models, Definition Questions in Spanish, Maximum Entropy Models, Web Search.

ACM categories: Information Search and Retrieval - Content Search and Analysis (H.3.3), Artificial Intelligence - Natural Language Processing (I.2.7)

Copyright © 2010 by Alejandro Figueroa

"The copyright of this thesis rests with the author. This work can be widely used for any research purpose without the author's prior consent. Any commercial use should be with the author's prior written consent, and information derived from it should be acknowledged."

to Adonai

Acknowledgements

First of all, I am very thankful to all the reviewers of my thesis; in particular, to my advisor Guenter Neumann and Prof. Dr. Klakow. I additionally owe a great debt of gratitude to all the reviewers who anonymously and unknowingly contributed interesting comments on my work, and thus helped me refine it. Many thanks also to all the people who had interesting talks about definition question answering with me during these three years, including Cesar de Pablo and Eduard Hovy. A special mention goes to Mihai Surdeanu, who supervised the work presented in chapter seven. In this respect, I am also profoundly indebted to Ricardo Baeza-Yates, who gave me the opportunity of visiting Yahoo! Research for three months as an intern and of starting this work. I must sincerely thank the people who made my doctoral study possible in economic terms: Guenter Neumann, Michael Moossen, Thierry Declerk and Stephan Busemann. In a special manner, the last three assisted me in the most difficult part of this project, that is, when the forces of evil tried to prematurely stop it. Last but not least, I enjoyed all the good talks that I had with Sergio Roa, Hendrik Zender and Dennis Stachowicz. They made the stay of a foreigner in the DFKI LT-Lab nicer. In the same way, I must highlight all the aid given by Hendrik in dealing with the German system. Finally, I want to express a very deep gratitude to Laurel Loch for her proofreading.


Contents

Abstract

1 What is a definition?
  1.1 Introduction
  1.2 Definitions in Natural Language Texts across the Web
  1.3 Archetypal Modules of a Definition QA System
  1.4 Definition Questions in TREC and CLEF
  1.5 Types of Definition Questions
  1.6 What is a Definition?
    1.6.1 Types of Definitions
    1.6.2 Definitions in TREC
    1.6.3 Length of Definitions
  1.7 Evaluation Metrics
  1.8 Conclusions

2 Crawling the Web for Definitions
  2.1 Introduction
  2.2 Definition Relational Databases
  2.3 Using Specific Resources
  2.4 Finding Additional Resources through Specific Task Cues
  2.5 Using Lexico-Syntactic Constructs
    2.5.1 The Grammatical Number of the Definiendum Guessed Beforehand?
  2.6 Enriching Query Rewriting with Knowledge
    2.6.1 Alias Resolution & Misspellings
    2.6.2 Definition Focused Search
    2.6.3 Experiments
  2.7 Searching in other Languages: Spanish
  2.8 Future Trends
  2.9 Conclusions

3 What Does this Piece of Text Talk about?
  3.1 Introduction
  3.2 Scatter Matches of the Definiendum
  3.3 Topic Shift in Descriptive Sentences
  3.4 Strategies to Match the Definiendum when Aligning Definition Patterns
  3.5 Co-reference & the Next Sentence
  3.6 Conclusions

4 Heuristics for Definition Question Answering used by TREC systems
  4.1 Introduction
  4.2 Definition Patterns
  4.3 Knowledge Bases
  4.4 Definition Markers: Indicators of Potential Descriptive Knowledge
  4.5 Negative Evidence: Markers of Potential Non-Descriptions
  4.6 Triplets Redundancy
  4.7 Combining Positive and Negative Evidence
  4.8 Centroid Vector
  4.9 Soft Pattern Matching
  4.10 Ranking Patterns in Concert with the TREC Gold Standard
  4.11 Combining Trigram Language Models with TF-IDF
  4.12 Web Frequency Counts
  4.13 Part-of-Speech and Entity Patterns
  4.14 Phrases, Head Words and Local Terms Statistics
  4.15 Predicates as Nuggets and Page Layout as Ranking Attribute
  4.16 Propositions, Relations and Definiendum Profile
  4.17 The Definition Database and the BLEU Metric
  4.18 Conclusions

5 Extracting Answers to Multilingual Definition Questions from Web Snippets
  5.1 Introduction
  5.2 Multi-Linguality, Web Snippets and Redundancy
    5.2.1 Snippet Context Miner
    5.2.2 Sense Disambiguator
    5.2.3 Definition Ranker
    5.2.4 Experiments and Results
  5.3 Web Snippets and Mining Multilingual Wikipedia Resources
    5.3.1 Learning Templates and Tuples from Wikipedia
    5.3.2 Definition Tuples Repository
    5.3.3 Ranking Definitions
    5.3.4 Experiments
    5.3.5 An extra try for Spanish: Extracting tuples from Dependency Trees
  5.4 Conclusions

6 Using Language Models for Ranking Answers to Definition Questions
  6.1 Introduction
  6.2 Language Models and Dependency Relations
  6.3 Unigrams, Bigrams or Bi-terms
  6.4 Topic and Definition Models
  6.5 Contextual Models and Descriptive Dependency Paths
    6.5.1 Building a Treebank of Contextual Definitions
    6.5.2 Learning Contextual Language Models
    6.5.3 Extracting Candidate Answers
    6.5.4 Experiments and Results
    6.5.5 Expanding the Treebank of Descriptive Sentences
    6.5.6 Adding Part-of-Speech Tags Knowledge
    6.5.7 Improving Definition Extraction using Contextual Entities
    6.5.8 Relatedness/Similarities Between Contexts
    6.5.9 Some Semantic Relations Inside Contexts
    6.5.10 Projecting Answers into the AQUAINT Corpus
  6.6 Conclusions and Further Work

7 Discriminant Learning for Ranking Definitions
  7.1 Introduction
  7.2 Ranking Single-Snippet Answers
  7.3 Undecided/Indifferent Labels
  7.4 Ranking Copular Definitions in Dutch
  7.5 Answering Definitions in TREC
  7.6 Ranking Pattern-Independent Descriptive Sentences
    7.6.1 Corpus Acquisition
    7.6.2 Sentence Level Features
    7.6.3 Testing Sets and Baseline
    7.6.4 Results and Discussion
    7.6.5 Features Based on Dependency Trees
    7.6.6 Results: Features Extracted from Lexicalised Dependency Graphs
  7.7 Conclusions and Further Work

Glossary

List of Acronyms

Zusammenfassung

Frage-Antwort-Systeme sind im Wesentlichen dafür konzipiert, von Benutzern in natürlicher Sprache gestellte Anfragen automatisiert zu beantworten. Der erste Schritt im Beantwortungsprozess ist die Analyse der Anfrage, deren Ziel es ist, die Anfrage entsprechend einer Menge von vordefinierten Typen zu klassifizieren. Traditionell umfassen diese: Faktoid, Definition und Liste. Danach wählen die Systeme die Antwortmethode entsprechend der in dieser frühen Phase erkannten Klasse. Kurz gesagt konzentriert sich diese Arbeit ausschließlich auf Strategien zur Lösung von Fragen nach Definitionen (z.B. „Wer ist Ben Bernanke?“). Diese Art von Anfrage ist in den letzten Jahren besonders interessant geworden, weil sie in beachtlicher Zahl bei Suchmaschinen eingeht.

Die meisten Fortschritte in Bezug auf die Beantwortung von Fragen nach Definitionen wurden unter dem Dach der Text REtrieval Conference (TREC) gemacht. Das ist, genauer gesagt, ein Framework zum Testen von Systemen, die mit einer Auswahl von Zeitungsartikeln arbeiten. Daher zielt Kapitel eins auf eine Beschreibung dieses Rahmenwerks ab, zusammen mit einer Darstellung weiterer einführender Aspekte der Beantwortung von Definitionsanfragen. Diese schließen u.a. ein: (a) wie Definitionsanfragen von Personen gestellt werden; (b) die unterschiedlichen Begriffe von Definition und folglich auch von Antworten; und (c) die unterschiedlichen Metriken, die zur Bewertung von Systemen genutzt werden.

Seit Anbeginn von TREC haben Systeme vielfältige Ansätze, Antworten zu entdecken, auf die Probe gestellt und dabei eine Reihe von zentralen Aspekten dieses Problems beleuchtet. Aus diesem Grund behandelt Kapitel vier eine Auswahl einiger bekannter TREC-Systeme. Diese Auswahl zielt nicht auf Vollständigkeit ab, sondern darauf, die wesentlichen Merkmale dieser Systeme hervorzuheben. Zum größten Teil nutzen die Systeme Wissensbasen (wie z.B. Wikipedia), um Beschreibungen des zu definierenden Konzeptes (auch als Definiendum bezeichnet) zu erhalten. Diese Beschreibungen werden danach auf eine Reihe von möglichen Antworten projiziert, um auf diese Art die richtige Antwort zu ermitteln. Anders ausgedrückt nehmen diese Wissensbasen die Funktion von annotierten Ressourcen ein, wobei die meisten Systeme versuchen, diejenigen Antwortkandidaten in einer Sammlung von Zeitungsartikeln zu finden, die diesen Beschreibungen ähnlicher sind.

Den Grundpfeiler dieser Arbeit bildet die Annahme, dass es plausibel ist, ohne annotierte Ressourcen konkurrenzfähige, und hoffentlich bessere, Systeme zu entwickeln. Obwohl dieses deskriptive Wissen hilfreich ist, basieren diese Ansätze nach Überzeugung des Autors auf zwei falschen Annahmen:

1. Es ist zweifelhaft, ob die Bedeutungen oder Kontexte, auf die sich das Definiendum in Wissensbasen bezieht, dieselben sind wie die der Instanzen in der Reihe der Antwortkandidaten. Darüber hinaus erstreckt sich diese Beobachtung auch auf die Tatsache, dass nicht alle Beschreibungen innerhalb der Gruppe der mutmaßlichen Antworten notwendigerweise von Wissensbasen abgedeckt werden, auch wenn sie sich auf dieselben Bedeutungen und Kontexte beziehen.

2. Eine effiziente Projektionsstrategie zu finden bedeutet nicht notwendigerweise auch ein gutes Verfahren zur Feststellung von deskriptivem Wissen, denn es verschiebt die Zielsetzung der Aufgabe hin zu einem „mehr wie diese Menge“, statt zu analysieren, ob jeder Kandidat den Charakteristika einer Beschreibung entspricht oder nicht. Anders ausgedrückt ist die Abdeckung, die durch Wissensbasen für ein spezifisches Definiendum gegeben ist, nicht umfassend genug, um alle Charakteristika, die für seine Beschreibungen kennzeichnend sind, zu erlernen, so dass die Systeme in der Lage sind, alle Antworten innerhalb der Kandidatenmenge zu identifizieren. Eine konventionelle Projektionsstrategie kann aus einem anderen Blickwinkel als Prozedur zum Finden lexikalischer Analogien betrachtet werden.

Insgesamt untersucht diese Arbeit Modelle, die Strategien dieser Art in Verbindung mit annotierten Ressourcen und Projektion außer Acht lassen. Tatsächlich ist es die Überzeugung des Autors, dass eine robuste Technik dieser Art mit traditionellen Methoden der Projektion integriert werden und so eine Leistungssteigerung ermöglichen kann. Die größeren Beiträge dieser Arbeit werden in den Kapiteln fünf, sechs und sieben präsentiert. Es gibt mehrere Wege, diese Struktur zu verstehen. Kapitel fünf, beispielsweise, präsentiert einen allgemeinen Rahmen für die Beantwortung von Fragen nach Definitionen in mehreren Sprachen. Das primäre Ziel dieser Studie ist es, ein leichtgewichtiges System zur Beantwortung von Fragen nach Definitionen zu entwickeln, das mit Web-Snippets und zwei Sprachen arbeitet: Englisch und Spanisch. Die Grundidee ist, von Web-Snippets als Quelle deskriptiver Information in mehreren Sprachen zu profitieren, wobei der hohe Grad an Sprachunabhängigkeit dadurch erreicht wird, dass so wenig linguistisches Wissen wie möglich berücksichtigt wird. Genauer gesagt berücksichtigt dieses System statistische Methoden und eine Liste von Stop-Wörtern sowie eine Reihe von sprach-spezifischen Definitionsmustern.

Im Einzelnen teilt sich Kapitel fünf in zwei spezifischere Studien auf. Die erste Studie zielt im Grunde darauf ab, aus Redundanz für die Ermittlung von Antworten Kapital zu schlagen (z.B. Worthäufigkeiten über verschiedene Antwortkandidaten hinweg). Obwohl eine solche Eigenschaft unter TREC-Systemen weit verbreitet ist, legt diese Studie den Schwerpunkt auf die Auswirkungen auf verschiedene Sprachen und auf ihre Vorteile bei der Anwendung auf Web-Snippets statt Zeitungsartikeln. Eine weitere Motivation dahinter, Web-Snippets ins Auge zu fassen, ist die Hoffnung, Systeme zu studieren, die mit heterogenen Corpora arbeiten, ohne es nötig zu machen, vollständige Dokumente herunterzuladen. Im Internet, beispielsweise, steigt die Zahl verschiedener Bedeutungen für das Definiendum deutlich an, was es notwendig macht, eine Technik zur Unterscheidung von Bedeutungen in Betracht zu ziehen. Zu diesem Zweck nutzt das System, das in diesem Kapitel vorgestellt wird, einen unüberwachten Ansatz, der auf der Latent Semantic Analysis basiert. Auch wenn das Ergebnis dieser Studie zeigt, dass die Unterscheidung von Bedeutungen allein anhand von Web-Snippets schwer zu erreichen ist, so lässt es doch auch erkennen, dass sie eine fruchtbare Quelle deskriptiven Wissens darstellen und dass ihre Extraktion spannende Herausforderungen bereithält.

Der zweite Teil erweitert diese erste Studie durch die Nutzung mehrsprachiger Wissensbasen (d.h. Wikipedia), um die möglichen Antworten in eine Rangfolge einzureihen. Allgemein ausgedrückt profitiert sie von Wortassoziationsnormen, die von Sätzen gelernt werden, die über Wikipedia hinweg zu Definitionsmustern passen. Um an der Prämisse festzuhalten, keine Artikel mit Bezug auf ein spezifisches Definiendum zu nutzen, werden diese Sätze anonymisiert, indem der Begriff mit einem Platzhalter ersetzt wird, und die Wortnormen werden von allen Sätzen der Trainingsmenge gelernt, statt nur von dem Wikipedia-Artikel, der sich auf das spezielle Definiendum bezieht. Die Ergebnisse dieser Studie zeigen, dass die Nutzung dieser Ressourcen ebenfalls vorteilhaft sein kann; speziell zeigen sie auf, dass Wortassoziationsnormen eine kosteneffiziente Lösung darstellen.
Allerdings nehmen die Corpusgrößen für andere Sprachen als Englisch deutlich ab, was auf deren Unzulänglichkeit für die Konstruktion von Modellen für andere Sprachen hinweist.

Kapitel sechs, weiter hinten, wird spezieller und handelt ausschließlich von der Einordnung von Antwortkandidaten in englischer Sprache in eine Rangfolge. Der Grund dafür, hier Spanisch außer Acht zu lassen, ist die geringe beobachtete Dichte, sowohl in Bezug auf redundante Information im Internet als auch in Bezug auf Trainingsmaterial, das von Wikipedia erworben wurde. Diese geringe Dichte ist deutlich stärker ausgeprägt als im Fall der englischen Sprache und erschwert das Erlernen mächtiger statistischer Modelle. Dieses Kapitel präsentiert einen neuartigen Weg, Definitionen zu modellieren, der in n-gram-Sprachmodellen verankert ist, die aus der lexikalisierten Darstellung des Abhängigkeitsbaumes des in Kapitel fünf erworbenen Trainingsmaterials gelernt wurden. Diese Modelle sind kontextuell in dem Sinne, dass sie in Bezug auf die Semantik des Satzes konstruiert werden. Im Allgemeinen können diese Semantiken als unterschiedliche Typen von Definienda betrachtet werden (z.B. Fußballer, Sprache, Künstler, Krankheit und Baum). Diese Studie untersucht zusätzlich die Auswirkungen einiger Eigenschaften (nämlich benannter Entitäten und Part-of-speech-Tags) auf diese Kontextmodelle. Insgesamt sind die Ergebnisse, die mit diesem Ansatz erhalten wurden, ermutigend, insbesondere in Bezug auf eine Steigerung der Genauigkeit des Musterabgleichs. Indes wurde experimentell beobachtet, dass ein Trainingscorpus, das nur Positivbeispiele (Beschreibungen) enthält, höchstwahrscheinlich nicht ausreicht, um perfekte Genauigkeit zu erreichen, da diese Modelle die Charakteristika nicht ableiten können, die für nicht-deskriptiven Inhalt kennzeichnend sind. Für die weitere Arbeit ermöglichen es Kontextmodelle zu untersuchen, wie unterschiedliche Kontexte in Übereinstimmung mit deren semantischen Ähnlichkeiten verschmolzen (geglättet) werden können, um die Leistung zu verstärken.

Kapitel sieben wird anschließend sogar noch spezieller und sucht nach der Menge von Eigenschaften, die dabei helfen kann, Beschreibungen von anderen Textarten zu unterscheiden. Dabei sollte beachtet werden, dass diese Studie alle Arten von Beschreibungen berücksichtigt, einschließlich derer, die Definitionsmustern nicht genügen. Dazu werden Maximum-Entropy-Modelle konstruiert, die auf einem automatisch akquirierten Corpus von großem Umfang aufsetzen, der Beschreibungen von Wikipedia und Nicht-Beschreibungen aus dem Internet umfasst. Grob gesagt werden unterschiedliche Modelle konstruiert, um die Auswirkungen verschiedenerlei Merkmale zu untersuchen: Oberflächenmerkmale, benannte Entitäten, Part-of-speech-Tags, Chunks und, noch interessanter, von den lexikalisierten Abhängigkeitsgraphen abgeleitete Attribute. Im Allgemeinen bestätigen die Ergebnisse die Effizienz von Merkmalen, die Abhängigkeitsgraphen entnommen sind, insbesondere Wurzelknoten und n-gram-Pfaden. Experimente, die mit verschiedenen Testmengen diverser Charakteristika durchgeführt wurden, legen nahe, dass auch angenommen werden kann, dass sich Attribute finden lassen, die sich auf andere Corpora übertragen lassen.

Es gibt zwei weitere Kapitel: zwei und drei. Ersteres untersucht unterschiedliche Strategien, das Netz nach deskriptivem Wissen zu durchforsten. Im Wesentlichen analysiert dieses Kapitel einige Strategien, die darauf abzielen, die Trefferquote (den Recall) deskriptiver Sätze über Web-Snippets hinweg zu verstärken, insbesondere solcher Sätze, die weit verbreiteten Definitionsmustern genügen. Diese Studie ist eine Nebenstudie, die jedoch für den Kern dieser Arbeit dienlich ist, da es für auf das Internet gerichtete Systeme notwendig ist, effektive Suchstrategien zu entwickeln. Im Gegensatz dazu hat Kapitel drei zwei Ziele: (a) einige Komponenten vorzustellen, die in den Strategien benutzt werden, die in den letzten drei Kapiteln dargestellt werden, was dabei hilft, sich auf die Schlüsselaspekte der Rankingstrategien zu konzentrieren und so die relevanten Aspekte der Ansätze in diesen drei Kapiteln klar darzulegen; (b) einige Charakteristika auszuarbeiten, die die Trennung wirklicher Antwortkandidaten von irreführenden schwierig machen, insbesondere unter Sätzen, die Definitionsmustern entsprechen. Kapitel drei ist hilfreich, um einen Teil der linguistischen Phänomene zu verstehen, von denen die späteren Kapitel handeln.

Als abschließende Bemerkung zur Gliederung dieser Arbeit: Da es eine Unzahl an Methodologien gibt, beginnen die Kapitel sechs und sieben damit, die mit den jeweiligen Strategien verwandten Arbeiten zu analysieren. Der Hauptbeitrag des jeweiligen Kapitels beginnt in den Abschnitten 6.5 bzw. 7.6. Diese beiden Abschnitte beginnen mit einer Diskussion und einem Vergleich zwischen den vorgeschlagenen Methoden und den verwandten Arbeiten, wie sie in den entsprechenden vorhergehenden Abschnitten vorgestellt wurden. Dieser Aufbau zielt auf eine Vereinfachung der Kontextualisierung der vorgeschlagenen Ansätze ab, da es unterschiedliche Frage-Antwort-Systeme mit vielfältigen Charakteristika gibt.


Abstract

Fundamentally, question answering systems are designed for automatically responding to queries posed by users in natural language. The first step in the answering process is query analysis, and its goal is to classify the query in congruence with a set of pre-specified types. Traditionally, these classes include: factoid, definition, and list. Systems thereafter choose the answering method in concert with the class recognised in this early phase. In short, this thesis focuses exclusively on strategies to tackle definition questions (e.g., "Who is Ben Bernanke?"). This sort of question has become especially interesting in recent years, due to the significant number of such queries submitted to search engines.

Most advances in definition question answering have been made under the umbrella of the Text REtrieval Conference (TREC). This is, more precisely, a framework for testing systems operating on a collection of news articles. Thus, the objective of chapter one is to describe this framework along with presenting additional introductory aspects of definition question answering, including: (a) how definition questions are prompted by individuals; (b) the different conceptions of definition, and thus of answers; and (c) the various metrics exploited for assessing systems.

Since the inception of TREC, systems have put manifold approaches to discover answers to the test, throwing some light onto several key aspects of this problem. On this account, chapter four goes over a selection of some notable TREC systems. This selection is not aimed at completeness, but rather at highlighting the leading features of these systems. For the most part, systems benefit from knowledge bases (e.g., Wikipedia) for obtaining descriptions about the concept being defined (a.k.a. definiendum). These descriptions are thereafter projected onto the array of candidate answers as a means of discerning the correct answer. In other words, these knowledge bases play the role of annotated resources, and most systems attempt to find the answer candidates across the collection of news articles that are more similar to these descriptions.

The cornerstone of this thesis is the assumption that it is plausible to devise competitive, and hopefully better, systems without the necessity of annotated resources. Although this descriptive knowledge is helpful, it is the belief of the author that these approaches are built on two wrong premises:

1. It is arguable whether the senses or contexts related to the definiendum across knowledge bases are the same as those of the instances across the array of answer candidates. This observation also extends to the fact that not all descriptions within the group of putative answers are necessarily covered by knowledge bases, even though they might refer to the same contexts or senses.

2. Finding an efficient projection strategy does not necessarily entail a good procedure for discerning descriptive knowledge, because it shifts the goal of the task to a "more like this set" instead of analysing whether or not each candidate bears the characteristics of a description. In other words, the coverage given by knowledge bases for a specific definiendum is not wide enough to learn all the characteristics that typify its descriptions, so that systems are capable of identifying all answers within the set of candidates. From another angle, a conventional projection methodology can be seen as a finder of lexical analogies.

All in all, this thesis investigates models that disregard this kind of annotated resource and projection strategy. In effect, it is the belief of the author that a robust technique of this sort can be integrated with traditional projection methodologies, thereby bringing about an enhancement in performance.

The major contributions of this thesis are presented in chapters five, six and seven. There are several ways of understanding this structure. For example, chapter five presents a general framework for answering definition questions in several languages. The primary goal of this study is to design a lightweight definition question answering system operating on web snippets and two languages: English and Spanish. The idea is to utilise web snippets as a source of descriptive information in several languages, and the high degree of language independency is achieved by making allowances for as little linguistic knowledge as possible. To put it more precisely, this system accounts for statistical methods and a list of stop-words, as well as a set of language-dependent definition patterns.

In detail, chapter five branches into two more specific studies. The first study is essentially aimed at capitalising on redundancy for detecting answers (e.g., word frequency counts across answer candidates). Although this type of feature has been widely used by TREC systems, this study focuses on its impact on different languages, and its benefits when applied to web snippets instead of a collection of news documents. An additional motivation behind targeting web snippets is the hope of studying systems working on more heterogeneous corpora, without incurring the need of downloading full documents. For instance, on the Internet, the number of distinct senses for the definiendum considerably increases, ergo making it necessary to consider a sense discrimination technique. For this purpose, the system presented in this chapter takes advantage of an unsupervised approach premised on Latent Semantic Analysis. Although the outcome of this study shows that sense discrimination is hard to achieve when operating solely on web snippets, it also reveals that they are a fruitful source of descriptive knowledge, and that their extraction poses exciting challenges.

The second branch extends this first study by exploiting multilingual knowledge bases (i.e., Wikipedia) for ranking putative answers. Generally speaking, it makes use of word association norms deduced from sentences that align definition patterns across Wikipedia. In order to adhere to the premise of not profiting from articles related to a specific definiendum, these sentences are anonymised by replacing the concept with a placeholder, and the word norms are learnt from all training sentences, instead of only from the Wikipedia page about the particular definiendum. The results of this study signify that this use of these resources can also be beneficial; in particular, they reveal that word association norms are a cost-efficient solution. However, the size of the corpus markedly decreases for languages different from English, thus indicating its insufficiency for designing models for languages other than English.

Later, chapter six gets more specific and deals only with the ranking of answer candidates in English. The reason for abandoning the idea of Spanish is the sparseness observed across both the redundancy from the Internet and the training material mined from Wikipedia. This sparseness is considerably greater than in the case of English, and it makes learning powerful statistical models more difficult. This chapter presents a novel way of modelling definitions grounded on n-gram language models inferred from the lexicalised dependency tree representation of the training material acquired in the study of chapter five.

These models are contextual in the sense that they are built in relation to the semantics of the sentence. By and large, these semantics can be perceived as the distinct types of definienda (e.g., footballer, language, artist, disease, and tree). This study, in addition, investigates the effect of some features on these context models (i.e., named entities and part-of-speech tags). Overall, the results obtained by this approach are encouraging, in particular in terms of increasing the accuracy of the pattern matching. However, in all likelihood, it was experimentally observed that a training corpus comprising only positive examples (descriptions) is not enough to achieve perfect accuracy, because these models cannot deduce the characteristics that typify non-descriptive content. More essentially, as future work, context models give the chance to cogitate on the idea of amalgamating (smoothing) various contexts in agreement with their semantic similarities in order to ameliorate the performance.

Subsequently, chapter seven gets even more specific and searches for the set of properties that can aid in discriminating descriptions from other kinds of texts. Note that this study regards all kinds of descriptions, including those mismatching definition patterns. In so doing, Maximum Entropy models are constructed on top of an automatically acquired large-scale training corpus, which encompasses descriptions from Wikipedia and non-descriptions from the Internet. Roughly speaking, different models are constructed as a means of studying the impact of assorted properties: surface features, named entities, part-of-speech tags, chunks, and, more interestingly, attributes derived from the lexicalised dependency graphs. In general, results corroborate the efficiency of features taken from dependency graphs, especially the root node and n-gram paths. Experiments conducted on testing sets of varied characteristics suggest that it is also plausible to find attributes that can port to other corpora.

There are two extra chapters: two and three. The former examines different strategies to trawl the Web for descriptive knowledge. In essence, this chapter touches on several strategies geared towards boosting the recall of descriptive sentences across web snippets, especially sentences that align widespread definition patterns. This is a side, but instrumental, study to the core of this thesis, as it is necessary for systems targeted at the Internet to develop effective crawling techniques. On the contrary, chapter three has two goals: (a) presenting some components used by the strategies outlined in the last three chapters, this way helping to focus on key aspects of the ranking methodologies, and hence to clearly present the relevant aspects of the approaches laid out in these three chapters; and (b) fleshing out some characteristics that make separating the genuine from the misleading answer candidates difficult; particularly, across sentences matching definition patterns. Chapter three is helpful for understanding part of the linguistic phenomena that the posterior chapters deal with.

On a final note about the organisation of this thesis, since there is a myriad of techniques, chapters six and seven start by dissecting the related work closest to each strategy. The main contribution of each chapter begins at sections 6.5 and 7.6, respectively. These two sections start with a discussion and comparison between the proposed methods and the related work presented in their corresponding preceding sections. This organisation is directed at facilitating the contextualisation of the proposed approaches, as there are different question answering systems with manifold characteristics.

Chapter 1

What is a definition?

"an explanation of the meaning of a word or phrase, especially in a dictionary; the act of stating the meaning of words and phrases." (Oxford Dictionary)

"The meaning of a word is what is explained by the explanation of its meaning." I.e.: if you want to understand the use of the word meaning, look for what are called explanations of meanings. (Philosophical Investigations I §560) [Wittgenstein, 1953].

1.1 Introduction

The continuous growth and diversification of online text information on the Web represents a continuing challenge, affecting both the designers and the users of information processing systems. The development of systems that assist users in finding relevant pieces of information across large text collections is a key task, because they transform a static set of stored text files into accessible and searchable knowledge. Situated at the frontier of Natural Language Processing (NLP) and modern Information Retrieval (IR), open-domain Question Answering (QA) is an appealing alternative to the retrieval of full-length documents. Users of QA systems specify their information needs in the form of natural-language questions, ergo eliminating any artificial constraints sometimes imposed by a special input syntax (e.g., boolean operators). The QA system satisfies the need of the user by returning brief answer strings extracted from the text collection. To be more precise, QA systems capitalise on the fact that answers are often concentrated in small fragments of text documents. It is up to the QA system to analyse the content of full-length documents and to identify these small, pertinent text fragments.

One prominent type of query prompted by users is definition questions (e.g., "What is ..?" and "Who is ..?"). The motivation behind studying definition queries rather than other kinds lies in the increasing number of them actually submitted on the Internet. This sort of query enquires of the system about a topic or concept (a.k.a. definiendum); that is, definition QA systems must search for explanations of meanings of the definiendum across the target collection. As for answers, definition QA systems do not solely account for these direct descriptions, but they also usually embrace an array of important and/or factual document snippets (a.k.a. nuggets) about the definiendum. However, in order to provide enough context and ensure readability, definition QA systems produce a set of sentences carrying these nuggets. The standard strategy for answering these questions using the Web or a textual corpus involves a combination of IR and Named Entity Recogniser (NER) techniques [Voorhees, 2003].


Until recently, definition queries remained a largely unexplored area for QA. Standard factoid QA technology, designed to extract single answers, cannot be straightforwardly applied to this task. The aim of this chapter is to present the fundamentals of definition QA systems. It is organised as follows: the next section deals at greater length with the trade-off between coverage and dependability of distinct sources of answers, section 1.3 presents the archetypal components of a definition QA system, section 1.4 compares two distinct international assessment frameworks, section 1.5 introduces issues related to query analysis, section 1.6 describes some characteristics of definitions in detail, section 1.7 fleshes out diverse evaluation metrics, and section 1.8 offers a conclusion to this chapter.

1.2 Definitions in Natural Language Texts across the Web

Strictly speaking, any document can be a potential provider of answers to definition questions. There are, however, types of documents that are richer in definitions than others. For example, a fertile source of descriptions is online dictionaries and encyclopedias, such as Wikipedia and answers.com. Needless to say, this sort of resource is typically exploited by definition QA systems, despite their uneven reliability and their frequent lack of considerable coverage.

To be more specific, on-line commercial dictionaries like the Oxford Dictionary offer almost unequivocal definitions in terms of reliability, whereas on-line encyclopedias (e.g., Wikipedia) supply more disputable and/or unreliable pieces of information. On the other hand, it is equally important to underscore that the coverage of these authoritative resources varies markedly from one definiendum to the other. More precisely, this variation can consist in the absence of entries in these knowledge bases, or in terms of the amount and class of information they convey. The reader can inspect a list of the most widely used Knowledge Bases (KB) in table 2.4 in section 2.1.

Another rich source of descriptions is homepages. This kind of resource is very important as it, once in a while, yields biographical information about the owner of the page, who is the potential definiendum. Above all, homepages are a useful resource when tackling persons or organisations as definienda. Without a shadow of doubt, these resources also suffer from the same problems: reliability and coverage. Specifically, the owner can alter or omit the publication of interesting facts that are inconvenient for their career or business, which can be of special interest to the user of a definition QA system. A good example of biographical information found in homepages is shown in figure 1.1.

Figure 1.1: Some pieces of biographical information about "Chuck Berry" excerpted from www.chuckberry.com/about/bio.htm (as of October 2009).

By extension, newspapers are also a fruitful source of short definitions, as it is a common practice to provide a brief description when a person or organisation is mentioned - usually for the first time - in the text. The next is a delineative surrogate of a news article which renders a succinct and precise description of "Chuck Berry":

Occasionally, news articles can play the role of knowledge bases, because a piece of news can sometimes focus its attention on a special topic (definiendum), giving interesting information about several of its chief aspects. The advantage of newspaper articles over knowledge bases is two-fold: (a) they can supply supplementary titbits and up-to-date information as well as the latest facts; and (b) from the viewpoint of definition QA systems, they expand the coverage as they address some topics that are more unlikely to be found across knowledge bases. Typical cases are pieces of news reporting on investigations into issues such as new virus breakthroughs, medicines or technology advances. The factor, however, that makes newspaper articles less attractive is their inherent bias, unbalance, and propagandist nature, which make them, to a smaller extent, trustworthy.

As a means of broadening the coverage and boosting the reliability of answers, definition QA systems take advantage of the hierarchy returned by the IR engine for selecting the most propitious documents. In general, the required exhaustiveness of the response or the coverage yielded by the already mentioned resources necessitates taking into account full documents as sources of definitions. However, processing full documents typically demands a marked increase in the processing time, ergo definition QA systems prefer analysing web snippets to processing full documents. More precisely, web snippets are the brief surrogates returned by commercial search engines describing the local contexts of the documents that best match the submitted search string. The following two web snippets sketch how descriptive information can also be found across these document surrogates:


Figure 1.2: General architecture of a definition QA system.

With web snippets, it is notable that: (a) these hits vary from one to the other as they privilege different features while ranking documents in concert with a given query, and (b) they can also include excerpts from newspapers and homepages, as depicted by the first illustrative surrogate. But, on the flip side, web snippets are noisier and less reliable than authoritative resources, as most of the hits returned by search engines are aimed at boosting the similarity of their respective documents to the submitted query. The crucial issue here is that this similarity is not necessarily in the best interests of definition QA systems. The bottom line is, search engines are biased towards returning in the top positions of their ranking those hits that are more likely to raise their advertising revenues.

Incidentally, forums and blogs are the diametrical opposite of knowledge bases, because they touch on wide-ranging topics and allow their users to make comments of various natures such as opinions, advertisements, suggestions, descriptions, and general text snippets. This sort of resource demands more effort to separate the wheat from the chaff. But they are, nonetheless, a rich source of explanations of some new and/or more specific definienda, which are barely found in online dictionaries and/or encyclopedias.

To sum up, there is a trade-off between coverage and reliability, in which knowledge bases are the most authoritative, thus trustworthy, but they provide limited coverage. On the contrary, ordinary web pages/snippets yield broad coverage, but they are less reliable.

1.3 Archetypal Modules of a Definition QA System

The architecture of a definition QA system can vary sharply from one system to the other. There are, nonetheless, some modules that are common to most architectures. The following is a list of these components (figure 1.2); a minimal pipeline sketch follows the list:

• Query Analysis is the step in charge of discriminating the kind of question prompted by the user, and in the event of a definition query, it is responsible for identifying the definiendum(s).

• KB Miner extracts from specific resources (e.g., dictionaries and encyclopedias) articles related to the definiendum.

• Web Miner discovers information about the definiendum across the Web.

• IR engine is the interface between the definition QA system and the target collection of documents. It returns the top ranked documents in congruence with a specified similarity measure.

• Answer Candidate Selector is the module that pre-processes the most promising doc- uments (e.g. sentence splitting, and co-reference resolution). This then singles out the most propitious contexts as putative answers.

• Answer Ranker scores answer candidates and selects the definitive array of answers.

• Summariser abridges the set of definitive answers by reducing redundancy and elimi- nating potentially irrelevant pieces of information.
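The interaction of these modules can be summarised as a simple pipeline. The following is a minimal Python sketch of the control flow suggested by figure 1.2; all class and method names are illustrative assumptions rather than components of any concrete system discussed in this thesis.

```python
# Illustrative sketch of the archetypal definition QA pipeline (figure 1.2).
# Every name here is hypothetical; each stage stands for one module listed above.

class DefinitionQAPipeline:
    def __init__(self, query_analyser, kb_miner, web_miner, ir_engine,
                 candidate_selector, answer_ranker, summariser):
        self.query_analyser = query_analyser
        self.kb_miner = kb_miner
        self.web_miner = web_miner
        self.ir_engine = ir_engine
        self.candidate_selector = candidate_selector
        self.answer_ranker = answer_ranker
        self.summariser = summariser

    def answer(self, question):
        # 1. Query Analysis: detect a definition question and its definiendum.
        analysis = self.query_analyser.analyse(question)
        if analysis.question_type != "definition":
            return None
        definiendum = analysis.definiendum

        # 2. Gather descriptive evidence from knowledge bases and the Web.
        kb_articles = self.kb_miner.fetch(definiendum)
        web_evidence = self.web_miner.fetch(definiendum)

        # 3. Retrieve the most promising documents from the target collection.
        documents = self.ir_engine.search(definiendum)

        # 4. Pre-process the documents and single out putative answers.
        candidates = self.candidate_selector.select(documents, definiendum)

        # 5. Score the candidates, optionally exploiting the mined evidence.
        ranked = self.answer_ranker.rank(candidates, kb_articles, web_evidence)

        # 6. Remove redundancy and assemble the final response.
        return self.summariser.summarise(ranked)
```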

1.4 Definition Questions in TREC and CLEF

Fundamentally, there are international assessments established for the purpose of evaluating and comparing distinct strategies devised to automatically answer natural language questions. With specific regard to definition QA systems, evaluations can sharply differ as systems have multiple facets, and accordingly, different assessments can highlight varied aspects of systems (e.g., corpus, language, user requirements and interactions). Basically, the most widely known evaluation forums are the Text REtrieval Conference and the Cross Language Evaluation Forum:

Text REtrieval Conference (TREC) is a workshop hosted by the National Institute of Standards and Technology (NIST). It takes place yearly and supplies the necessary infrastructure to assess IR systems operating on a large collection of documents. More specifically, TREC has built a variety of massive test collections, including the AQUAINT corpus, which is the collection utilised for question answering tasks.

In this forum, QA systems compete for correctly answering the highest possible number of questions that are members of a given test set determined by the organisers. More often than not, this pre-determined test set encompasses a variety of queries, like factoid and definition as well as list questions. In the special case of answering definition questions, QA systems are encouraged to find document snippets (nuggets) across the AQUAINT corpus that render relevant facts and/or information about the definiendum (see, for instance, table 1.1). Definition QA systems must thereafter remove all redundant content by producing a sort of summary of these selected nuggets. In other words, systems have to distinguish duplicates together with snippets that express the same information, and those conveying descriptions already subsumed in other text fragments.

The appropriateness of a nugget as a part of the response is verified in congruence with the opinion of a group of assessors. Hence, nuggets are perceived as a hierarchical gold standard, in which they are deemed to be "vital" when their inclusion in the output is indispensable, whereas interesting, but dispensable, nuggets are labelled as "okay". Evidently, uninteresting text snippets are excluded from this sort of ground truth. Definition QA systems are therefore rated in agreement with their outputs and this pre-defined gold standard. To be more precise, systems are rewarded for returning "vital" nuggets, while not penalised for "okay" nuggets, and punished for the incorporation of any extra text fragment omitted from the pre-determined ground truth.
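As a rough illustration of this reward scheme, the sketch below computes a nugget-based F-measure in the spirit of the TREC definition track (the metrics themselves are dissected in section 1.7). It assumes the customary length allowance of 100 non-whitespace characters per matched vital or okay nugget and a recall computed over vital nuggets only; the function name and constants are assumptions for illustration, not the official scoring script.

```python
def nugget_f_measure(response, matched_vital, matched_okay, total_vital, beta=3.0):
    """Hedged sketch of a TREC-style nugget F-measure for one definition question."""
    # Recall rewards vital nuggets only; okay nuggets neither help nor hurt it.
    recall = matched_vital / total_vital if total_vital else 0.0

    # Length-based precision: an allowance of 100 non-whitespace characters is
    # granted per matched (vital or okay) nugget; any excess length is penalised.
    allowance = 100 * (matched_vital + matched_okay)
    length = len("".join(response.split()))
    precision = 1.0 if length <= allowance else 1.0 - (length - allowance) / length

    # beta > 1 weights recall more heavily than precision.
    if beta ** 2 * precision + recall == 0:
        return 0.0
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
```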

Cross Language Evaluation Forum (CLEF) provides evaluation tracks to test a variety of aspects in relation to cross-language information retrieval systems. A noteworthy feature of this assessment framework is the fact that QA systems target a heterogeneous collection of documents, which comprises news articles and Wikipedia. Another fundamental element of this evaluation is that it consists of various monolingual (non-English) and cross-language sub-tasks. That is to say, questions are given in one language, but the target corpus can be an array of news or Wikipedia documents in a different language. Essentially, these language pairs are picked from more than nine distinct European languages. Like TREC, the QA track takes into consideration several kinds of questions, and it also includes definition queries. However, unlike TREC, an answer to a definition question is seen as a short string that briefly encapsulates the essence of the definiendum (e.g., "President of Spain"). The performance of definition QA systems is basically measured in terms of the number of correctly answered queries relative to the size of the test set of definition questions.

1.5 Types of Definition Questions

To a great extent, definition questions have become especially interesting in recent years, because of the number of them submitted to search engines; namely, about 25% of queries in real search engine logs are requests for definitions [Rose and Levinson, 2004]. To begin with, the first phase in the answering process is distinguishing the type of query (information needed) that the user is demanding (e.g., a factoid, definition or list question). Despite dealing with a wide variety of natural language inputs, there are some distinctive structures utilised by users for indicating definition queries. The following is a list of a few of the most common across query logs (a toy pattern-matching sketch follows this list):

• What does <definiendum> mean?

• Who is/are <definiendum>?

• What is/are <definiendum>?

• What is the meaning of <definiendum>?

• Define <definiendum>
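As a rough illustration of how such surface structures can be recognised during query analysis, the toy sketch below matches the five patterns above with regular expressions and extracts the definiendum slot. The pattern set and function name are merely illustrative assumptions; real query logs are far noisier, as the categorisation below makes clear.

```python
import re

# Toy patterns for the five surface structures listed above; the named group
# "d" captures the definiendum candidate.
DEFINITION_PATTERNS = [
    re.compile(r"^what\s+does\s+(?P<d>.+?)\s+mean\??$", re.I),
    re.compile(r"^who\s+(?:is|are)\s+(?P<d>.+?)\??$", re.I),
    re.compile(r"^what\s+(?:is|are)\s+(?:the\s+meaning\s+of\s+)?(?P<d>.+?)\??$", re.I),
    re.compile(r"^define\s+(?P<d>.+)$", re.I),
]

def detect_definiendum(query):
    """Return the definiendum if the query matches a definition structure, else None."""
    for pattern in DEFINITION_PATTERNS:
        match = pattern.match(query.strip())
        if match:
            return match.group("d").strip(' "?')
    return None

# Example: detect_definiendum("Who is Ben Bernanke?") returns "Ben Bernanke".
```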

Contrary to what these five cases might suggest, the subsequent categorisation of definition queries suffices for showing that this query analysis stage cannot be seen as an easy task. To state it more clearly, the following list dissects some of the main issues relating to query analysis and definition questions:

1. Multiple definienda. In some cases, users want to find definitions of several definienda, thus they utilise the flexibility provided by natural language for putting them together into a sole, but more compact, query. Two delineative structures of this class include: "What does <definiendum 1> and <definiendum 2> mean?" and "What is the definition of/for <definiendum 1> and <definiendum 2>?".

The degree of relation between the distinct definienda is uncertain, hence definition QA systems must normally tackle each of them independently. The user, nevertheless, can sometimes enter two synonyms or orthographical variations of the same underlying definiendum, because he/she is unsure of its correct spelling or what is the more appropriate entry to get the best possible outcomes from the system.

Equally, the connectors "and" and "or" bring about ambiguity to the query analysis. Take for instance the definienda: "Akihito and Michiko", "Tom and Jerry", and "Trance and Dance". In these depictive cases, it is dubious whether the user wants to find information about "Tom" and "Jerry", and/or about the cartoon "Tom and Jerry". Assuredly, this ambiguity occurs whenever several potential definienda are concatenated forming a new valid definiendum. Definitively, in many cases, the most likely option is interpreting the input as a sole definiendum: "The Fool and His Money", "Hatfield and the North", "For Love or Money", and "Deal or No Deal". Frequently, the user is aware of this ambiguity, and therefore provides the definiendum with quotation marks.

2. Misspellings/Ungrammaticality. Occasionally, ungrammatical queries formulated by users can make the question analysis step harder, causing misleading inferences. Two examples of ungrammaticality found across query logs are: "What does <definiendum> means?" and "What do it mean <definiendum>?". Note that not all users that pose questions have mastered the English language.

3. Multiple clauses. Another observable phenomenon across query logs is two-clause questions. This type of input is comprised of a statement and a query, or of two questions. Users sometimes treat QA systems as dialog systems, hence they input colloquial constructions. More often than not, these constructions are geared towards making clearer to the machine what they are looking for. Besides, these constructions can consist of extra phrases directed at boosting the chances of discovering documents satisfying the information need of the user. The following two inputs are representative samples: "What does <definiendum> means or what is it?" and "I need info on <definiendum>. Can you find me some?". The former additionally illustrates that users can make grammatical mistakes when typing these colloquial-styled questions.

4. Indirect requests. In some cases, users implicitly request definitions by asking for the location of internet resources, and/or links thereto, that contain the solicited information:

• Where can I find the/a (good) definition for/of <definiendum>?
• Where can/could/do I find (out) about <definiendum>?
• Where can I find a biography of <definiendum>?
• Where do I find facts about/on <definiendum>?
• Where can I find a/an web site/article for/about/on <definiendum>?
• Where can I find a definitive biography of <definiendum> online?
• Where can I find an explanation of <definiendum 1> and <definiendum 2>?

The last illustrative query also underlines that this class of definition question can be intermixed with previous kinds (e.g., the first type). It is worth highlighting, nonetheless, that the response to this sort of query can be a ranking of links to knowledge bases containing all the required definitions. However, it is crystal clear that many definienda will not be found across knowledge bases, causing the definition QA system to produce answers from web documents as a fallback.

5. Explicitly specified coverage. At times, the user also enters salient hints about the expected length of the response that the definition QA system should return. This extra information is normally, but not necessarily, verbalised in the form of explicit keywords inserted into the query (e.g., "some" and "anything"). This is a potentially decisive factor, as the system could output only the most reliable answers to the user when he/she is soliciting a concise description. This can also involve picking only entries from the most authoritative knowledge bases, and therefore avoiding a more extensive search across a larger array of web documents. In like manner, cues such as "new about" signify that the definition QA system should explore only recently updated documents. Anyhow, this is arguable, because determining which are the novel pieces of information might entail the analysis of "old" resources or answer logs. Accordingly, some queries in this category are listed below:

• Everything I can find out about <definiendum>
• Find meaning and picture for <definiendum>
• Can you find anything on <definiendum>?
• What do I need to know about <definiendum>?
• Can you find me some information about/on <definiendum>?
• Can you please tell me everything about <definiendum>?
• Where can I find out (more/something) about <definiendum>?
• Can you find information on <definiendum>?
• What is new about <definiendum>?

In these illustrative definition questions, keywords such as "everything" signal that the intention of the user is to discover as much information as possible about the target definiendum. By the same token, cues like "some" and "anything" are more indicative of the desire for a more succinct and precise definition, whereas other keywords or expressions, including "basic" and "What do I need to know", are more inclined to aim at essential attributes of the definienda. There are also several cues that imply explicit special requirements, such as pictures.

6. Verbose. Some queries are typified by their length, in particular by the addition of unnecessary words, which makes them more long-winded than informative. Some samples found across the query logs include:

• Can you find any information on/about <definiendum> (for me)?
• What info/information (do) you have on/about/in <definiendum>?
• Can you find (me) (some) information on/about <definiendum>?
• I need to find some info/information about/on <definiendum>
• Can you please tell me the definition of <definiendum>?

On some occasions, this verbiage can generate noise during the query analysis phase, and/or while the definition QA system is fetching promising documents.

7. Short queries. In contrast with the previous category, some definition questions are very concise:

• <definiendum> means
• Describe <definiendum>
• Tell me/find out about <definiendum>
• The meaning/definition of <definiendum>
• <definiendum>

The first example is aimed at discovering explicit descriptive content with an exact wording across the target collection (i.e., "<definiendum> means ..."), while the last one is of ordinary usage. In this last case, the user enters only the definiendum. It can be conjectured that search engines are likely to return entries to knowledge bases at the top of their search results, due to the common utilisation of this last structure; particularly, every time they detect that the search string could be a plausible definiendum. This recognition can be enabled by checking the existence of the definiendum in the list of knowledge base entries. It is worth mentioning here that about 25% of queries sent to search engines are a request for definitions.

8. Colloquial queries. Some users structure their definition queries very colloquially: "What the hell is <definiendum>?" and "What was going on in <definiendum>?".

9. Contextual queries. Many times, the user is aware of the potential ambiguity of the definiendum, ergo he/she enriches the query with information about the target domain with the hope of obtaining more accurate results. Some delineative questions of this category are as follows:

• What does the abbreviation <definiendum> stand for among <context>?
• What does abbreviation <definiendum> stand for?
• From a/an <context> standpoint, what is <definiendum>?
• What does <definiendum> mean in relation to <context>?
• What does <definiendum> concept mean in <context>?
• What does <context> define as/say about <definiendum>?

Another way of expressing context is through the second type. Take for instance the definiendum: "The Bible and homosexuality". This kind of query subordinates the facts about homosexuality to the context of, or in consonance with, "the Bible". This depictive definiendum also offers more insight into how diverse sorts of features can be fused to form more complex definition questions.

10. Abbreviations. This sort of query is at the border between factoid and definition questions, because the answer can be a simple string, wherewith the acronym is resolved, or some sentences that describe the essence of the definiendum. As a concrete example, consider the phrase "The NSA stands for the principle of individual rights". At any rate, in the context of definition QA systems, it is a widespread fact that acronym resolution is interpreted as part of a desirable explanation.

• What does the abbreviation of <definiendum> stand for?
• What does the abbreviation <definiendum> mean in <context>?
• What does <definiendum> stand for on <context>?

1.6 What is a Definition?

One of the most intriguing aspects of definitions is the fact that people use terms whose meanings they do not necessarily know how to explain. This inherent separation between elucidating a meaning and usage is evidence suggesting that words are mere instruments of language. As tools of language, one of their crucial functions is aiding humans in forming the idea of a concept of reality in their minds [Wittgenstein, 1953]. According to [Swartz, 1997], the reason why people care about definitions is the continuing necessity for redefining concepts. Substantially, this redefinition happens in three distinct ways:

1. The expansion of vocabulary, that is, the transmission of the (new) meaning of a (new) term from one person to the other.

2. The reduction or elimination of ambiguity. On numerous occasions, different possible meanings for a given term exist, thus it is necessary to explicitly condition the usage of a word to a new meaning or to one of its existing meanings.

3. The diminution of vagueness.

With this in mind, it can be said that dictionaries do not define or create new terms; they rather report on the usages that people give to them. In so doing, dictionaries explicate the standard and most common usages, sometimes by giving hints. At any rate, dictionaries are not aimed at having the last word on the meaning of terms. In reality, there is no such thing as real meanings independent of the language and the persons [Wittgenstein, 1953].

In other words, all terms have their meaning subject to how they are utilised in a specific context, namely persons and language. Then, a word may be given various (a family of) meanings in consonance with its usage. An archetypal term is "God": this word has diverging connotations as there are several disparate religions (e.g., Buddhism, Christianity, and Islam). Thus, each religion has its own contextual meaning of "God". On further examination of the various denominations (e.g., Christianity), several diverging stipulations can be found which are directed at disambiguating or specifying their different conceptions (e.g., Catholics, Protestants, and Lutherans). Furthermore, it is still possible to zoom in, and elucidate the meaning of "God" in consonance with the various Catholic denominations; what is more, different meanings can still be distinguished across various Catholic home churches within each specific denomination. From another angle, the usage of the words "God" and "ace" can be synonymously found in the context of sports. As a natural consequence of these distinct conceptions, people have difficulty in understanding each other.

Incidentally, it is notable that the contextual definition queries mentioned in section 1.5, in conjunction with the disambiguation pages supplied by KBs such as Wikipedia, provide corroborative evidence supporting this marriage between a definition and a context. As a result of this alliance, the meaning of words may change in accordance with changes in the context. Roughly speaking, definitions emerge from the culture and society in which they are used [Wittgenstein, 1953]. The attachment of definitions to a context consequently stresses the great relevance of definition QA systems, as it is desirable to account for a strategy capable of automatically discovering potential meanings of a term (definiendum) in all its plausible contexts.
There are two aspects worth emphasising here: (1) the number of feasible contexts is not stipulated by the entries in KBs, but rather by the target collection of documents; and (2) this makes methodologies that exploit a limited set of KBs or glossaries less attractive, as they are more likely to specify the most prominent and prevalent meanings in the most predominant contexts. Since users might already know the denotations in KBs, it cannot be ruled out that they are especially targeted at discovering usages stipulated in narrow and precise contexts of which they are unaware (e.g., in a particular piece of legislation).

As stated in the following quote, apart from using words, humans also create terms and names in a certain context (e.g., “Climate-Gate”). As a matter of fact, it is a natural activity to name objects, or to attach labels to things:

“One thinks that learning language consists in giving names to objects. Viz, to hu- man beings, to shapes, to colours, to pains, to moods, to numbers, etc. To repeat - naming is something like attaching a label to a thing. One can say that this is preparatory to the use of a word. But what is it a preparation for?" (Philosophical Investigations I §26) [Wittgenstein, 1953].

The archetype of this activity is naming people when they are born. Names (definienda) are an anticipatory preparation for referring to human beings later. This scenario makes clear that: (a) names are facilitators of this reference process; (b) people are inclined to assign distinct names to the people surrounding them; (c) names can be borrowed from one context to another; and (d) concepts that are unlikely to be referred to are also unlikely to be labelled. This also explains why encyclopedias are more likely to embody articles about those names that are more likely to be referred to by a considerable amount of people. Certainly, the number of references to a particular name grows in consonance with the pertinence of its essential characteristics, for example, the kind of achievements accomplished by the referred person or organisation (e.g., the best or first in history). This relevance is also determined by the size of its context. These are normally famous people and organisations in some prominent contexts (e.g., biology, music, technology organisations, and sports). Inversely, it is harder for people who are relatively well-known only in minor or private contexts to be included in KBs. In summary, contexts fluctuate in size, and the larger a context is, the more likely the meanings it encircles are to be embraced by KBs. On a different note, the meaning of a definiendum does not depend on whether it refers to something that actually exists (e.g., unicorn). In reality, if something ceases to exist, the word or name for that thing may still have meaning [Wittgenstein, 1953]. Biography articles fit perfectly in this category, as they commonly describe dead people. This also helps us to understand that meanings of words or names are accumulative, and therefore of increasing ambiguity; particularly nowadays, with the skyrocketing capacity of electronic storage. Although some meanings are not frequently used anymore in our daily lives, old connotations can still be consulted by a user. In actuality, this infrequent usage makes them more likely to be prompted to a definition QA system. Returning to the working term, a user can enquire about some denotations of “God” utilised by ancient tribes and civilisations (e.g., the Golden Calf, Greek and Roman gods).

Some Characteristics of Definitions

One characteristic of explanations of meanings is that they tend to convey timeless properties of the definiendum. Since this class of attributes has a permanent relationship with the definiendum, they are very likely to be utilised for elucidating its meaning. The fact that a property is immutable, however, does not make it a prerequisite; the feature additionally has to characterise the definiendum. Consider the definition of the term “tree” supplied by

the Oxford Dictionary:

This explication details several timeless properties of trees: “have one stem, wooden stem, central stem, thick stem, have branches, the branches grow from the stem, may have leaves on the branches.”

However, these attributes are not only timeless, but normally common to the essence of all types of trees. Plainly speaking, in the past, the present, and probably in the future, trees will have these virtues. One can also argue about the essentiality of these attributes; that is to say, why is the trunk a required feature of a tree, while the fact that it is a “living thing” is not? This might sound like a valid argument, but it is somewhat debatable. The elemental quality of being a “living thing” is subsumed in the fact that a tree is a (type of) plant. Thus, in order to imply the essential properties of a tree as a plant, it is naturally preferable to say so explicitly: “Trees are plants”, or to do this implicitly by providing a subtle or clear hint of this fact: “the branches grow from the stem”. This sheds light on how a hypernym can influence an explanation of meaning. Now,

hyponyms of trees also inherit its properties; consider the “maple”:

As the descriptions correspond to lower terms in the semantic hierarchy, they are inclined to focus their attention on more distinguishing features (e.g., “leaves that have five points”). Expressly, it would be unusual to enrich an abridged definition of “maple” with the fact that it is a “living thing”, signifying that the definitions corresponding to preceding hypernyms in the hierarchy might be tacitly considered. In conclusion, values of properties (e.g., “types of leaves”) that categorise a definiendum into its different hyponyms are more likely to be utilised for explicating the meaning of its hyponyms than of the definiendum itself, whereas the explication of the definiendum would probably refer to the property (e.g., “it has leaves”). Similarly, “a thick wooden stem” distinguishes a “tree” from another plant like a “bush” or a “shrub”. This means these qualities, along with their categorising values, encapsulate part of the essence of the hyponyms. On the other hand, leading or indispensable characteristics of the hypernym are more likely to be tacit or subtly referenced. This conjecture relies on the extension of the definition: biographical articles, for instance, are likely to be thorough, and hence portray some attributes emanating from their hypernyms. Articles on painters, for example, talk about some of their qualities as a person (e.g., birth/death place and date). Contrarily, time-dependent attributes are normally explicitly stipulated. One way of doing this is by extending the name (e.g., “Trees in 1970”), or by adding a temporally anchored explanation to the description (“in 1970 trees decreased their height by 80%”). Since time and space are the same concept in different dimensions, features reliant upon a physical location are conveyed in the same way (e.g., “trees in Colombia” and “grow in northern countries”). This class of properties increases the complexity of the description. Another peculiarity of definitions is that a definiendum can be explicated with the assistance of a list of its foremost parts. These parts can typically be recognised by means of our senses (sight, hearing, smell, taste and touch). Distinctly, it can be observed in the working definition of “tree” that its parts, “stem”, “branches” and “leaves”, are utilised for recalling or projecting the image of an actual tree into the mind. However, listing the parts of the definiendum is usually not enough to unambiguously communicate its meaning. In fact, depicting a tree as a stem put together with a set of branches and leaves could cause the receptor of this explanation to arrive at a misconception, that is, not necessarily a tree. It is therefore necessary to specify the condition that the branches must grow from the stem, and the leaves from the branches, and/or to provide greater detail about its parts (e.g., “wooden stem” and “thick stem”). Definitions can also describe the definiendum by means of a paraphrase. The next are

delineative samples taken from the Oxford Dictionary:


These explanations of meanings also unveil that, in some cases, the description can use a synonym of the definiendum, or the definiendum itself (e.g., smell) [Swartz, 1997]. On the other hand, some definitions are elucidated in terms of a sequence of examples (Oxford Dictionary):


In this type, the meaning can be inferred from the likeness between the essences of the exemplars. This sort of definition, however, incurs the risk of making the reader deduce the wrong array of attributes permeating the examples, and therein lie the potential mistakes when applying the derived rules. In addition, the list of examples can be negative.

Take the term “animal” (Oxford Dictionary):

Lastly, the connection between a word and its meaning may be arbitrary. In a peculiar context, a person may arbitrarily choose to adopt the term “cold” to describe something which is actually warm, and the word “warm” for something which is cold [Wittgenstein, 1953].

1.6.1 Types of Definitions

Practically speaking, [Swartz, 1997] identified seven distinct kinds of definitions. This classification does not claim to be complete or exhaustive, but it still yields a good framework for fleshing out some of their assorted characteristics:

1. Stipulative definitions present new terms (e.g., abbreviations), or narrow the usage of a word in a special context. The introduction of this type of definition normally causes inconsistencies, as it brings about conflicts with the unstipulated use of the redefined term.

2. Lexical definitions are specifications of common usages of words. Dictionary definitions are the quintessence of this group, as they can also be utilised for regulating and standardising the utilisation of terms. At large, [Swartz, 1997] distinguished the following subcategories of this class: Synonyms, reports on the grammatical use of words, Species-Genus, Antonyms, Implicit and Explicit cause, Functional and Circular. Notably, Species-Genus definitions help discover an instance (species) of the definiendum (genus), while Circular definitions require the use of the definiendum in their explanations of meanings. The latter is commonly utilised for describing non-visible phenomena including scents, pains, and sounds; in other cases, the circle is broken by means of pictures.

3. Precising definitions refine the meaning of words whose explanation is nebulous in a given context. This is recurrently found in the legal and medical domains, where terms are vague or incomplete, and hence their definitions are constantly ameliorated.

4. Theoretical definitions are associations of words with a well-defined set of properties. This array of properties is predicated on well-established beliefs or theories. This can

be in the context of science or in daily life. For instance, the term “fruit” has some inherent attributes like its origin, being the product of something, and the time it takes to ripen. These properties can be abstracted and applied to refer to other concepts bearing these similarities, like “the fruits of your labour.”

5. Operational definitions are built with a specific purpose. Normally, they stipulate a condition that is imperative for drawing further inferences and/or understanding concepts. These are constructed on top of the operational description of the definiendum. In this kind of definition, the validity of these inferences does not necessarily hold when accounting for alternative operational descriptions. Excellent examples are currencies, for instance: “The Euro is equivalent to 1.495 American Dollars.”

6. Recursive definitions consist of two parts. To understand this clearly, take two different sentences where the first one contains the definiendum and its description, and the second establishes a link between a new concept and the explanation stressed by the first sentence. Consequently, a tacit connection between the new concept and the definiendum exists. A delineative example is the “parent of” relationship and the next two sentences: “George W. Bush is the parent of Barbara Pierce Bush.” and “George H.W. Bush is the parent of George W. Bush.”. The induction or recursive step supplies the definition “George H.W. Bush is the parent of a parent of Barbara Pierce Bush.”

7. Persuasive definitions are directed at making someone agree with or believe that something is true, when its validity could be at stake. To exemplify, [Swartz, 1997] brought up the case of citing definitions in heated arguments. Some simple cases are as

follows:

These examples convey debatable information, but they might get the support of many people, ergo appearing to be definitive. However, consider the next description in

juxtaposition to the previous example:

Certainly, this sentence renders less arguable information, hence unveiling the dubious nature of its counterpart.

1.6.2 Definitions in TREC

Fundamentally, the conception of definition utilised in the TREC challenge extends the idea of essentiality to, broadly speaking, biographical knowledge. In other words, answers to definition questions are not seen solely as succinct meaning explanations that embody the essence of the definiendum; rather, the focus of attention shifts to finding a set of essential nuggets about the definiendum. Take, for instance, the definition question “Who is

Flavius Josephus?” and the following illustrative text fragments:


The underlying motivation behind this change in interpretation stems from the fact that, as shown by query logs, web users can seek ample coverage and diverse nuggets about numerous concepts, persons, locations, events or things (see explicitly specified queries in section 1.5). This is principally because of the wide variety of information that can be found on the Internet. Certainly, not all titbits qualify for the final output, but rather only those classes of elements that, as [Han et al., 2006] put it, can typically be found across dictionaries and encyclopedias:

“The definition about a [definiendum] consists of conceptual facts or principal events that are worth being registered in a dictionary or an encyclopedia for ex- plaining the [definiendum]" [Han et al., 2006].

Although this definition does not explicitly state what is “worth” registering and what is not, it establishes a link between the content of KBs and answers to definition queries. Thus, “worth” can be perceived in terms of the support given by KBs. One way of quantifying this support could be the frequency count of the nugget type. In short, answers to definition questions in TREC tend, in general, to yield biographical knowledge.

1.6.3 Length of Definitions

Since it is all-important to provide enough context to ensure the readability of the final answer, definition QA systems prefer sentences to nuggets. However, some systems still output paragraphs or sentences, or provide links to the full page. The criterion for selecting the abstraction level of the answers depends largely on the goals of the definition QA system.

1.7 Evaluation Metrics

Generally speaking, there is no single metric that supplies a definitive and complete view of the performance of definition QA systems, in part because different systems or components are designed to meet distinct requirements, stressing different facets of their output. Ergo, several metrics have been utilised for evaluating systems and components in order to assess their different pivotal aspects. Among the most broadly used metrics, one can find: F(β)-Score, precision at k, Mean Average Precision (MAP) and Accuracy.

F(β)-Score has been used regularly for assessing definition QA systems in the TREC track since 2003 [Voorhees, 2003, Harman and Voorhees, 2005]. This measurement balances the precision and recall of a system by making a judgement about its output with respect to a manually generated ground truth. To exemplify, the TREC gold standard with respect to the 2004 definiendum “Nirvana” contains the nuggets shown in table 1.1:

ID  Category  Nugget
1   vital     seminal band
2   vital     originated in Seattle
3   okay      Grohl was Nirvana’s drummer
4   okay      Grohl later played in Foo Fighters
5   okay      Courtney Love was married to Kurt Cobain
6   vital     Cobain committed suicide
7   okay      Love fronted band Hole
8   okay      Cobain died at 27
9   okay      Cobain died in 1994
10  okay      Nirvana ended with Cobain’s death
11  okay      Cobain played guitar
12  okay      Cobain used heroin

Table 1.1: TREC 2004 ground truth for the definiendum “Nirvana”.

Basically, there is no explicit limit to the number of nuggets per definiendum, and each nugget is associated with a category. In this assessment, the ground truth gives a hierarchy of nuggets, which consists of “vital" nuggets (must be in the description of the concept) and “okay" nuggets (not necessary). These labels are assigned by human assessors in agreement with the requirements and the objectives of each particular definition QA system. The list of nuggets is essentially constructed in relation to the target corpus. In detail, this assessment takes the following aspects into consideration:

v = number of vital nuggets returned in a response.
o = number of okay nuggets returned in a response.
g = number of vital nuggets in the gold standard.
h = number of non-whitespace characters in the whole output.

Then, a length allowance (α) of 100 non-whitespace characters per matched nugget was imposed in order to cope efficiently with two central aspects: (a) different paraphrases of a particular nugget can be found, and hence their corresponding lengths differ from one rewriting to the other; and (b) many nuggets need their context to be readily comprehensible. The allowance of the output of a system is accordingly defined as α = 100 × (v + o). If the length of the response exceeds this allowance, the precision (P) obtained by the system is linearly downgraded:

P = 1 if h < α; otherwise P = 1 − (h − α)/h

As a matter of fact, the parameter α can also vary in agreement with the intentions of the application. For instance, descriptive sentences taken from web snippets are about 110 non-whitespace characters long on average [Figueroa and Neumann, 2007]; they can thus be interpreted as nuggets, and therefore a reference value of 110 might seem to be more appropriate. The recall (R) of the system is subsequently calculated as follows:

R = v / g

This ratio implies that the recall rewards a system solely for the amount of “vital" nuggets subsumed in the output. The F(β)-Score value is, eventually, computed by balancing the trade-off between precision (P) and recall (R):

F(β)-Score = ((β² + 1) × P × R) / (β² × P + R)

In TREC 2003, β was set to five, but since this value was heavily biased towards long responses, it was later decreased to three. Like α, the value of β also depends chiefly on the type of application. To reinforce this point, imagine a system that returns short text messages: it would prefer concise nuggets over long responses, while a web search engine would probably favour more contextual information. To illustrate this measure, consider the working output in table 1.2, belonging to a hypothetical definition QA system operating on the AQUAINT corpus:

Nuggets      Sentences
6v, 9o, 12o  Along with Seattle’s fame for high-tech innovation and cultural vibrancy, Seattle has long battled a grim notoriety for Seattle’s heroin problem, one that was aggravated by the 1994 suicide of Kurt Cobain, the grunge rock star and lead singer for the group Nirvana, who had struggled with heroin addiction.
NO-NUGGETS   Former Nirvana bassist Krist Novoselic talked about Former Nirvana bassist Krist Novoselic’s own political activism, and of witnessing the November riots at the World Trade Organization meeting in Seattle.
6v           But the scene was fading by 1994 after Nirvana front man Kurt Cobain killed Nirvana front man Kurt Cobain and other bands were stymied by drug problems, were breaking up or were backing away from the limelight.

Table 1.2: Sample response corresponding to the definiendum: “Nirvana” (after co-reference resolution).

Accordingly, in the working example, the set of parameters is as follows: v = 1, o = 2, g = 3, and h = 617 characters. The recall is consequently given by R = 1/3, and the allowance by α = 100 × (1 + 2) = 300. Since h = 617 is greater than α = 300, the precision of this output is determined by: P = 1 − (617 − 300)/617 = 1 − 317/617 = 300/617. Hence, the F(β)-Score is computed as follows:

F(β)-Score = ((β² + 1) × (300/617) × (1/3)) / (β² × (300/617) + 1/3)

Table 1.3 shows the values obtained for different βs:

β           1        2        3        4        5
F(β)-Score  0.39552  0.35570  0.34416  0.33962  0.33741

Table 1.3: Example of F(β)-Score (β=1. . .5).
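To make the computation concrete, the following short Python sketch (illustrative only, not the implementation evaluated in this thesis; the function name is invented) reproduces the values in table 1.3 for the working example with v = 1, o = 2, g = 3 and h = 617:

def nugget_f_score(v, o, g, h, beta=3, allowance=100):
    """TREC-style nugget F(beta)-Score.

    v -- vital nuggets returned in the response
    o -- okay nuggets returned in the response
    g -- vital nuggets in the gold standard
    h -- non-whitespace characters in the whole output
    """
    alpha = allowance * (v + o)                      # length allowance
    precision = 1.0 if h < alpha else 1.0 - (h - alpha) / h
    recall = v / g                                   # only vital nuggets count
    if precision + recall == 0:                      # avoid division by zero
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Working example (table 1.2): v = 1, o = 2, g = 3, h = 617
for beta in range(1, 6):
    print(beta, round(nugget_f_score(1, 2, 3, 617, beta=beta), 5))
# Prints the values of table 1.3, e.g. 0.34416 for beta = 3.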

As [Lin and Demner-Fushman, 2006, Figueroa and Neumann, 2007, Figueroa, 2008c] duly pointed out, whenever a definition QA system does not discover at least one “vital" nugget, it finishes with a recall equal to zero, ergo bringing about an F(β)-Score equal to zero. This issue grossly distorts the comparison of systems, because some systems can still output “okay" nuggets and their output lengths can dramatically differ. Furthermore, definition QA systems that do not return an answer are punished equally as systems that render only

unrelated information as a response. Since these zero values are completely useless for juxtaposing systems, [Lin and Demner-Fushman, 2006] adopted a new F(β)-Score to mitigate this problem. It modifies the recall to make allowances for weighted nuggets as follows:

R = (Σ_{x∈A} z_x) / (Σ_{y∈Z} z_y)

In this recall, A and Z are the set of all nuggets in the output and in the ground truth, respectively, while z_x and z_y denote the weights of two nuggets x and y. More specifically, [Lin and Demner-Fushman, 2006] computed these weights by averaging the opinions of several assessors regarding the text fragments in the gold standard. Precisely, the weight of each nugget is equal to the number of assessors who labelled it as “vital". These weights are subsequently normalised by dividing by the highest score. In the TREC assessment, the ground truth is manually compiled, and it is known that systems in this evaluation were able to find pertinent nuggets which did not make it onto the list, even as “okay" nuggets. As a means to support this observation, [Hildebrandt et al., 2004] brought out the following cases extracted from the AQUAINT corpus, which were left out of the TREC 2003 gold standard:

(a) The definiendum “Alberto Tomba” and the fact that he is Italian was not judged to be relevant.

(b) The definiendum “fractals” and the idea that they can be described by simple formulas, which is one of their most important virtues.

(c) Additional descriptions include the following:

Another problem with the TREC gold standard is its inconsistency: some types of nuggets are interpreted as crucial to one query, but irrelevant to other definienda, and hence excluded from their respective ground truth. To illustrate this, in contrast to “Alberto Tomba”, the nugget “Danish” made it as “vital" onto the list of the assessors concerning “Niels Bohr”. This sharp difference in the interpretation of attributes seriously compromises the quality of the TREC evaluation, because it lacks the basic level of coherence necessary to make performance comparable between distinct instances of the same kind of definiendum (e.g., persons). Simply put, it would be unreasonable to expect definition QA systems to arbitrarily or randomly include/exclude some descriptions in/from the ground truth; consistent guidelines are therefore needed to carry out robust and meaningful evaluations. Unfortunately, this is not an isolated case. A closer look at the TREC 2004 evaluation provides more insight into this issue. In TREC 2004, context queries were adopted in order to increase the complexity of the question answering task. A sample of context queries is shown in table 1.4:

ID    Category          Question
11.1  FACTOID           Who is the lead singer/musician in Nirvana?
11.2  LIST              Who are the band members?
11.3  FACTOID           When was the band formed?
11.4  FACTOID           What is their biggest hit?
11.5  LIST              What are their albums?
11.6  FACTOID           What style of music do they play?
11.7  OTHER/DEFINITION  Other

Table 1.4: TREC 2004 context queries about “Nirvana”.

In this context, definition QA systems are forced to ban or exclude all descriptions that respond to the previous questions prompted in the same context. As a repercussion, the output to definition questions and their corresponding ground truth become dependent upon the respective context. This is relevant because the array of previous queries can vary from one definiendum to the other, causing the gold standard for definition questions to be inconsistent and incoherent. For example, consider the juxtaposition of the context queries concerning “The Clash” and “Nirvana” (tables 1.4 and 1.5).

ID    Category          Question
39.1  FACTOID           What kind of music does the band play?
39.2  FACTOID           In what year was their first major album recorded?
39.3  LIST              Name their songs.
39.4  OTHER/DEFINITION  Other

Table 1.5: TREC 2004 context queries about “The Clash”.

While it is true that questions 11.6 and 39.1 can be homologated (analogously to the pair 11.5 and 39.3), it is also true that query 39.2 could equally be applied to “Nirvana”, in the same way that other queries in table 1.4 could also be prompted when dealing with “The Clash”. But not only do the preceding queries differ; tables 1.1 and 1.6 also reveal tangible differences between both ground truths.

ID  Category  Nugget
1   vital     mainstream success album Combat Rock
2   vital     Clash drummer Topper Headon
3   okay      mistakes treated as learning experiences
4   okay      Rancid plays faster than Clash
5   vital     Mick Jones co-founded Clash
6   okay      first U.S. tour in New York City
7   okay      richness of Clash songwriting

Table 1.6: TREC 2004 gold standard for the definiendum “The Clash”.

In a sense, nuggets such as “Clash drummer Topper Headon” (vital) and “Grohl was Nirvana’s drummer” (okay), along with questions like 11.2 in table 1.4, indicate that responses to potential preceding factoid or list queries could be incorporated into the list of nuggets corresponding to their respective definition questions, as also pinpointed by [Han et al., 2006]. In a nutshell, this inconsistency reduces the portability of the ground truth to other corpora

and to other definition queries of the same types. However, in favour of the design of the TREC gold standard, it can be said that it depends heavily on the target corpus; hence it might be perfectly possible that this corpus provides occurrences of some nugget types for one instance of a particular kind of definiendum, while for another instance of the same sort of definiendum occurrences of the same types of nuggets do not exist. The corpus, therefore, can be a partial justification for a purpose-built gold standard, and thus for the internal inconsistency in the ground truth. This is reasonable, but not strictly valid, because definition QA systems should also be capable of detecting when an instance of a particular kind of nugget cannot be found in the target collection. Certainly, this seems to be a difficult achievement so far, consequently making this justification eminently acceptable. Another aspect is the relevancy and sufficiency of the nuggets in the gold standard. Specifically, manifold nuggets within the ground truth might seem unnecessary or uninteresting for some users (e.g., “first U.S. tour in New York City” or “seminal band”). Conversely, several supplementary and novel nuggets, or nuggets closely related to ones already embraced by the ground truth, were totally ignored. As an illustration of this “idiosyncratic” issue, the reader can consider the next two nuggets which were deemed irrelevant by the assessors: “dead from an apparently self-inflicted gunshot wound” and “dead in Seattle”, along with their supporting sentence taken from the AQUAINT corpus (after co-reference

resolution):

Inherently, there is an element of subjectivity in the list of the assessors. First, the coverage of the list must be defined, that is, which facets will be conceived as relevant, in conjunction with their extent. Determining the extent of each facet demands the stipulation of the necessary attributes that must be taken into consideration. A textbook case are definienda aimed at historical figures: one of the facets is the event of their death, thus the assessor will have to settle whether its extension incorporates the death date, the death place, the cause of death, or a combination of these into the gold standard. Furthermore, after specifying the facets and their corresponding extension, the assessors need to decide the coverage of each text fragment. In other words, they are required to stipulate how many nuggets each facet will provide, that is, whether the death place and date will comprise one or two distinct nuggets. What is more, when performing the semantic matching, it is indispensable to restrict the length (sufficiency) of details that an answer candidate (nugget) must embody to qualify as a valid response (e.g., “band”, “grunge band”, or “rock and grunge band”). A promising solution to the issue of subjectivity is exploring similarities across Wikipedia abstracts about distinct instances of the same kind of definiendum. To neatly illustrate, figures 1.3 and 1.4 yield the abstracts corresponding to the working examples. In both cases, they draw attention to facts about their line-ups and how they changed over time, formation date, type of music, genre, origin, the message of their music (lyrics), information about their most important albums, and achievements. Under the assumption that the most pertinent classes of nuggets will have a higher frequency in the context of a particular sort of definiendum, the gold standard and the weights z_x can be assigned in tandem with these frequencies instead of the opinion of a certain group of assessors, who are entitled to their idiosyncrasies. Certainly, extracting a set of weighted nugget types for each plausible kind of definiendum would involve the construction of a full taxonomy of definitions. In the long term, this taxonomy can additionally help to distinguish which nuggets are more important and/or harder to detect, and to assign their weights accordingly.


Figure 1.3: Abstract extracted from Wikipedia about “Nirvana” (As of October 2009).

Special nuggets not contemplated in the taxonomy could obtain a pre-determined standard weight. Eventually, weights related to nuggets that can be applied to a definiendum can be normalised so that the recall ranges from zero to one. On a different note, in order to assess the performance of a particular system, each assessor has to manually validate which nuggets within the ground truth are included in the response of the definition QA system. This manual matching process is also a demanding task. As a means to deal with that, [Lin and Demner-Fushman, 2005] proposed an adaptation of POURPRE under the hypothesis that term co-occurrence statistics can serve as a surrogate for the manual semantic matching process. More specifically, they added unigram co-occurrences between nugget terms and words in the output. They confined this matching so that all nugget words must appear within the same response string (e.g., sentence). [Lin and Demner-Fushman, 2005] imposed this restriction under the assumption that nuggets represent coherent concepts, and are thus unlikely to be spread across various answer strings. Accordingly, the recall is now computed as the ratio of the sum of the matching scores for all “vital" nuggets to the total number of “vital" nuggets. Consequently, the length allowance is now given by 100 non-whitespace characters per nugget that obtains a matching score greater than zero. Additionally, they studied their matching strategy when enriched with features including different term weights and stemming. However, this matching strategy is useful when text fragments in the gold standard share a substantial number of words with most of their respective paraphrases across the target corpus. In this scenario, this automatic matching is naturally preferable to a manual semantic judgement. However, in larger target collections (e.g., the Internet), there is a rise in the probability of incorporating paraphrases into the response that do not share terms with the respective entry in the ground truth. The exclusion of these paraphrases from the gold standard, in conjunction with their inclusion in the final outputs, actually brings about a decline in the F(β)-Score, because they enlarge the response without increasing precision or contributing to the recall. In the case of web-based systems, this is more likely to happen, because definition QA systems discover manifold nuggets paraphrased with words not used by their counterparts within the gold standard. In regard to the ground truth, the extraction and construction of this gold standard is, in general, an arduous task, because it inherently entails manually checking the target corpus.
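As a rough illustration of this kind of automatic matching, the sketch below (a simplification; it is not the actual POURPRE implementation, and the function names are invented) scores each nugget by the largest fraction of its terms co-occurring within a single answer sentence, and turns the scores of the “vital" nuggets into a recall value:

def match_score(nugget, sentence):
    """Fraction of the nugget's terms that co-occur in one answer sentence."""
    nugget_terms = set(nugget.lower().split())
    sentence_terms = set(sentence.lower().split())
    return len(nugget_terms & sentence_terms) / len(nugget_terms)

def soft_recall(vital_nuggets, response_sentences):
    """Sum of the best per-sentence matching score of every vital nugget,
    divided by the total number of vital nuggets in the gold standard."""
    scores = [max(match_score(n, s) for s in response_sentences)
              for n in vital_nuggets]
    return sum(scores) / len(vital_nuggets)

Under such a scheme, the length allowance would then count 100 non-whitespace characters for every nugget whose best score is greater than zero, as described above.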


Figure 1.4: Abstract excerpted from Wikipedia about “The Clash” (As of October 2009).

As a rough rule of thumb, in the evaluations in sections 5.3 (page 128) and 6.5 (page 149), the TREC 2003 question set consists of 50 different definienda: 30 are people (e.g., “Alberto Tomba"), 10 are organisations (e.g., “ETA") and 10 are other entities (e.g., “vagus nerve"). In this assessment, [Figueroa, 2008b, Figueroa and Atkinson, 2009] retrieved about 300 web snippets for each definiendum; therefore, about 15,000 web snippets must be manually inspected in order to determine the gold standard. Assuredly, this number can double to about 30,000 when a baseline system is taken into consideration. As to web-targeted definition QA systems, [Figueroa, 2008b, Figueroa and Atkinson, 2009] preferred equally weighted nuggets, that is z_y = 1, ∀ z_y ∈ Z. Their reason to use uniform weights is three-fold: (a) under the assumption that more relevant nuggets will be embraced by a larger amount of documents, they attempted to weight them according to the number of snippets where they occur, but this caused all systems to obtain a high recall, because nuggets high in frequency are usually easier to discover, and little is gained when nuggets low in frequency are identified; (b) if a highly frequent nugget is missed by a system, it needs numerous low-frequency nuggets to recover from the loss, bringing about a gross distortion of the performance, as many of these low-frequency nuggets can be hard to discern; and (c) the gold standard is largely reliant on the search strategy, and therefore the distribution of weights could vary sharply from one search approach to the other, turning into a critical factor when likening different systems. All things considered, [Figueroa, 2008b, Figueroa and Atkinson, 2009] defined an F(β)-Score where the recall is calculated as the ratio of the number of detected nuggets to the number of nuggets in the ground truth of the respective definition question. Ideally, these weights should be given by a purpose-built standard taxonomy, as mentioned earlier. In praxis, this changes the value of the recall obtained by the output in table 1.2 to R = 3/12 = 1/4. Correspondingly, the F(β)-Score is worked out as follows:

F(β)-Score = ((β² + 1) × (300/617) × (1/4)) / (β² × (300/617) + 1/4)

For values of β equal to three and four, this formula returns 0.26 and 0.25, respectively. Both values are considerably lower than their counterparts in table 1.3.

Precision at k. A disadvantage of the F(β)-Score is the fact that it does not assess the ranking order of the nuggets/sentences within the response. On the other hand, precision at k measures the ratio of sentences that are actual definitions among the first k positions of the ranking. This is a key issue whenever definition QA systems output sentences, as it is also all-important to determine whether the highest positions of the ranking carry actual descriptive information. In the working example outlined in table 1.2, the precision values at the three different levels are: k = 1 ⇒ 1; k = 2 ⇒ 0.5; and k = 3 ⇒ 0.667.

Mean Average Precision (MAP). The “uncomfortable" aspect of the previous metric is that a different value is computed for each level k. In order to deal with this, the MAP consolidates these k precision values by averaging them as follows [Manning and Schütze, 1999]:

MAP(Q) = (1/|Q|) Σ_{j=1}^{|Q|} (1/m_j) Σ_{k=1}^{m_j} Precision_at_k

Here, Q is a question set (e.g., TREC 2003), and m_j is the number of ranked sentences in the j-th output. Habitually, m_j is truncated to fixed values (e.g., one or five) that are of interest to the evaluators. This metric is selected because of its ability to show how good the outcomes are in the first positions of the ranking. Simply put, for a given question set Q, MAP-1 shows the fraction of questions that ranked a valid definition at the top. In the working example in table 1.2, MAP-3 is equal to 0.7223.
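The following minimal sketch (illustrative only) computes precision at k and this truncated MAP from a boolean marking of which ranked sentences are genuine definitions; it reproduces the values of the working example in table 1.2:

def precision_at_k(is_definition, k):
    """Fraction of genuine definitions among the top-k ranked sentences."""
    return sum(is_definition[:k]) / k

def map_at_m(ranked_outputs, m):
    """MAP-m as defined above: average precision_at_k over k = 1..m for each
    question, then average over the whole question set."""
    per_question = [sum(precision_at_k(r, k) for k in range(1, m + 1)) / m
                    for r in ranked_outputs]
    return sum(per_question) / len(per_question)

# Working example (table 1.2): the first and third sentences are definitions.
ranking = [True, False, True]
print([round(precision_at_k(ranking, k), 3) for k in (1, 2, 3)])  # [1.0, 0.5, 0.667]
print(round(map_at_m([ranking], 3), 4))  # 0.7222 (0.7223 in the text, which rounds P@3 first)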

Accuracy. This measure reflects another view of the response of definition QA systems. In short, it is the ratio of the number of sentences labelled as definitions that are genuine definitions, plus the number of sentences labelled as non-definitions that are actual non-definitions, to the total number of answer candidates. In the illustrative example in table 1.2, the accuracy is equal to 0.6667, assuming that the system picked all answer candidates.

R-precision. This indicates the average precision of the top R ranked sentences with respect to a set Q of definition queries, where R varies for each question in Q in congruence with the number of putative answers that are genuine definitions. In the example in table 1.2, R-precision is equal to 0.5, under the assumption that the system selected all genuine answers.
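For completeness, a small sketch (again illustrative, not taken from the thesis) of the last two measures on the same working example, assuming, as stated above, that the system selected all three candidates as definitions:

def accuracy(predicted, actual):
    """Share of candidates whose definition/non-definition label is correct."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def r_precision(is_definition):
    """Precision of the top-R ranked sentences, with R set to the number of
    genuine definitions among the candidates."""
    r = sum(is_definition)
    return sum(is_definition[:r]) / r

actual = [True, False, True]    # which candidates are genuine definitions
predicted = [True, True, True]  # the system selected every candidate
print(round(accuracy(predicted, actual), 4))  # 0.6667
print(r_precision(actual))                    # 0.5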

1.8 Conclusions

This chapter focuses its attention on the introductory aspects of definition question answering. These aspects are the common ground for understanding the problems and their corresponding solutions presented in the subsequent chapters. For starters, this chapter goes over the trade-off between coverage and trustworthiness of distinct sources of potential answers. It then sketches the modules that are amply used by definition QA systems, and it subsequently familiarises the reader with the international evaluation conferences. Next, this chapter dissects some relevant facets of the query analysis stage. It also discusses some intrinsic characteristics of definitions, and it eventually elaborates on different evaluation metrics.

Chapter 2 Crawling the Web for Definitions

“The Internet is the world’s largest library. It’s just that all the books are on the floor." (John Allen Paulos)

“What, exactly, is the Internet? Basically it is a global network exchanging digi- tised data in such a way that any computer, anywhere, that is equipped with a device called a “modem" can make a noise like a duck choking on a kazoo." (Dave Barry)

2.1 Introduction

Without a shadow of a doubt, one of the most fundamental and crucial components of a definition Question Answering (QA) system is the module that trawls the Internet for definitions. The success of the posterior phases of the answering process largely depends upon its performance. In a nutshell, definition QA systems must account for an efficient and general strategy that lets them fetch a wealth of descriptive information about definienda of various characteristics, targeted at wide-ranging topics. An effectual search approach, for instance, should efficiently cope with definienda such as abbreviations, books, events, locations, organisations, personal names, and sports teams, as well as technical terms. In practice, a major difficulty of designing a successful search technique lies in the precise nature of the input of the user. Frequently, the only significant piece of information that the user supplies is the definiendum. The user seldom provides additional relevant input like the desired sense, extra contextual hints, several spellings or aliases, and/or different morphological forms. As a matter of fact, most of the time the user has no idea about the definiendum, and for this reason he or she is soliciting its definition. In essence, the user is normally unaware of the possibility of contexts different from the one he/she is thinking about, or of the potential contribution of various spellings/aliases. This unawareness prevents the QA system from making allowances for valuable knowledge that could eventually aid in sharply increasing the precision of the search, and therefore in outputting a more accurate response to the user. Anyhow, when this sort of fine detail is entered, it is not necessarily easy to identify its role in the query. Another decisive factor involved in the design of search techniques is the response time. Definition QA systems have a limited time window to produce the correct answer for the user. This time constraint, in consequence, militates against an exhaustive search across the entire collection, in particular when tackling massive collections of documents (e.g., the Internet). In this situation, full off-line processing is implausible, and under these


circumstances, systems have to perform a zooming approach that leads to the most promising set of text spans, such as paragraphs or sentences. In truth, useful data can be lost during this zooming process. This chapter deals at greater length with assorted strategies that assist (web) definition QA systems in finding documents carrying descriptive information about the definiendum. The search methods discussed in this chapter account solely for the definiendum as the input of the user, which is the most recurrent case across query logs. In more detail, this chapter fleshes out various strategies to discover descriptions across the Web. These methodologies differ in their complexity, coverage, and reliability as well. To begin with, the next section brings out the concept of definition relational databases. Section 2.3 subsequently gives information about the exploitation of Knowledge Bases (KBs). Next, section 2.4 examines approaches predicated on task-specific keywords. Section 2.5 touches on the utilisation of lexico-syntactic constructs, and section 2.6 then extends this technique by means of mining Wikipedia knowledge. Later, section 2.7 raises the subject of searching for descriptions in other languages, section 2.8 suggests future trends, and section 2.9 concludes this chapter.

2.2 Definition Relational Databases

There are three characteristics that make the Internet highly distinctive from other collections: (a) web users are constantly adding, updating and removing documents from their web-sites, making the Web a collection of documents that is always changing; (b) it is composed of a tremendous number of documents, which some deem to be infinite due to dynamic pages; and (c) these documents encompass a broad range of formats, including plain texts, videos, images, PostScript files, Portable Document Format (PDF) files, Word documents, audio and HyperText Markup Language (HTML) pages. Under these conditions, commercial search engines, such as Google and MSN Search as well as Yahoo! Search, are compelled to allocate enormous computational resources to keep an updated index of this vast and diverse collection of documents. These vanguard Information Retrieval (IR) engines serve as an interface between users and the Internet. On the contrary, other collections of documents, namely the ones used in the Text REtrieval Conference (TREC) and the Cross Language Evaluation Forum (CLEF), are considerably smaller in size, their content remains mainly static, and they consist chiefly of plain texts. These three characteristics make it easier to index and navigate them for descriptive information. In so doing, some definition QA systems, like [Kosseim et al., 2006, Qiu et al., 2007], usually take advantage of open-source search software, such as Lucene. This kind of tool allows the creation of a variety of views of the collection, and ergo assists definition QA systems in zooming in to the most propitious portions of text by means of manifold query-word matching techniques. Of course, the demand for more specific indexing strategies and search functionalities, as well as for reducing the retrieval time, has led definition QA systems, like [Fernandes, 2004, Hildebrandt et al., 2004, Katz et al., 2004], to the design of purpose-built techniques. Above all, this methodology boosted the recall of descriptive phrases by automatically constructing an immense relational database embodying nuggets distilled from every article in the corpus. These nuggets are about every entity within this corpus. Some illustrative entries in this type of repository are outlined below:

Definiendum    Nuggets
Abby Cadabby   • a fairy-in-training who first appeared in the 37th season of the children’s television show “Sesame Street”.
C. S. Lewis    • Clive Staples “Jack” Lewis (29 November 1898 - 22 November 1963), commonly referred to as C. S. Lewis, was an Irish author and scholar.
               • known for his work on medieval literature, Christian apologetics, literary criticism, and fiction.
               • best known today for his series “The Chronicles of Narnia”.
Sabena         • a former national airline of Belgium, which mainly operated from Brussels National Airport.

Table 2.1: Examples of entries in a definition relational database (nuggets surrounded with their contexts).

Thus, trawling the target corpus for a definition consists of looking up the right entry in this relational database. In other words, this methodology partly or fully answers definition questions before they are asked. This view works well with static collections, because it yields fast access to the definition relations when entries are alphabetically sorted and the engine accounts for an efficient look-up algorithm (e.g., binary search). If this database is huge, sophisticated indexing approaches must be taken into account when designing it. There are, however, some essential aspects that make this class of strategy less attractive:

• If the collection does not stay totally static, the relational repository must be updated every time a document is added, removed or modified.

• The creation of this relational database implies preprocessing all documents beforehand, despite the fact that most of them will contribute entries that are very unlikely to be accessed later.

• These first two points underline the pertinence of the definition nugget detector. If it demands large computational resources, the plausibility of this kind of technique depends heavily on the rate of updates and the size of the collection.

• Categorically, most entities have aliases or synonyms (e.g., “Thomas Hanks"/“Thomas Jeffrey Hanks" and “George Walker Bush"/“George W. Bush"). These aliases play a pivotal role in the performance of this strategy, because different entries in this repository can coincide with distinct aliases of the same entity (see table 2.2). Hence, descriptive information subsumed in entries that do not perfectly match the definiendum will be missed, bringing about a decline in recall. This problem becomes graver whenever the alias formulated by the user does not exist in the database, but alternative names do exist.

• As definienda normally have numerous senses, there is a need to split the entries in this database into their respective different senses. A concrete example is depicted in table 2.2: the entry “Tom Hanks" should contain separate references to the actor and the seismologist. Definitively, this need for sense discrimination grows in accordance with the size of the collection, and with whether the collection is open-domain or domain-specific.

• Lastly, the difficulty of extracting definitional relations is dependent on the variety of document formats and on the degree of structure of the articles within the collection.

Definiendum           Nuggets
Tom Hanks             • an Academy Award-winning actor.
                      • an American seismologist.
Thomas Jeffrey Hanks  • an actor born in 1959 in California.

Table 2.2: Examples where problematic issues regarding definition relational databases can be seen.

Preferably, entries in this relational repository should be grouped in agreement with entities and senses instead of entities only, as shown in table 2.3. This grouping approach consists of two tables. Since the definiendum prompted by the user can be ambiguous, the first table maps it to its possible senses within the collection, and the second brings together the nuggets from the respective aliases. From one angle, it is the underspecified entry given by the user which causes the need for disambiguation, presumably due to his/her lack of knowledge about the content of the collection. While the user can explicitly enter “Thomas Jeffrey Hanks", it is more probable that one of its underspecified variations (e.g., “Tom Hanks") is given to the system. On the other hand, it is the structure of the collection which raises the ambiguity, as it relies largely on the number of different descriptions within its various contexts.

Sense #  Nuggets
14534    • an Academy Award-winning actor.
         • an actor born in 1959 in California.
56298    • an American seismologist.

Definiendum           Sense #
Thomas Jeffrey Hanks  14534
Tom Hanks             14534, 56298

Table 2.3: Preferable structure of a relational database.
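The following minimal sketch (illustrative; the sense identifiers and nuggets are simply those of table 2.3, not data from an actual system) shows this two-table lay-out in code: the first table maps every alias to its candidate sense identifiers, and the second maps each sense to its nuggets, so an underspecified query such as “Tom Hanks" retrieves the descriptions of every sense it may denote.

# Table 1: alias of the definiendum -> candidate sense identifiers.
ALIAS_TO_SENSES = {
    "thomas jeffrey hanks": [14534],
    "tom hanks": [14534, 56298],
}

# Table 2: sense identifier -> nuggets describing that sense.
SENSE_TO_NUGGETS = {
    14534: ["an Academy Award-winning actor",
            "an actor born in 1959 in California"],
    56298: ["an American seismologist"],
}

def lookup(definiendum):
    """Return the nuggets of every sense the (possibly ambiguous) query may refer to."""
    senses = ALIAS_TO_SENSES.get(definiendum.strip().lower(), [])
    return {sense: SENSE_TO_NUGGETS[sense] for sense in senses}

print(lookup("Tom Hanks"))
# {14534: ['an Academy Award-winning actor', 'an actor born in 1959 in California'],
#  56298: ['an American seismologist']}

In a disc-based repository, the alias table would typically be kept alphabetically sorted and consulted with a binary look-up, in line with the remark above about efficient look-up algorithms.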

At any rate, automatically achieving this lay-out is a very complex task. In the first place, it is necessary to identify when two distinct aliases refer to the same entity. Here, the technique for learning aliases proposed by [Figueroa, 2008a] might help (see section 2.6.1). This approach searches for sentences that match some lexico-syntactic patterns that often express aliases, and creates a repository of pairs accordingly. Another interesting method was introduced by [Wu et al., 2004], which profited from synsets in WordNet to find the aliases. In the second place, once the descriptive information corresponding to all aliases of a particular definiendum is gathered into one single group, it must be split into groups in concert with the respective senses. This task of finding classes of similar contexts, such that each class represents a single word/entity sense, is known as word sense discrimination [Purandare and Pedersen, 2004, Lupsa and Tatar, 2005]. A quality of natural language texts that might help to tackle this problem for definition questions is the One Sense Per Discourse principle [Gale et al., 1992]. This asserts that if a polysemous word appears two times in a well-written discourse, it is extremely likely that both occurrences share the same sense. In the case of definition QA systems, this claim is not straightforwardly applicable, because dictionary or disambiguation pages do not observe this principle, whereas blogs and forums are more likely to meet this criterion. Encyclopedias such as Wikipedia can, nevertheless, offer fruitful information, including translations, that can greatly support discriminating senses (cf. [Brown et al., 1991, Dagan et al., 1991]). One last remark on the relational definition database is that each nugget should have a pointer to the source document; however, this pointing strategy is largely reliant on the update approach used by the repository. The works of [Hildebrandt et al., 2004] and [Katz et al., 2004] are aimed at discovering answers in the AQUAINT corpus, while Google offers a feature for crawling the Web for definitions.

Figure 2.1: Example of Google’s define feature (definiendum: “Tom Hanks").

Every time a user enters “define:definiendum", the search engine returns an array of glossaries that embrace definitions of the definiendum. To the best of our knowledge, it is hitherto unknown how Google gathers these glossaries: Which strategies are involved? What is manual and what is automatic? This uncertainty makes this strategy difficult to analyse. Nonetheless, [Xu et al., 2005] observed that these glossaries seem to have some common properties: the pages are titled with task-specific clues including “glossary" and “dictionary", and the terms in the page are alphabetically sorted and presented with the same style, for instance italics and bold print. Under this observation, this method yields wider coverage, but succinct definitions taken from different glossaries are very likely to convey redundant information, while at the same time, new concepts are rarely found in glossaries, but rather in web-sites such as blogs or forums. Along with glossaries, Google collects definition snippets from KBs, such as WordNet and Wikipedia (see figure 2.1). Google accounts for the first definition lines in these resources, and outputs them for the user. The underlying goal is to offer users a set of concise descriptions of the definiendum. In short, this feature provided [Cui et al., 2004b] with descriptions for 25 out of the 64 TREC 2004 questions. In addition, Google made available another experimental service (Google Timeline) that allows users to discover pertinent events associated with the definiendum (see figure 2.2). This class of resource is extremely beneficial, as manifold nuggets are temporally anchored. However, this kind of tool is still in its first steps, and therefore needs many improvements to be considered as authoritative as KBs. Specifically, [Katz et al., 2007] realised that Google Timeline mixes references to assorted items bearing the same name. In this trend, web-sites can also be found that commemorate the most important historic events of each day. In these archives, one can find births and deaths of celebrities, independence days, dates related to world records, important achievements, etc. Some prominent archives include: www.thisdaythatyear.com, www.worldofquotes.com, and www.historyorb.com, as well as www.theday2day.com. When a QA system is geared towards the Web, some of the subsequent aspects must be taken into consideration. Firstly, web documents are added, updated and removed constantly, and hence the index must be updated regularly.

Figure 2.2: Example of Google’s timeline feature (definiendum: “Alan Moore").

Secondly, the size of the Internet is commonly much larger than any domain-specific collection, and thus computing a look-up table is computationally demanding. It also entails a crawling methodology, scheduling, the retrieval of documents, etc. Fortunately, this task is efficiently performed by commercial search engines such as Altavista, Google and MSN Search as well as Yahoo! Search. These state-of-the-art commercial engines make a search interface to the Web available. At any rate, commercial search engines are tuned to perform IR tasks, or to bring in substantial revenue from advertisements, and ergo not to serve specific QA needs. Therefore, adapting this technology to serve QA purposes is attractive.

2.3 Using Specific Resources

The overwhelming majority of TREC definition QA systems discover descriptive text fragments on the Web by profiting from online KBs. This prominent strategy commonly involves downloading the full document and designing a specialised wrapper for each KB [Hildebrandt et al., 2004, Sun et al., 2005]. Table 2.4 lists the most widely used KBs and their corresponding systems. A breakdown of these techniques is as follows:

• [Jijkoun et al., 2003] capitalised on www.biography.com and WordNet. Whenever nothing was found in these KBs, this method utilised Google, with queries formed by putting together the name of the person in question with varying subsets of a predefined array of hand-crafted features, including "born", "graduated", and "suffered". For questions directed at organisations, an identical approach was used, but with a set of

Knowledge Base | Systems
WordNet glossaries | Jijkoun et al. [2003], Echihabi et al. [2003], Gaizauskas et al. [2003], Xu et al. [2003, 2004], Cui et al. [2004c], Wu et al. [2004, 2005a], Zhang et al. [2005], Zhou et al. [2006], Chali and Joty [2007]
Merriam-Webster dictionary | Xu et al. [2003, 2004], Hildebrandt et al. [2004], Katz et al. [2004], Zhang et al. [2005]
Wikipedia | Xu et al. [2003, 2004], Gaizauskas et al. [2004], Cui et al. [2004c], Ahn et al. [2005], Zhang et al. [2005], Kosseim et al. [2006], Hickl et al. [2006], Shen et al. [2006], Qiu et al. [2007], Schlaefer et al. [2007], Razmara et al. [2007], Katz et al. [2007], Shen et al. [2007]
Columbia Encyclopedia | Xu et al. [2003, 2004]
www.s9.com | Xu et al. [2003, 2004], Zhang et al. [2005], Hickl et al. [2006]
www.encyclopedia.com | Wu et al. [2004, 2005a], Zhang et al. [2005], Zhou et al. [2006]
Britannica Encyclopedia | Gaizauskas et al. [2003, 2004]
answers.com | Sun et al. [2005]
www.biography.com | Jijkoun et al. [2003], Echihabi et al. [2003], Cui et al. [2004c], Hickl et al. [2006]
Who2 | Prager et al. [2003]
Google Timeline | Katz et al. [2007]
web snippets | Jijkoun et al. [2003], Xu et al. [2003, 2004], Cui et al. [2004c], Chen et al. [2006], Qiu et al. [2007]
full web pages | Gaizauskas et al. [2004], Hickl et al. [2007], Schlaefer et al. [2007]
unspecified online resources | Han et al. [2004], Wu et al. [2005a], Zhou et al. [2006]

Table 2.4: KBS utilised by TREC-oriented systems.

properties concerning organisations. As a final fallback option, they simply submitted the “definiendum" to Google and mined the returned surrogates afterwards.

• [Echihabi et al., 2003] sifted 14,414 biographical entries from www.biography.com. They used these entries to gather core biographical knowledge for specific people as well as identify words that are indicative of biographical information.

• [Xu et al., 2003, 2004] extracted existing definitions of the definiendum from: WordNet glossaries, Merriam-Webster dictionary, the Columbia Encyclopedia, Wikipedia, the biography dictionary at www.s9.com and Google snippets.

• [Han et al., 2004] took advantage of biographical resources for tackling definienda directed at persons. These resources provided their system with pieces of information about personal identities and related events.

• [Sun et al., 2005] preferred wrappers for specific websites to general search engines; in this way they obtained more precise results. Their system accumulated existing definitions from answers.com.

• [Katz et al., 2004] exploited the Merriam-Webster online dictionary for acquiring definitions. Keywords from these definitions were used in a Lucene query to download documents from the AQUAINT corpus afterwards.

• [Cui et al., 2004c] benefited from KBs (i.e., WordNet and www.biography.com, as well as Wikipedia) and 200 web snippets.

• [Zhang et al., 2005] studied the influence that several resources exert on the answering process: www.s9.com, www.encyclopedia.com, Wikipedia, and the Merriam-Webster dictionary, as well as WordNet glossaries.

• [Ahn et al., 2005] profited from the online encyclopedia, and queried an integrated IR engine built on top of Wikipedia for the definiendum.

• [Wu et al., 2004, 2005a, Zhou et al., 2006] mined definitions of the definiendum from a number of online KBs: WordNet glosses and other online dictionaries, such as the biography dictionary at www.encyclopedia.com.

• [Hickl et al., 2006] collected descriptive phrases from Wikipedia, www.s9.com, and www.biography.com.

• [Kosseim et al., 2006] distinguished marking terms related to the definiendum across the Wikipedia online dictionary. They found the proper Wikipedia article by crawling the domain using the Google Application Programming Interface (API) and the definiendum as query. In this method, the first Wikipedia article that satisfies the query is taken. Whenever no Wikipage satisfies the query, the query is loosened. Eventually, if still no Wikipage is discovered, the top N AQUAINT documents are then utilised for discovering the marking terms.

• [Schlaefer et al., 2007] took advantage of Wikipedia articles. For targets not found in Wikipedia, they made use of Google as a fallback solution by fetching the first 100 hits.

• [Qiu et al., 2007] acquired a corpus related to the definiendum from Wikipedia, and utilised the first 100 Google hits returned by submitting the definiendum. Analogously to this strategy, [Hickl et al., 2007] took into account the first 100 pages retrieved from Google bearing the definiendum.

• [Chali and Joty, 2007] utilised WordNet glossary entries, while [Razmara et al., 2007] and [Gaizauskas et al., 2004] exploited Wikipedia and the Britannica Encyclopedia, respectively, as a source of descriptive phrases.

• [Katz et al., 2007] harvested descriptive information from two distinct KBs: Wikipedia and Google Timeline.

• [Shen et al., 2006, 2007] did not consider special facilities for responding to definition queries, apart from the implementation of a basic Wikipedia-based snippet retrieval component.

KBS COVERAGE: As this breakdown reveals, most systems are inclined to capitalise on KBs instead of web snippets or full web documents. In particular, the two most prominent and prevalent mines of descriptive information (KBs) are Wikipedia and WordNet. The reason for this is that experiments have shown great improvements brought about by KBs [Cui et al., 2004c]. Chiefly, they cooperate on getting more precise results [Sun et al., 2005, Zhang et al., 2005] at the expense of a detriment to coverage. Rigorously speaking, [Cui et al., 2004c] found that Wikipedia covered 34 out of the 50 TREC 2003 definition queries, whereas 23 out of 30 questions with respect to people were covered by www.biography.com, all together providing answers to 42 queries. Further, [Hildebrandt et al., 2004] were assisted by a wrapper for the online Merriam-Webster dictionary, which retrieved about 1.5 nuggets per question.

KBS IMPACT: In their work, [Zhang et al., 2005] examined the relationships and the differences between definitions from several KBs. They analysed the coverage supplied by five distinct resources

to the 50 and 64 definition questions distilled from the TREC 2003 and 2004 tracks, respectively. Their findings are shown in graph form in figure 2.3.

Figure 2.3: Contribution of different KBs (adapted from [Zhang et al., 2005]).

This graph unveils that these five resources combined covered between 80% and 90% of the queries. Predominantly, Wikipedia is the largest contributor, followed by www.encyclopedia.com. However, in this kind of comparison, slight differences are not necessarily meaningful, because the distribution of the types of definienda in the question set plays an essential role. For instance, the contribution of www.s9.com will be constrained by the number of definienda in the question set that are aimed at biographies. Another aspect to consider when comparing KBs is their updating rate, that is, how many articles are added in a given period of time. Ostensibly, KBs with a higher rate are naturally preferable.

The number of queries covered by the KBs is just one aspect. Another determining factor is the extent to which these resources answer a definition question. The observations of [Zhang et al., 2005] reveal that their definitions can be short or long, concise or detailed, and accordingly yield few or many descriptive nuggets. By and large, [Zhang et al., 2005] contrasted a TREC-oriented system that makes allowances for these five resources with another which does not. Both systems operated on the TREC 2004 data set. They showed that these five KBs substantially bettered the F(3)-Score from 0.231 to 0.404. The reader is also encouraged to look at sections 4.8 and 4.9 on pages 91 and 93, respectively, for complementary information on this subject.

KBS VERSUS WEB-SNIPPETS AND FULL-DOCUMENTS: The reasons that prevent definition QA systems from making use of web snippets and full documents are five-fold: (a) they are noisy, meaning they can express not only closely related descriptive sentences, but also large amounts of spurious and unrelated, or loosely connected, information about the definiendum, creating the need for a technique that weeds out false hits [Xu et al., 2004]; (b) finding web pages that are more likely to provide definitions inherently involves the design of an ad-hoc search procedure that sharply ameliorates the recall of promising documents; (c) the definiendum entered by the user can be ambiguous (e.g., "Tom Hanks" and "Jim Clark"), and it is thus difficult to foresee whether or not the desired sense will match one of its predominant senses on the Internet; when this desired sense is not explicitly stipulated, it is hard to satisfactorily discriminate the senses existing across the fetched documents, while systems can retrieve information from Wikipedia unambiguously by seeking articles on the definiendum [Schlaefer et al., 2007]; (d) in online encyclopedias, users make sure that the most relevant information is put into words in a concise and concentrated fashion with little noise [Kosseim et al., 2006, Schlaefer et al., 2007], whereas finding the same amount of data across non-definition full web pages would inevitably imply processing a significant amount of documents; and (e) in the case of full documents, there is a download time involved that in some cases can be

extremely long, whereas definition QA systems can incorporate a built-in IR engine on top of Wikipedia [Ahn et al., 2005]. Another factor that makes working with full documents less attractive is that they can be found in manifold formats, including PDFs and plain text, hence specialised text extraction techniques must be taken into consideration.

DOMAIN-SPECIFIC KBS: Although some KBs supply articles on wide-ranging topics (e.g., Wikipedia), most of them are targeted at one particular type of definiendum, like persons and locations, or at one domain, including movies, sports, and books. The former type can be conceived of as open-domain, the latter as closed-domain or domain-specific. The reason why closed-domain KBs are seldom utilised by definition QA systems (see table 2.4) is that a wrapper needs to be implemented around each of them, and whenever any of these resources changes its lay-out, the system must revise the implementation of the respective wrapper. One potential advantage of closed-domain over open-domain KBs is that they can offer a way of detecting ambiguity and discriminating some senses of the definiendum. That is to say, different entries in distinct domain-specific KBs can signal different senses. Some definienda, however, can still span several domains. For example,

WEB-SNIPPETS: With regard to web snippets, as table 2.4 shows, comparatively fewer systems take advantage of them as a source of descriptive information. One of the main disadvantages is the intentional breaks inserted by search engines. These breaks truncate sentences, bringing about not only the loss of critical data, but also the incorporation of misleading knowledge. To reinforce this, consider the next web snippet in relation to "Jar Jar Binks"

returned by Yahoo! Search:

The source text of this illustrative surrogate is provided in order to adequately understand this problem:

NOISE IN WEB-SNIPPETS: Here, the first piece of text within the retrieved snippet renders a description of a Mason jar, while the second verbalises a definition of the definiendum of interest. This sort of keyword-matching-oriented summarisation leads, from the QA viewpoint, to the retrieval of noisy information. Furthermore, web snippets can be unrelated or loosely related to the

definiendum:


IRRELEVANT WEB-SNIPPETS: Even though this surrogate outlines "Tom Hanks" as an actor, it primarily conveys noisy information. This class of web snippets is not preferable when dealing with prominent senses of the definiendum. What is more, attributes (e.g., the title) that are a good indicator for identifying related KBs can turn out to be misleading when recognising web snippets embracing descriptive information. Despite the fact that they carry unrelated content, some descriptive knowledge can still be found in the full document. An excellent

example is gossip blogs:

WEB-SNIPPETS ADVANTAGES: Nonetheless, web snippets are an effective way to avoid dealing with the various document formats existing on the Web, and a convenient way to avoid diverting time to downloading and processing full documents.

2.4 Finding Additional Resources through Specific Task Cues

For the purpose of surmounting the difficulties exhibited when dealing with web snippets, as well as the narrow coverage provided by KBs, definition QA systems have attempted several strategies to ameliorate the recall of descriptive knowledge across web snippets. An increment in descriptive sentences within web snippets brings about a growth in the quantity of full documents that carry descriptive phrases about the definiendum. In light of the results obtained by [Zhang et al., 2005], an increase in descriptive sentences would lead to an enhancement in the performance of definition QA systems. The underlying idea behind these strategies is to rewrite the query, or in this case the definiendum, in such a way that the new query biases the search engine in favour of web snippets that are very likely to express definitions of the definiendum.

KEYWORD "biography": The studies of [Xu et al., 2003, 2004] took advantage of Google for trawling the Web for biographies. They extended the definiendum with the word "biography"; take, for example, the search string "George Bush, biography". In this methodology, this task-specific cue tries to boost the retrieval of biography pages. This keyword is, of course, more helpful when aiming at definitions of persons than of diseases, for instance. A simple rule-based classifier was utilised later to filter out false hits. These rules included how many times a third-person pronoun (i.e., he, him, she and her) is used, whether the document contains a birth date, and so forth (a sketch of such a filter is given at the end of this section).

LEARNING SEARCH CLUES: In their work, [Chen et al., 2006] reformulated the query by simply adding task-specific keywords to the questions. More exactly, queries like "Who is ...?" were extended with the word "biography". In the event of "What is ...?" questions, the cues "refers to" and "is usually" were added. These clue words were learnt using a method akin to [Ravichandran and Hovy, 2002]. The new queries are sent to Google in order to gather the five terms that co-occur most often with the target. These five words are perceived as query expansion terms, and are later used to download the first 500 snippets returned by Google. However, what makes this technique less attractive is the fact that it is unclear how to automatically assign the right expansion keyword(s) in the first stage. Firstly, not all definition questions across search engine logs are posed in concert with the "What/Who is ...?" template (see sample queries in section 1.5 on page 6). Secondly, this assumes that the user knows whether he is defining a person or something else. Thirdly, it takes for granted that a definiendum targeted at a person cannot also aim at another class. In reality, this is not necessarily true; definienda can have numerous potential

senses; imagine a user looking for "Calvin Klein". This sort of query expansion is too specific to the TREC challenge, and along the same lines, so are the hand-crafted rules proposed by [Jijkoun et al., 2003].

OPTIMAL NUMBER OF WEB-SNIPPETS: Another aspect to compare is the number of surrogates downloaded by the distinct approaches. It is unclear what the optimal amount of snippets to use actually is. On the one hand, [Cui et al., 2004c] utilised 200 Google snippets, while [Chen et al., 2006] and other systems used many more. This is an open research question, and its answer seems to depend largely on the size of the collection of KBs that definition QA systems benefit from.
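The rule-based filter of [Xu et al., 2003, 2004] mentioned above (counting third-person pronouns and checking for a birth date) can be sketched as follows. This is only an illustration under assumed thresholds and an assumed date pattern; the published rules are not reproduced verbatim.

    import re

    # Both patterns are illustrative assumptions, not the published rules.
    PRONOUNS = re.compile(r"\b(he|him|his|she|her|hers)\b", re.IGNORECASE)
    BIRTH_DATE = re.compile(
        r"\b(born|b\.)\s+(on\s+)?"
        r"(January|February|March|April|May|June|July|August|September|"
        r"October|November|December)\s+\d{1,2},?\s+\d{4}",
        re.IGNORECASE)

    def looks_like_biography(text, min_pronouns=5):
        """Accept a page as a biography if it is pronoun-rich or names a birth date."""
        pronoun_hits = len(PRONOUNS.findall(text))
        has_birth_date = BIRTH_DATE.search(text) is not None
        return pronoun_hits >= min_pronouns or has_birth_date

    if __name__ == "__main__":
        page = ("George Bush was born on July 6, 1946. He served as governor of "
                "Texas. His administration ... He later ...")
        print(looks_like_biography(page))  # True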

2.5 Using Lexico-Syntactic Constructs

Most IR engines are devised to search for relevant documents [Salton and McGill, 1983]; they are thus not well suited to aiding the search for definitions [Xu et al., 2005]. Contrary to previous approaches, and apparently in congruence with the idea of [Gaizauskas et al., 2003], [Figueroa and Neumann, 2007, Figueroa et al., 2009] collected descriptive phrases from the Internet that are very likely to carry a definition by sequentially submitting ten purpose-built queries. These queries are aimed specifically at biasing the search engine in favour of web snippets that match lexico-syntactic constructs that often render definitions, and are consequently directed essentially at increasing the recognition of descriptive phrases within the web snippets. In this method, the first submission conforms to the initial query:

q1

COPULAR SEARCH QUERIES: The remaining queries are focused on various lexico-syntactic patterns that are often used for conveying open-domain definitions. Since copular constructions are very likely to contribute a lot of descriptive knowledge, this procedure generates the following three queries by synthesising these copular constructs:

q2 ∨ ∨ ∨

q3 ∨ ∨ ∨

q4 ∨ ∨ ∨

In these three search queries, the constructs are equally spread in consonance with their different tenses and numbers; that is, each query has two clues: one directed at the present tense and another at the past. In addition, each query has clues directed at both numbers, plural and singular. The tacit assumption here is that each query will be able to find descriptive information independently of whether the number of the definiendum is singular or plural, reducing the possibility of making fruitless submissions, and hence alleviating a possible detriment to recall. Tenses, for the same reason, were equally distributed among the queries, in order to target definienda in the present and in the past. However, it is worth duly noting here that the relation between the location of the definiendum in the timeline and the tense in which its respective descriptive information is expressed is decidedly weak. This weakness is due to the fact that descriptions are occasionally not bound to the location of their context in time (i.e., past, present or future). Furthermore, the date and the style of writing of the document become very influential in this respect. Clear cases of this are shown in the following phrases:


Under these observations, [Figueroa and Neumann, 2007, Figueroa et al., 2009] distributed these clues equally. The next two submissions are as follows:

q5 ∨ ∨

∨ ∨ ∨

q6 ∨ ∨ ∨

SYNONYMS, ALIASES AND APPOSITIVES: Since the search construct has a low occurrence across documents on the Internet [H. Joho and M. Sanderson, 2001] and often conveys a synonym (e.g., "myopia or nearsightedness"), these two kinds of patterns are merged into one query. Alternative names of people, organisations or abbreviations are seldom expressed in this way, but they are likely to match the other clauses within q6. As a result, combining both patterns allows this method to reduce the amount of web searches. As [Chen et al., 2006] also stressed, it is always crucial to seek a balance between download time and recall. The next three submissions are in congruence with:

q7 ∨ ∨ ∨

∧ ∨ ∨

q8 ∨ ∨

q9 ∨ ∨

WAS-BORN QUERY: Finally, the tenth query tries to retrieve snippets that match and . Homologously to q6, both patterns are fused into one query on the grounds that the former deals with definienda concerning persons and the latter focuses essentially on acronyms [H. Joho and M. Sanderson, 2000]. Hence, this technique avoids an unproductive retrieval without lessening the number of retrieved descriptive sentences:

q10 ∨

BIRTHDATE AND SENSE DISCRIMINATION: Here, it is worth remarking that the pattern would work depending on the interface offered by the search engine. An appealing aspect of this last lexico-syntactic regularity is that it can produce valuable data about the different senses:

Glaring inconsistencies or discrepancies are good indicators of ambiguity (i.e., born in Texas or in Kilmany). Conversely, minor discrepancies between dates and places might signal that there is uncertainty about the factual accounts. In juxtaposition, consider the following examples with respect to "Alexander Hamilton":


Corpus | Total Number of Questions | Baseline Answered Questions | LS-Search Answered Questions
TREC 2001 | 133 | 81 | 133
TREC 2003 | 50 | 38 | 50
CLEF 2004 | 86 | 67 | 78
CLEF 2005 | 185 | 160 | 173
CLEF 2006 | 152 | 102 | 136

Table 2.5: Performance of lexico-syntactic search (source: [Figueroa and Neumann, 2007]).

However, to fully take advantage of this pattern for discriminating senses would imply the design of a strategy capable of reconciling place names and standardising dates, as well as dealing with underspecification in terms of dates and places.

In their study, [Figueroa and Neumann, 2007] compared the coverage provided to a definition QA system by this strategy (LS-SEARCH) with a BASELINE (see table 2.5). In their experimental settings, LS-SEARCH fetched a maximum of thirty web snippets per query, altogether supplying a maximum of 300 surrogates. Accordingly, [Figueroa and Neumann, 2007] implemented a BASELINE that also downloaded 300 snippets by submitting "" to the Internet, resembling the approach of [Cui et al., 2004c, Qiu et al., 2007]. The main differences are that LS-SEARCH and BASELINE acquired 300 surrogates and utilised MSN Search as an interface to the Web, while [Cui et al., 2004c] gathered 200 and [Qiu et al., 2007] 100 web snippets, and both fetched the web snippets by means of Google.

COVERAGE OF LEXICO-SYNTACTIC CLUES: In table 2.5, the number of answered questions is the number of responses that embraced at least one correct nugget. These "Answered Questions" values were collected by manually verifying the output. Here, the CLEF data-sets consider all English translations from all languages. To put it more exactly, LS-SEARCH cooperated on discovering nuggets for all questions in (2), in contrast to [Cui et al., 2004c], who were aided in finding nuggets for solely 42 questions by using 200 web snippets along with KBs. Overall, LS-SEARCH covered at least 94% of the questions, whereas BASELINE EN-I covered at least 74%. This outcome puts forward the idea of using web snippets as a productive source of descriptive phrases; more precisely, this query rewriting assists definition QA systems in biasing search engines in favour of descriptive surrogates, regardless of the type of source. These snippets also offer the advantage of localising pieces of text that match well-known lexico-syntactic constructs which are likely to verbalise definitions. Other constructs, including "stands for" or "was founded", can still be utilised for increasing the recall of definitions.

ANSWER LENGTH: Another beneficial aspect of this rewriting approach is that it fetches longer sentences. Explicitly, the length of the retrieved sentences was 125.70 ± 44.21 and 109.74 ± 42.15 with and without white spaces, respectively. By sending only the definiendum, the achieved lengths are 118.168 ± 50.20 and 97.81 ± 41.80 with and without white spaces, respectively. A side effect is that these longer sentences mitigate the impact of truncations on web snippets.

Lastly, it is worth underlining that [Gaizauskas et al., 2003] capitalised on fifty patterns to locate unique relevant documents on the Internet. Unfortunately, they did not stipulate which clues were used, how they were utilised, or the benefit in coverage they bring forth. It is also worth stressing that they made use of this procedure for finding descriptive sentences within the most propitious documents.
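To make the query construction of this section concrete, the sketch below assembles copular OR-queries in the spirit of q2 to q4, spreading tenses and grammatical numbers evenly over the submissions. The concrete clue inventory and the grouping are illustrative assumptions; the exact ten queries of [Figueroa and Neumann, 2007, Figueroa et al., 2009] are not reproduced here.

    from itertools import zip_longest

    # Copular clue inventory: (tense, number, pattern). Purely illustrative.
    COPULAR_CLUES = [
        ("present", "singular", "{d} is a"),
        ("present", "singular", "{d} is the"),
        ("past", "singular", "{d} was a"),
        ("past", "singular", "{d} was the"),
        ("present", "plural", "{d} are a"),
        ("present", "plural", "{d} are the"),
        ("past", "plural", "{d} were a"),
        ("past", "plural", "{d} were the"),
    ]

    def build_copular_queries(definiendum):
        """Build OR-queries so that each query mixes both tenses and both numbers."""
        # Bucket the clues by (tense, number) and take one clue from each bucket
        # per query, so every submission can match the definiendum regardless of
        # its grammatical number and of the tense used in the description.
        buckets = {}
        for tense, number, pattern in COPULAR_CLUES:
            buckets.setdefault((tense, number), []).append(pattern)
        queries = []
        for group in zip_longest(*buckets.values()):
            disjuncts = ['"%s"' % pat.format(d=definiendum)
                         for pat in group if pat is not None]
            queries.append(" OR ".join(disjuncts))
        return queries

    if __name__ == "__main__":
        for query in build_copular_queries("Allen Iverson"):
            print(query)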

2.5.1 The Grammatical Number of the Definiendum Guessed Beforehand?

The prior technique suffers from the following drawback: principally, [Figueroa, 2008c] detected that the static nature of this query rewriting results in a drop in recall. They observed, more precisely, that clauses such as "Allen Iverson were a" and "Allen Iverson are a" bring about misleading sentences when the grammatical number of the definiendum is singular. In like manner, this phenomenon also emerges when definition QA systems send the search engine constructs including "Caribbean islands is a" and "Caribbean islands is the" while dealing with definienda that are plural in grammatical number. As examples, consider the next surrogates retrieved by Yahoo! Search:

TOPIC SHIFT: These three illustrative web snippets signal the chief obstacle here. A disagreement between the definiendum and the grammatical number of the matching lexico-syntactic clue can indicate that the subject of the sentence has shifted to another topic. As a logical consequence, the main focus of attention might be unrelated or loosely related to the definiendum, causing the sentences to convey indirect or non-definitional knowledge about it. In the previous three cases, the phrases serving as focus shifters are:

This linguistic phenomenon is especially relevant to definition QA systems that crawl the Web for descriptive information, because most commercial search engines, at the time of writing, do not supply a way to incorporate this type of restriction into the search query; for instance, filtering out words placed between the beginning of a sentence and the definiendum, or allowing a maximum number of words between the beginning of the sentence and the definiendum. It is, nevertheless, still unclear how this sort of attribute could be profitably exploited to boost the recall of promising descriptive sentences. Further, the usefulness of these kinds of features depends heavily on the amount of knowledge existing on the Web about the definiendum, because if there is not a lot of knowledge, then snippets such as the following become very important to recognise:


COLLECTIVE NOUNS: In this web snippet, the reader learns that the "Caribbean islands" comprise at least three islands and that one of them is named Jamaica. Simply put, the importance of detecting this piece of information depends on the amount of text found about the "Caribbean islands", because a larger number of texts increases the probability of finding the same knowledge paraphrased in a way that is easier to recognise. But things are not black or white: collective nouns, or instances of collective nouns that refer to groups of people, such as sports teams, lie in the middle ground between both groups. The next two snippets verbalising definitions regarding "Manchester United" exemplify this:

This "number mismatch" is actually a natural and logical property of human language, reflecting a subtle shift in the thought behind the underlying words.

GRAMMATICAL NUMBER DEDUCTION: Incidentally, an array of unpromising lexico-syntactic patterns can end up in the same query and hence bring forth an unproductive retrieval, diminishing the number of descriptive utterances. Nonetheless, these clauses observe a local lexico-syntactic dependency with the definiendum: specifically, they are unlikely to allow accompanying words in between. This is an important fact, because off-line n-gram counts supplied by Google can be used to transform this static query construction into a more dynamic one. In the working example regarding "Allen Iverson", an excerpt of the Google 4-gram counts is as follows:

Search clause concerning "Allen Iverson" | Frequency
Allen Iverson is a | 209
Allen Iverson is an | 68
Allen Iverson is the | 425
Allen Iverson was a | 57
Allen Iverson was the | 101

Table 2.6: Examples of search clauses regarding “Allen Iverson".

The first beneficial aspect of Google n-grams is that, in some cases, the grammatical number can be inferred. In particular, in the case of "Allen Iverson", singular lexico-syntactic clues are most propitious. However, it is not always possible to draw a clear distinction. An illustrative example is "fractals":

A dynamic strategy was then designed which selects a grammatical number whenever more than three clues conforming to one grammatical number exist, and none of the other. The second favourable aspect is that the frequencies give hints about the hierarchy within the lexico-syntactic patterns. This dynamic method takes advantage of this hierarchy for configuring the ten search strings. First, the queries q7 and q10 are intermixed into one q7′. This search query is composed of the following clues:

Accordingly, q7′ consists merely of the clauses that can be found in the Google n-grams. If no construct can be found, q7′ is set to ∅. In every case, q10′ remains as ∅. It is worth pointing out that the term "stands for" replaces the parentheses in q10′. Second, q5′ = q5, q6′ = q6, and q8′ = q8 as well as q9′ = q9. Additionally, q1 is set to ∅. Third, the clauses included in the queries q2 and q3, as well as q4, are dynamically sorted across the available queries, as highlighted in table 2.7.

If q7′ = ∅: q1′: " I1", q2′: " I2", q3′: " I3", q4′: " I4", q5′: " I5", q7′: " I6"
If q7′ ≠ ∅: q1′: " I1", q2′: " I2", q3′: " I3", q4′: " I4", q5′: " I5" ∨ " I6"

Table 2.7: Dynamic queries (grammatical number known).

Here, I1 and I6 coincide with the most and least frequent lexico-syntactic regularities according to the Google frequency counts, respectively. In the event that the grammatical number cannot be distinguished, the queries are as follows:

q1′ ∨ ∨

q2′ ∨

q3′ ∨ ∨

q4′ ∨

q10 ∨

In the case q10′ = ∅, the following queries are reformulated:

q1′ ∨

q3′ ∨

q7 ∨
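A minimal sketch of the grammatical-number deduction that drives this dynamic reformulation is given below. It assumes access to off-line n-gram counts like those excerpted in table 2.6; the clue inventories are illustrative, and the decision rule follows the one stated above (more than three attested clues of one grammatical number and none of the other).

    from typing import Dict, Optional

    # Illustrative clue inventories; a real system would use its full pattern set.
    SINGULAR_CLUES = ("is a", "is an", "is the", "was a", "was an", "was the")
    PLURAL_CLUES = ("are a", "are an", "are the", "were a", "were an", "were the")

    def guess_number(definiendum: str, ngram_counts: Dict[str, int],
                     min_clues: int = 3) -> Optional[str]:
        """Return 'singular', 'plural', or None when the evidence is inconclusive."""
        singular = sum(1 for clue in SINGULAR_CLUES
                       if ngram_counts.get("%s %s" % (definiendum, clue), 0) > 0)
        plural = sum(1 for clue in PLURAL_CLUES
                     if ngram_counts.get("%s %s" % (definiendum, clue), 0) > 0)
        # Commit only when more than min_clues clues of one number are attested
        # and none of the other; otherwise stay undecided (e.g. "fractals").
        if singular > min_clues and plural == 0:
            return "singular"
        if plural > min_clues and singular == 0:
            return "plural"
        return None

    if __name__ == "__main__":
        counts = {"Allen Iverson is a": 209, "Allen Iverson is an": 68,
                  "Allen Iverson is the": 425, "Allen Iverson was a": 57,
                  "Allen Iverson was the": 101}
        print(guess_number("Allen Iverson", counts))  # singular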

In summary, the goal of [Figueroa, 2008c] is to enhance the efficiency of the former search methodology, whilst keeping the same number of queries and, whenever possible, without missing essential nuggets.

GRAMMATICAL NUMBER EFFECT: The work of [Figueroa, 2008c] compared both query reformulation procedures by means of the fifty definition questions supplied by TREC 2003. As a means of assessing both construction methods, a prior definition QA system was used and fed with the sentences found by each technique. The underlying idea behind this approach is that whenever a significant difference between both strategies exists, this definition QA system would experience some tangible improvement in its performance. As a matter of fact, an alternative

way of evaluating would be to compare the number of distinct nuggets in each retrieval, that is, to juxtapose the recall of both retrievals. At any rate, this sort of assessment does not take into consideration the misleading information that one technique might have avoided fetching, nor the possible reduction in the redundancy of the retrieved sentences. Precisely, the dynamic reformulation ameliorated the performance for 29 queries while deteriorating it for 17 questions. Figure 2.4 highlights the F(5) score per question for both strategies. In general, the static query rewriting backs the definition QA system with an average F(5) score of 0.5472, while the dynamic query reformulation helped to better the average value to 0.5792 (5.8%). In addition, it is worth remarking that this improvement was reached without raising the number of submitted queries. In a nutshell, the results show that prior knowledge of the grammatical number of the definiendum can lead to an enhancement in the overall performance of a definition QA system.
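For reference, the F(5) figures discussed here are nugget-based F-scores (cf. section 1.7). The sketch below assumes the standard TREC formulation, in which recall is computed over vital nuggets only, which is why a response containing solely "okay" nuggets scores zero, and precision is derived from a length allowance of 100 non-whitespace characters per matched vital or okay nugget.

    def nugget_f(beta, vital_returned, vital_total, okay_returned, response_length):
        """Nugget-based F(beta); response_length counts non-whitespace characters."""
        if vital_total == 0 or vital_returned + okay_returned == 0:
            return 0.0
        recall = vital_returned / vital_total
        allowance = 100 * (vital_returned + okay_returned)
        if response_length < allowance:
            precision = 1.0
        else:
            precision = 1.0 - (response_length - allowance) / response_length
        if recall == 0.0:
            return 0.0  # no vital nugget found: the score collapses to zero
        return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)

    if __name__ == "__main__":
        # A response with only "okay" nuggets scores zero regardless of its length.
        print(nugget_f(5, vital_returned=0, vital_total=4,
                       okay_returned=2, response_length=300))
        # A response covering three of four vital nuggets within its allowance.
        print(nugget_f(5, vital_returned=3, vital_total=4,
                       okay_returned=1, response_length=350))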

Figure 2.4: Comparison between F(5) scores achieved for each definiendum in the TREC 2003 question-set by the static and dynamic query rewriting.

More Search Engines?

MULTIPLE SEARCH ENGINES: Following the suggestion of [Chen et al., 2006], [Figueroa, 2008c] tested the influence of using several search engines on the performance. For this purpose, they sent a copy of the queries generated by the dynamic query rewriting to Yahoo! Search. This idea makes perfect sense, because distinct search engines index different parts of the Internet, and accordingly their indexes can differ substantially. Since [Figueroa, 2008c] sifts nuggets directly from web snippets, the differences in the algorithms that compute the surrogates, namely regarding truncations, turned out to be critical. Paradoxically, this extra search engine only slightly boosted the dynamic query rewriting, modestly improving the average F(5) value from 0.5792 to 0.5842 (0.8%). To be more specific, it enhanced the performance for 27 questions, whereas it worsened it for 20 queries. Figure 2.5 contrasts the F(5) score per question for both approaches. All in all, a marginal rise was achieved at the expense of sending ten auxiliary queries to the additional search engine, underscoring the 5.8% obtained by guessing the grammatical number in conjunction with ten queries and one search engine.

Figure 2.5: Comparison between F(5) scores achieved by the incorporation of an extra search engine (TREC 2003 question set).

Additional Remarks on the Assessment

The static and dynamic query rewritings scored zero for four distinct definienda, despite the "okay" nuggets found by both systems. As a matter of fact, if a system does not discover any nugget assessed as "vital", it finishes with an F(5) value equal to zero. The dynamic reformulation, for instance, scored zero for three questions; consider, in particular, the following output in relation to "Albert Ghiorso":

The “okay" nugget is underlined that matches the list of the assessors provided by TREC

2003:


As [Hildebrandt et al., 2004] also noticed, "okay" nuggets, like nuclear physicists/experimentalist, can easily be perceived as "vital". In support of this notion, consider the abstracts supplied by Wikipedia as a third-party judgement; at the time of writing, one finds:

NUGGETS IMPORTANCE: Furthermore, some pertinent text fragments, including veteran Berkeley researcher, are not considered, enlarging the response and thus decreasing the F(5) score. [Figueroa, 2008c] therefore hypothesised that a nugget can be seen as "vital" or "okay" in accordance with how often its type (birthplace, birthdate, occupation, outstanding achievement) occurs across abstracts and/or bodies of online encyclopedias such as Wikipedia. Accordingly, [Figueroa, 2008c] deemed this sort of type-oriented evaluation more appropriate for web-based definition QA systems, because it is impossible to maintain an always up-to-date gold standard for the Web. Only for one definiendum were all strategies unable to discover any nugget in the assessors' list: "Abu Sayaf". The reason is uncovered when the following frequencies in the Google n-grams are checked:

In this situation, the spelling of the definiendum in the query is unlikely to occur on the Internet, causing an F(5) equal to zero (the reader can also check the case of "Andrea Boccelli" versus "Andrea Bocelli", and the reference [Blair-Goldensohn et al., 2003]). Conversely, when the methods process "Abu Sayyaf", the scores reached by each technique are depicted in table 2.8.

Strategy | "Abu Sayyaf" | Average
static | 0.844 | 0.5472 → 0.564
dynamic | 0.8794 | 0.5792 → 0.5967
dynamic + 2 engines | 0.8959 | 0.5842 → 0.602

Table 2.8: Impact of misspellings on the F(5) scores (definiendum:“Abu Sayyaf ").

ADDITIONAL SENSES IMPACT: Another complicated problem is that the assessors' list is aimed predominantly at one possible sense of the definiendum. Ergo, discovered descriptive knowledge concerning ancillary senses, akin to the unconsidered nuggets, brings about a diminution in the F(5) value. To exemplify this, consider a descriptive sentence found by one of the methods regarding "Nostradamus":

Indeed, ambiguous terms are highly frequent. For example, Wikipedia comprises more than 19,000 different disambiguation pages. In this case, the assessors' list solely accounts for the reference to the French astrologer/prophet. When sentences concerning other senses are manually removed, the F(5) values for this concept grow as sketched in table 2.9. Obviously, a more noticeable difference arises for definienda with more senses, such as "Absalom".

Strategy | "Nostradamus"
static | 0.5871 → 0.5936
dynamic | 0.9028 → 0.9182
dynamic + 2 engines | 0.8977 → 0.9167

Table 2.9: Impact of ancillary senses on the F(5) scores (definiendum:“Nostradamus").

2.6 Enriching Query Rewriting with Wikipedia Knowledge

The prior sections went over strategies geared towards boosting the recall of descriptive phrases within web snippets, and accordingly the recall of documents containing definitions. So far, the described techniques relied on Google n-grams as a support for reformulating the query. This section, conversely, deals at length with the query rewriting approaches adopted by [Figueroa, 2008a]. These techniques capitalise on Wikipedia resources for coping with some of the issues presented in the preceding section: misspellings, alias resolution and focused query rewriting. Other techniques that might be instrumental include [Bilenko et al., 2003, Cohen et al., 2003].

2.6.1 Alias Resolution & Misspellings

Generally speaking, Wikipedia consists of various sorts of pages, including redirection, disambiguation, definition, list, and category pages. Distinctly, redirection pages embody no descriptive content, but they link an input string with the respective definition page.

REDIRECTIONS: [Figueroa, 2008a] perceived these input strings as rewritings of the main concept. To neatly illustrate, the redirection page of "Clive S Lewis" connects this name variation to the definition page of "C. S. Lewis". These mappings were used essentially for building an off-line repository of name rewritings (aliases):

< >

< >

< >

With regard to the working examples given in the prior section, the next pairs were discovered in the redirection pages:

< >

< >

< >

These pairs underline the utility of redirection pages for tackling misspellings and alternative names head-on.

FIRST LINE: This database is additionally enriched with the alternative name rewritings conveyed in first definition sentences. Take the following case corresponding to "C. S. Lewis":

Sentences bearing variations of names are discriminated off-line on the grounds of predefined lexico-syntactic clues. These clauses were determined by inspecting recurrent n-grams within these sentences that trigger name aliases. Table 2.10 emphasises the constructs identified by applying these n-gram patterns at the surface level. In the working example, the next mapping was obtained:

Lexico-syntactic clues: a.k.a., aka, nicknamed, ", called", ", known as", ", or the", abbreviation for, abbreviation of, also called, also known as, also spelled, also written, also written as, an acronym for, best known as, better known as, colloquially known as, commonly abbreviated, commonly called, commonly known as, commonly referred to as, commonly written as, formerly known as, generally called, generally known as, generally written as, informally known as, initially known as, officially called, officially known as, officially written as, often abbreviated, or more commonly, or more precisely, or simply, otherwise known as, previously called, previously known as, previously written as, referred to as, referred to simply as, sometimes called, sometimes known as, sometimes spelled, sometimes spelt, sometimes written as, still known as, widely known as

Table 2.10: Most frequent n-grams in Wikipedia that signal alternative names.

< >

Here, the underlying assumption is that the first line renders details of the corresponding main concept.

TRANSLATIONS: Definition pages also provide another contributory source of rewritings: translations. For example, the following are variations extracted from the translations of C. S. Lewis's book "Mere Christianity": "Christentum schlechthin" in German, and "Mero Cristianismo" in Spanish:

< >

< >

To exemplify, these translations help infer that the following two names refer to the same entity in the next surrogate:

Simply stated, this off-line repository comprises 2,418,886 rewritings taken from redirection pages, 1,797,492 pairs from first-line definitions, and 3,353,663 from translations. Finding the aliases of a particular definiendum then consists of looking up the right entry in this database.
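A minimal sketch of how such an alias repository can be assembled from redirections and first definition lines is shown below. The input structures (lists of (variant, target) redirect pairs and of (title, first line) pairs) stand in for a parsed Wikipedia dump, and the cue pattern only covers a handful of the clues in table 2.10.

    import re
    from collections import defaultdict

    # A handful of the cues of table 2.10; the capture group is deliberately crude.
    ALIAS_CUES = r"(?:also known as|a\.k\.a\.|better known as|commonly known as|nicknamed)"
    ALIAS_PATTERN = re.compile(ALIAS_CUES + r"\s+([A-Z][\w.\- ]{2,40})")

    def build_alias_repository(redirects, first_lines):
        """Return a mapping: canonical title -> set of alternative spellings."""
        aliases = defaultdict(set)
        # Redirection pages embody no content; they only map an input string to a page.
        for variant, target in redirects:
            aliases[target].add(variant)
        # First definition lines often verbalise further aliases after cue phrases.
        for title, first_line in first_lines:
            for match in ALIAS_PATTERN.findall(first_line):
                aliases[title].add(match.strip())
        return aliases

    if __name__ == "__main__":
        redirects = [("Clive S Lewis", "C. S. Lewis"), ("CS Lewis", "C. S. Lewis")]
        first_lines = [("C. S. Lewis",
                        "Clive Staples Lewis, commonly known as Jack Lewis, was a writer.")]
        repository = build_alias_repository(redirects, first_lines)
        print(sorted(repository["C. S. Lewis"]))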

Selecting an alternative definiendum

Web definition QA systems are occasionally unable to find descriptive information because users misspell the definiendum, or because it was correctly entered but is improbable to occur on the Web.

WORDNET SYNSETS: Correspondingly, [Wu et al., 2004] dealt with this by making use of synsets in WordNet to turn the definiendum into an array of definienda.

They illustrated this with the definiendum "Khmer Rouge", which can be expanded into: "Khmer Rouge", "KR", "Party of Democratic Kampuchea", and "Communist Party of Kampuchea". For their particular purposes, this expansion benefits the retrieval from both the AQUAINT corpus and KBs. In any event, what makes this procedure less attractive is the narrow coverage given by WordNet. Figure 2.3 juxtaposes this with the coverage supplied by Wikipedia. This substantial difference encouraged [Figueroa, 2008a] to examine candidate aliases within the alias repository.

ALIAS SELECTION: Since the resources utilised for extracting aliases are not perfectly and equally trustworthy, [Figueroa, 2008a] singled out candidates discovered in the first definition lines, and whenever nothing was found there, their approach exploited variations extracted from redirections, thereby ensuring that the most dependable aliases are considered first. Due to the query length restrictions imposed by search engines, only alias candidates written with two or three words were chosen. The more promising candidate aliases are then selected as follows:

1. If the submitted definiendum is formed of three words, aliases that account for the removal of one term are picked. For instance, "Angela Merkel" would be considered if the input is "Angela Dorothea Merkel".

2. Aliases bearing the same number of words, such as "Nicolas Sarkozy"⇔"Nicolas Sarcozy", are contemplated.

3. If the alternative name resolves or corresponds to an acronym.

4. Only aliases that contain letters and numbers, a hyphen, spaces and/or an ampersand, are taken into account.

Whenever candidates were found, [Figueroa, 2008a] sent the search engine five search queries per alias candidate in order to select the right replacement. These five purpose-built queries were targeted at the copular lexico-syntactic clues detailed in the preceding section, fetching a maximum of thirty snippets per submission. The underlying assumption here is that clearer evidence (more descriptive phrases) will come to light in the case of the most propitious alias. Ideally, in this step, the user could provide the best variations, but this would inevitably entail an intermediate step where the user is asked accordingly. Lastly, it is worth remarking that [Schlaefer et al., 2007] tackled variations of organisation names by generating queries with and without determiners, and by producing an acronym for each name. Some organisation names and their respective acronyms can certainly be found in the alias repository. In addition, the acronyms of some organisations cannot be straightforwardly guessed from their names, because they have their origin in a foreign language or are irregularly derived. Consider, for instance, the pair "Stabilisation Force" and "SFOR", or, perhaps more meaningfully, "Text REtrieval Conference" (TREC).
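The candidate selection just described can be sketched as follows; the helper names and the exact operationalisation of the four rules are illustrative assumptions rather than the original implementation.

    import re

    SAFE_CHARS = re.compile(r"^[A-Za-z0-9&\- ]+$")

    def is_acronym_of(alias, definiendum):
        """True when the alias spells out the initials of the definiendum."""
        initials = "".join(word[0] for word in definiendum.split()).upper()
        return alias.replace(".", "").replace(" ", "").upper() == initials

    def acceptable_alias(alias, definiendum):
        if not SAFE_CHARS.match(alias):
            return False                       # rule 4: plain characters only
        a_tokens, d_tokens = alias.split(), definiendum.split()
        if is_acronym_of(alias, definiendum):
            return True                        # rule 3: the alias is an acronym
        if not 2 <= len(a_tokens) <= 3:
            return False                       # query-length restriction
        if len(d_tokens) == 3 and len(a_tokens) == 2 and set(a_tokens) <= set(d_tokens):
            return True                        # rule 1: one term dropped
        return len(a_tokens) == len(d_tokens)  # rule 2: same number of words

    def select_candidates(definiendum, first_line_aliases, redirect_aliases):
        """Most dependable source first: first-line aliases, then redirections."""
        for pool in (first_line_aliases, redirect_aliases):
            kept = [alias for alias in pool if acceptable_alias(alias, definiendum)]
            if kept:
                return kept
        return []

    if __name__ == "__main__":
        print(select_candidates("Angela Dorothea Merkel",
                                ["Angela Merkel"], ["A. Merkel"]))  # ['Angela Merkel']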

2.6.2 Definition Focused Search

As a means of enhancing the precision of the hits fetched by the search engine, [Figueroa, 2008a] harvested search clauses from Google 5-grams by examining n-grams starting with the definiendum. These 5-grams are seen as search cues, which are filtered by checking whether or not their extensions are recurrent across Wikipedia abstracts. Search constructs are then ranked in agreement with their Google 5-gram frequency. Some examples are the search clauses with respect to "Angela Merkel":

Search clause concerning "Angela Merkel" | Frequency
Angela Merkel , the conservative | 112
Angela Merkel , the leader | 319
Angela Merkel , the opposition | 53
Angela Merkel , who makes | 57
Angela Merkel , who took | 48

Table 2.11: Examples of search clauses regarding “Angela Merkel".

By default, in the spirit of the query reformulation presented in section 2.5, this focused search boosts the retrieval of snippets carrying descriptive phrases by replacing the clues of the following queries:

q6 δ ∨ δ ∨ δ ∨ δ

q8 δ ∨ δ ∨ δ

q9 δ ∨ δ ∨ δ

q10 δ ∨ δ ∨ δ ∨ δ

The first step in the substitution is verifying whether or not more specific constructs for q6 and q8 exist. If no clue is found, these queries are sent as they are. For instance,

q6 ∨ ∨

In contrast, q9 and q10 are modified with the highest-ranked clauses that were not subsumed by the previous replacements, though substitutions geared towards the original clauses are preferred. Consider the following case:

q9 ∨

q10 ∨ ∨

The idea behind this modification is to focus directly on more specific clues that are very likely to verbalise definitions. Evidently, queries are constrained by the length limit imposed by the search engine.
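A minimal sketch of the clause harvesting behind this focused search is given below: it keeps the Google 5-grams that start with the definiendum, discards those whose continuation is not recurrent across Wikipedia abstracts, and ranks the survivors by 5-gram frequency. The in-memory dictionaries, the recurrence threshold and the cut-off are illustrative assumptions.

    def harvest_search_clauses(definiendum, five_grams, abstract_ngram_counts,
                               min_abstract_freq=2, top_k=5):
        """Return up to top_k focused search clauses for the definiendum, best first."""
        prefix = definiendum + " "
        ranked = []
        for gram, freq in five_grams.items():
            if not gram.startswith(prefix):
                continue
            extension = gram[len(prefix):]
            # Keep the clause only if its extension also recurs in Wikipedia abstracts.
            if abstract_ngram_counts.get(extension, 0) >= min_abstract_freq:
                ranked.append((freq, gram))
        ranked.sort(reverse=True)
        return [gram for _, gram in ranked[:top_k]]

    if __name__ == "__main__":
        five_grams = {"Angela Merkel , the leader": 319,
                      "Angela Merkel , the conservative": 112,
                      "Angela Merkel , who makes": 57,
                      "Angela Merkel for president of": 40}
        abstract_counts = {", the leader": 12, ", the conservative": 5, ", who makes": 3}
        print(harvest_search_clauses("Angela Merkel", five_grams, abstract_counts))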

2.6.3 Experiments

Figure 2.6 highlights the ratio of the number of nuggets fetched by this technique to the nuggets retrieved by the method of the preceding section. For the TREC 2003 question set, the average value of this ratio was 1.15 ± 0.46 (1.14 ± 0.34 and 1.28 ± 0.72 for TREC 2004 and 2005, respectively). This enhancement was due to 29 questions (58%), for which this strategy retrieved a higher number of different nuggets, whereas in twelve cases (24%) it fetched fewer nuggets. In nine (18%) of the questions, there was no tangible improvement or deterioration. The interesting point in figure 2.6 is that the three most remarkable enhancements stem from the methodology of finding alternative aliases. Given these outcomes, it can be concluded that the repository of aliases is especially helpful for the robustness of this class of systems. However, this enhancement comes at the expense of sending auxiliary queries to the Internet, thereby delaying the response time to the user, and in some cases a misleading candidate can bring about snippets corresponding to other senses. One extreme case is the definiendum "Artificial Intelligence". When consulting the alias repository, one

can get "AI" as a candidate, which is utilised for numerous purposes, creating the mistaken impression that it is the right candidate. Nevertheless, it can be envisioned that, whenever possible, an intermediate phase consisting of asking the user to validate the alternatives, in place of querying the Internet, would be more appropriate.

Figure 2.6: Comparison between F(5) scores reached by each strategy for each definiendum in the TREC 2003 question-set.

For the fourth-best definiendum, the following two queries boosted the recall of descriptive phrases:

q9 ∨ ∨

q10 ∨ ∨

Some representative examples of the array of web snippets fetched by these two queries are as follows:


On the contrary, in the case of "Ari Fleischer", the drop in performance resulted from selected clauses that were semantically similar, and consequently brought about the retrieval of descriptive phrases conveying similar nuggets. In particular, a quite fruitless retrieval resulted from the following query:

q6 ∨ ∨

Basically, the next two illustrative snippets summarise the retrieved set:

On the whole, extending the query by profiting from Wikipedia resources is a promising idea. The drop in recall shown here can be allayed by adding the new queries instead of replacing existing ones. This depends, however, on the sort of application and the amount of time the user is willing to wait.

2.7 Searching in other Languages: Spanish

Fundamentally, surface patterns for English have been studied broadly, whereas regularities for other languages have been systematically explored only in the context of the CLEF campaigns. Until 2005, CLEF focused exclusively on definition questions targeted at abbreviations and the positions of persons [Vallin et al., 2005, Magnini et al., 2006]. These surface patterns were therefore specialised for recognising this specific sort of descriptive knowledge. In contrast, systems in TREC are encouraged to extract as much useful descriptive information as possible about the definiendum [Hildebrandt et al., 2004]. Ergo, these surface patterns provide comparatively broader coverage than the constructs known for other languages. In the case of surface patterns for Spanish, two additional issues complicate the identification of descriptive content on the Web. Firstly, the regularities are premised largely on punctuation signs [Montes-y-Gómez et al., 2005] and closed-class words [Denicia-Carral et al., 2006], which are usually ignored by some search engines. Secondly, these punctuation signs and closed-class words tend to be separated by a large span of text. To reinforce this point, consider the following two examples:

Queries:
q1: ""
q2: ", fue un" ∨ " son lo" ∨ ", la"
q3: " fue la" ∨ " es el" ∨ " son el"
q4: " que" ∨ " son las" ∨ ", lo"
q5: " es un" ∨ " ha llegado a ser" ∨ " son la" ∨ " fueron las"
q6: " fue el" ∨ " son unas" ∨ ", uno" ∨ " ha sido la"
q7: " quien" ∨ " los cuales" ∨ ", un" ∨ " son una"
q8: " se ha transformado" ∨ " es lo" ∨ " fue fundado"
q9: ", el" ∨ " son unos" ∨ " fue una" ∨ " fue fundada"
q10: " es la" ∨ " llego a ser" ∨ " ha sido el" ∨ " son un"
q11: " es una" ∨ " fue lo" ∨ " ha sido un"
q12: " se transformo" ∨ " fue uno" ∨ ", las"
q13: " la cual" ∨ ", una" ∨ " ha sido una"
q14: " es uno" ∨ " nacio" ∨ " el cual" ∨ ", los"

Table 2.12: Queries for searching definitions in Spanish.

The last pattern, for instance, matches sentences such as "El presidente de España, Jose Luis Zapatero, se..." The snippets acquired by the respective query rewriting "El" ∧ ", , se" are unlikely to yield definitions, and additionally, portions of the large span of text between the definiendum and the closed-class word "El" can be replaced with an intentional break (often denoted by "...") by the search engine.

TRANSLATIONS: All things considered, the study of [Figueroa and Neumann, 2007] extended this set of regularities by taking into account translations of the English lexico-syntactic constructs into Spanish. These translations are outlined in table 2.12, and they cooperate on overcoming the special difficulties exhibited by the patterns of [Juárez-Gonzalez et al., 2006, Denicia-Carral et al., 2006]. To be more precise, they form a compact block of consecutive words that disallows this problematic span of text. These fourteen queries bias the retrieval in favour of web snippets that match definition lexico-syntactic regularities that often render descriptions in Spanish.

COVERAGE: Plainly speaking, the more successful this query reformulation is, the larger the recall of web snippets, and hence of documents containing definitions. This query rewriting aided in answering 32 and 22 of the CLEF 2005 and 2006 questions, respectively. However, the runs submitted by the best two systems in CLEF 2005 answered 40 out of the 50 definition questions [Vallin et al., 2005, Montes-y-Gómez et al., 2005]. Nonetheless, the third-best system solely responded to 26 questions. It is fair to highlight here that these answers were

distinguished on the Web, in contrast to the CLEF systems; this also makes the nuggets in the ground truth dependent on the EFE corpus. For this reason, the results are positively encouraging.

Additionally, the best system in CLEF 2006 responded to 35 out of the 42 definition questions, and this method supported finding answers to 22 out of the 35 questions answered by this best system. Unfortunately, the CLEF 2006 gold standard provides only one nugget for each of these 35 questions.

DESCRIPTION LENGTH IN SPANISH: Furthermore, in [Figueroa, 2008b], this procedure cooperated on discovering nuggets for 17 out of 19 concepts distilled from the CLEF 2008 question set. It is also worth remarking that this query rewriting technique fetched longer sentences: [Figueroa and Neumann, 2007] compared the length of the sentences obtained by this strategy (135.78 ± 45.21 and 113.70 ± 37.97 with and without white spaces, respectively) with the length of the phrases gathered by sending the definiendum (104.98 ± 36.43 and 85.88 ± 29.87 with and without white spaces, respectively). This improvement in terms of length helps to cushion the effects of truncations on web snippets.

The reason why a noticeable growth in the recall of descriptive information enhances the chance of correctly answering a definition question is two-fold: (a) it raises the probability of matching the context of a model previously learnt from annotated examples, including those taken from online encyclopedias and dictionaries, and consequently (b) it facilitates the selection of the most relevant, reliable and descriptive answers.

In actuality, the success of this approach lies in the size of the target corpus, in this case the Spanish web. A larger corpus tends to provide broader coverage, and is therefore likely to help QA systems leave fewer questions unanswered. But, more importantly, a considerably larger corpus yields large-scale redundancy. It is worth pointing out that redundancy does not mean duplicate information, but rather distinct paraphrases of the same underlying ideas. QA systems can undoubtedly benefit from paraphrases, because they markedly increase the probability of matching query terms and purpose-built patterns. As a consequence, they considerably boost the chance of finding more and fuller answers. Unfortunately, there is a big difference between the number of web documents in English and Spanish. As a very rough rule of thumb, this difference can be estimated by submitting some lexico-syntactic clues to the Internet in order to get their web frequency counts. Table 2.13 emphasises this difference.

English clues: is the (6,840,000,000), is a (8,720,000,000), is an (2,440,000,000); total 18,000,000,000.
Spanish clues: es un (323,000,000), es una (172,000,000), es la (162,000,000), es el (150,000,000), es uno (36,700,000), plus es las (1,710,000); total 845,410,000.

Table 2.13: Comparison between frequencies of definition clues.

CORPUS SIZE: These rough estimates indicate that the size of the corpus falls drastically, from about 18 billion matches in English to roughly 1 billion in Spanish. This comparatively small number of matches forces definition QA systems to divert time and effort away from extracting answers in order to perform an exhaustive search for propitious documents. In general, when answering definition questions in English, a few queries directed at a small array of lexico-syntactic constructs suffice to obtain a high recall of web snippets, and hence of documents that carry descriptive information about the definiendum. Since the amount of Spanish web documents is much smaller, the probability of matching these lexico-syntactic constructs dramatically decreases. The system is, for this reason, compelled to submit a larger number of queries to the search engine, as a means to sharply increase the probability of obtaining diverse and sufficient knowledge to satisfactorily answer the question.

MORPHOLOGICAL COMPLEXITY: There are also linguistic aspects that make the search process more demanding. Most nouns in modern English lack grammatical gender, and noun phrases are introduced by just two indefinite articles, "a" and "an", and one definite article, "the", which is also used for plural forms. Therefore, a query such as the following would be enough to retrieve many descriptive nouns:

q∗ = < “definiendum is a” > ∨ < “definiendum is an” > ∨ < “definiendum is the” >

Conversely, Spanish uses three grammatical genders (feminine, masculine and neuter), which are signalled by six definite and indefinite articles. Furthermore, Spanish utilises four additional morphological forms of these articles for agreement with the number of the noun phrase they modify, that is, for indicating plural nouns. Altogether, this increases the number of articles from three to ten. The reader can verify this increase by inspecting the queries depicted in table 2.12. This growth in morphological complexity demands extra effort in the search process. More specifically, a richer noun morphology leads to more lexico-syntactic clues, which means more search clauses, and by the same token, a longer retrieval time. Lastly, amalgamating this strategy with the utilisation of Google n-grams and an alias repository is not as fruitful for Spanish as it is for English. Firstly, there are no Google n-grams available for Spanish, and thus the same approach for guessing the grammatical number or rearranging the clauses across the purpose-built search queries is, for the moment, infeasible. Secondly, it is worth underlining that the coverage offered by Wikipedia varies widely from English to other languages. At the time of writing, Wikipedia supplied about 2,500,000 English articles, whereas it offered only about 450,000 in Spanish. Wikipedia, consequently, yields a rather limited number of entries for the alias repository in Spanish.
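The growth in the number of search clauses can be made concrete with a minimal sketch. The article inventories below follow the discussion above (three English articles versus the ten Spanish ones also used in the patterns of table 3.2); the exact clause set of table 2.12 is not reproduced here, and the definiendum is merely illustrative.

```python
# Minimal sketch: how richer noun morphology inflates the number of search
# clauses. Clauses are formed as "<definiendum> <copula> <article>"; the
# article lists are an approximation of the clauses discussed above.

ENGLISH_ARTICLES = ["a", "an", "the"]
SPANISH_ARTICLES = ["el", "la", "lo", "los", "las",
                    "un", "una", "uno", "unos", "unas"]

def build_clauses(definiendum, copula, articles):
    return [f'"{definiendum} {copula} {article}"' for article in articles]

en = build_clauses("Tom Hanks", "is", ENGLISH_ARTICLES)
es = build_clauses("Tom Hanks", "es", SPANISH_ARTICLES)
print(len(en), "English clauses vs", len(es), "Spanish clauses")
# 3 English clauses vs 10 Spanish clauses
```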

More languages?

Adapting the methods to cope with other languages, including German, brings about four challenges: (a) discriminating descriptive phrases in the present tense from sentences in the perfect tense formed with "sein"; (b) handling the richer German morphology, which increases the number of search clues; (c) dealing with the more complex structure of German relative clauses; and (d) coping with orthographical variations caused by umlauts and compounds, such as "Koeln" versus "Köln". A deeper analysis of these phenomena is beyond the scope of this work. Nevertheless, the reader can go over [Sacaleanu et al., 2007, 2008] for some passage retrieval strategies exploited in the context of definition QA systems directed at the German language.

2.8 Future Trends

Looking for descriptions in external KBs seems to be the path of least resistance, due to their reliability. Definition QA systems make use of these reliable descriptions in different ways. One can envision, for this reason, systems with access to a larger number of KBs, thereby obtaining a wider coverage of reliable descriptions. In order to explore KBs supplementary to the ones presented in table 2.4, a web-based definition QA system operating on web snippets was utilised for extracting 1,829,889 descriptive phrases of about 291,026 different definienda. These definienda were distilled from Wikipedia, and they cover wide-ranging domains. As a result, this system sifted these presumed definitions from 511,044 different hosts. As a means of identifying the most noticeable sources, the frequencies of the hosts running these websites were counted. Table 2.14 underlines the most conspicuous hosts, excluding those already depicted in table 2.4:

Website                             Hits    Website                              Hits
www.absoluteastronomy.com          22363    www.probertencyclopaedia.com         2487
www.amazon.com                     14098    www.usatoday.com                     2457
dbpedia.org                         6607    www.myspace.com                      2339
dbpedia.openlinksw.com:8890         6601    www.time.com                         2091
.com                                6424    www.mahalo.com                       2070
infao5501.ag5.mpi-sb.mpg.de:8080    6071    www.washingtonpost.com               2043
www.guardian.co.uk                  5008    www.rootsweb.ancestry.com            2017
www.thefreelibrary.com              4436    www.imdb.com                         1938
www.spock.com                       4400    sports.espn.go.com                   1885
www.reference.com                   4188    www.bbc.co.uk                        1866
query.nytimes.com                   3760    ezinearticles.com                    1757
www.freepatentsonline.com           3629    testserver.semantic-mediawiki.org    1735
en.allexperts.com                   3121    www.pbs.org                          1714
www.geocities.com                   2971    findarticles.com                     1711
articles.latimes.com                2855    encycl.opentopia.com                 1698
www.youtube.com                     2799    www.powerset.com                     1638
news.bbc.co.uk                      2774    www.sfgate.com                       1630
profile.myspace.com                 2740    www.boston.com                       1557

Table 2.14: Websites providers of descriptive sentences.
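Tallying host frequencies such as those in table 2.14 requires very little machinery. The following is a minimal sketch, assuming the mined descriptive phrases are available as (sentence, source URL) pairs; the input format and the sample pairs are assumptions, not data from the experiment.

```python
# Minimal sketch: tallying which hosts supply descriptive phrases most often
# (cf. table 2.14). The input is assumed to be (sentence, source_url) pairs
# produced by a web-based definition QA system; the sample pairs below are
# purely illustrative.
from collections import Counter
from urllib.parse import urlparse

def host_frequencies(extractions):
    counts = Counter()
    for _sentence, url in extractions:
        host = urlparse(url).netloc
        if host:
            counts[host] += 1
    return counts

sample = [
    ("Tom Hanks is an American actor.",
     "http://www.absoluteastronomy.com/topics/Tom_Hanks"),
    ("The Beatles were an English rock band.",
     "http://www.reference.com/browse/The+Beatles"),
]
for host, hits in host_frequencies(sample).most_common():
    print(host, hits)
```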

The idea behind this table is that websites which are more likely to render reliable descriptive sentences will come up often across various definienda. This idea is somewhat similar, but not identical, to the Wisdom of the Crowds principle suggested by [Surowiecki, 2004]. Of course, the reliability of this table depends largely on the system used for mining the phrases; obtaining authoritative top entries is, nevertheless, expected. Table 2.14 shows these frequency counts and underlines the need to make allowances for closed-domain KBs, such as www.imdb.com and sports.espn.go.com. Some illustrative definitions extracted

from these resources:

Since these KBs are closed-domain, there is a chance to design procedures that discriminate descriptions on the grounds of domain particularities, and therefore to achieve a high precision. Table 2.14, on the other hand, highlights trustworthy open-domain KBs, including

www.absoluteastronomy.com, as well as not-so-dependable sources of definitions:


In light of these results, it can be concluded that this class of websites can provide definition QA systems with valuable pieces of information, such as descriptions in www.youtube.com videos. Another trend is capitalising on pseudo-relevance feedback to enhance the recall of descriptive knowledge, analogously to [Chen et al., 2006]. This kind of method is a fertile field to explore in conjunction with the strategies explicated in sections 2.5 and 2.6.

2.9 Conclusions

Conventionally, most definition QA systems take advantage of on-line and/or off-line KBs as reliable answer sources; the quintessential examples of each class are Wikipedia and WordNet, respectively. Broadly speaking, these kinds of resources have proven to improve the performance of TREC-oriented systems, but they have also been shown to suffer from narrow coverage. Since TREC-oriented systems project trustworthy nuggets found across these dependable resources into the AQUAINT corpus, they overcome this coverage problem by devising query rewriting techniques that allow them to improve the recall of promising web snippets and full documents.

This chapter discusses at length various approaches to discovering propitious sources of answers on the Internet. These strategies are basically directed at: (a) disregarding information about the domain, that is, they are utterly open-domain; and (b) boosting the probability that the descriptive knowledge is carried within the document surrogate returned by the search engine, so that definition QA systems can avoid fetching full documents. In conclusion, results indicate that Google n-grams and Wikipedia resources are particularly useful for optimising the retrieval of descriptive knowledge within web snippets, and as a logical consequence, they can also back QA systems in fetching a larger amount of promising full documents. Further, given the obtained outcomes, it can also be concluded that an alias resolution phase leads to a marked enhancement in terms of recall and performance. Furthermore, results reveal that the utilisation of multiple search engines brings about only a negligible improvement.

In addition, this chapter dissects the advantages and disadvantages of the design and use of definition relational databases, and it also looks into the obstacles encountered when trawling the Internet for descriptive knowledge in other languages (i.e., Spanish). In essence, some of these issues encompass the marked difference in the size of the Spanish and English Web, and the increase in morphological complexity of Spanish in comparison to English. A final observation concerns the fact that different crawling techniques fetch different pieces of descriptive information, which brings to mind the idea of amalgamating them as a means to boost recall at the expense of download time. This chapter also envisages an increment in the exploitation of closed-domain KBs as sources of authoritative descriptive knowledge.

Chapter 3 What Does this Piece of Text Talk about?

“We are drowning in information but starved for knowledge." (John Naisbitt)

“As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know." (Donald Rumsfeld)

3.1 Introduction

By and large, definition Question Answering (QA) systems take advantage of Information Retrieval (IR) engines for finding passages and/or documents relevant to the question asked by the user. In general, each IR engine has a set of views of the collection reflecting a combination of global (all documents) and local (each document) statistics. These representations are typically utilised for indexing the collection, and accordingly, for locating the most propitious documents given a search string. For the most part, IR engines return top-ranked documents in accordance with a given purpose-built similarity measure. Broadly speaking, this similarity measure takes the terms within the search string and scores each document containing any of these terms according to how representative each term is of the document and of the collection. The most delineative passages are ordinarily returned as document surrogates along with a link to the respective resource. Assuredly, a myriad of diverse approaches to computing this likeness exist.

There are several reasons why this abstraction and ranking can be problematic for definition QA systems. For starters, the index of the collection and the similarity function are rarely optimised to privilege descriptive content about the search string, thus causing a detriment to recall. It is important to underline here that systems typically take into consideration solely the top N hits (normally, N=100...1000), since there is always a trade-off between quality and response time. Secondly, there can be a disagreement between the way the user writes the definiendum and the alias found in the collection, which forces the matching to be relaxed; this relaxation materialises in the form of disperse matches that affect the recall of descriptive knowledge and, from the standpoint of definition QA systems, the precision of the result set.

In practice, the array of features that each similarity measure prioritises goes hand in hand with the intention behind the application. A commercial search engine, for instance, may prefer documents that raise its revenue over more informative hits. In other words, the input of the user might be only one of the key factors when ranking and retrieving the most promising documents.


This chapter is aimed specifically at fleshing out some of the complicated matters found across passages obtained from the target collection of documents when querying a traditional IR engine for the definiendum. The organisation of this chapter is as follows: section 3.2 elaborates on the issue of aligning terms embodied in the definiendum with words in retrieved excerpts; section 3.3 discusses some of the relations and sources of disagreement between the topic(s) of the fetched passages and the topic(s) stipulated by the definiendum; section 3.4 describes some approaches to ensure that certain classes of excerpts (i.e., passages that observe some widely known definition patterns) talk about the definiendum, that is to say, to ensure the agreement between the topic(s) of the definiendum and the topic(s) of particular kinds of passages; section 3.5 goes over some aspects related to co-reference resolution and predicts future trends; and, finally, section 3.6 highlights the main conclusions.

3.2 Scatter Matches of the Definiendum

To begin with, scatter matches are the first phenomenon that can be encountered across surrogates and documents returned by IR engines. Precisely, some of the top-ranked results can bear only disperse matches of the words within the query (the definiendum). These scatter matches can have both a positive and a negative impact on the recall of descriptive knowledge. On the positive side, a disperse match allows insertions of terms between and/or around definiendum words, thus boosting the chance of discerning a greater number of potential descriptions. Conversely, scatter matches escalate the probability of retrieving more spurious and misleading contexts. Scatter matching is also understood to include partial matching of the definiendum. To reinforce this point, consider the next surrogate retrieved by Yahoo! Search

when searching for “Barack Obama”:

In this fragment, the insertion of the token “H.” makes the recognition of the definition pattern “⟨definiendum⟩ is the ⟨description⟩” harder. Note that this pattern is widely utilised for singling out promising answer candidates. The reader can look at a list of widespread definition patterns in section 3.4 on page 67. As a means of tackling this matter, [Xu et al., 2003] allowed definienda corresponding to person names to be separated by a sequence of at most three terms, “F w1 w2 w3 L”, where F and L denote the first and last names of the definiendum, respectively. On the other hand, they required an exact match for any other kind of definiendum after converting plurals to their respective singular forms. This technique assists in matching “Barack H. Obama” when coping with “Barack Obama”. Of course, this sort of procedure can produce false positives (e.g., “George Bush” can overmatch “George H. W. Bush” or “George W. Bush”). Another linguistic phenomenon observed within descriptive phrases is the deletion (partial matching) of definiendum words. To exemplify, consider the

next surrogate:


Fundamentally, [Katz et al., 2004] coped with this phenomenon by splitting the definiendum and checking the matches afterwards. Even so, not all definiendum terms deserve the same consideration, as many of them carry little meaning (e.g., “The Fool and His Money”, “Hatfield and the North” and “For Love or Money”). Further, in many cases a full match must be enforced (e.g., “For Love or Money”) because the definiendum is composed of words that have the potential of matching extremely diverse topics. The archetypes of this kind of term are stop-words. This approach seems to be, nevertheless, the path of least resistance when overcoming this drawback. Yet, this problem becomes more difficult when deletions

and insertions happen simultaneously:

Research into definition QA systems has not directly addressed this issue yet. One reason for this might be the fact that redundancy mitigates the consequences implied by this phenomenon. This redundancy comes from amalgamating evidence from the target corpus (i.e., AQUAINT) and Knowledge Bases (KB) as well as the Web. That is, descriptions found in sentences with a disperse match of the definiendum can also be discovered across sentences that provide more precise matches. By the same token, the evidence yielded by contexts carrying scatter matches can be corroborated by the context taken from the corresponding entry in KBs and by web sentences containing the exact definiendum. These solutions, nonetheless, work for definienda high in frequency or amply covered by knowledge bases, and they overlook the fact that senses are not specified by KBs, but rather by the entire collection of target documents. At any rate, [Soubbotin, 2001] pioneered efforts to tackle this issue head-on by accounting for a pack of complex rules that recognise changes, such as word permutations and the insertion of punctuation and/or titles (e.g., “His Excellency"). They noticed that definition patterns, like “⟨definiendum⟩ is the ⟨description⟩”, with more sophisticated internal structures are more indicative of the answer. Another approach is due to [Wu et al., 2004], who took advantage of synsets in WordNet for dealing with alias resolution. Certainly, this method relies heavily on the coverage of WordNet. On a different note, the repository of aliases presented in section 2.6.1 on page 45 can also offer some help.
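To make the flexible matching discussed in this section concrete, the following is a minimal sketch of a person-name definiendum matcher that tolerates up to three intervening tokens between the first and last name, in the spirit of the “F w1 w2 w3 L” constraint attributed to [Xu et al., 2003] above. The regular expression is an illustrative reading of that constraint, not the original implementation.

```python
# Minimal sketch: flexible matching of a person-name definiendum, allowing at
# most three intervening tokens between the first and last name
# ("F w1 w2 w3 L"), in the spirit of the constraint discussed above.
import re

def person_name_pattern(definiendum: str) -> re.Pattern:
    tokens = definiendum.split()
    first, last = re.escape(tokens[0]), re.escape(tokens[-1])
    # Allow zero to three inserted tokens (middle names, initials, titles).
    return re.compile(rf"\b{first}(?:\s+\S+){{0,3}}\s+{last}\b")

pattern = person_name_pattern("Barack Obama")
print(bool(pattern.search("Barack H. Obama is the 44th President of the United States")))  # True
print(bool(pattern.search("Barack Obama is a former senator from Illinois")))              # True
# Caveat noted above: "George Bush" would likewise overmatch "George W. Bush".
```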

3.3 Topic Shift in Descriptive Sentences

From another perspective, passages returned by the IR engine can subsume descriptions, but of topics different from the definiendum. This phenomenon can be the result of two distinct factors: (a) as aforementioned, scatter matches can produce a false positive that materialises a shift in the focus of the definition towards another topic; and (b) even though a text passage can exactly match the definiendum, its context can signal the shift. The

following surrogates illustrate how scatter matches can bring about a topic shift:


In these working snippets, the addition of the tokens “Walker" and/or “Herbert" might induce the detection of descriptions about two different but strongly related presidents. In actuality, this is the crystallisation of a much deeper problem in definition QA systems: sense discrimination. Both descriptions could have been equally verbalised by referring only to the name “George Bush", entailing consequently the need for a deeper analysis to detect the ambiguity, and accordingly, the necessity of grouping answer candidates in congruence with their senses. It is crystal clear, nevertheless, that this kind of scatter match aggravates the possibility of perceiving spurious putative answers as genuine, because misleading answers are given higher chances of being selected as answer candidates. As in the working examples, in many cases the degree of relatedness of the distinct concepts makes the discrimination of the different senses harder (e.g., both presidents of the same country, and of the same bloodline). Further, [Figueroa and Neumann, 2007] observed that some shifts can bring forth interesting descriptive phrases. An excellent case is the disease “neuropathy", which can

be shifted to the next closely related disorders:

These four related diseases can be seen as hyponyms of the same hypernym, that is, they pertain to the putative input of the user “neuropathy". It is notable that some semantic similarities used in these four explanations offer insight into the meaning of the definiendum (i.e.,

disorder/disease, nerve, signal and caused by damage/effects of):

While showing these four explications to the user is disputable and depends on the goals of the application, they help to understand the underlying linguistic phenomenon behind the recognition of definitions (see the discussion in section 1.6 on page 10). On the contrary, some shifts can generate what arguably are loosely related sentences: “G7" to “Powershot

G7".


At any rate, insertions can also be neutral, that is, they do not necessarily add

information about the definiendum and they do not shift the topic:

In summary, what is at stake is a trade-off between precision and recall. While a flexible matching of the definiendum increases the recall, it can also overmatch, accepting some misleading sentences as genuine definitions. A strict alignment can, in contrast, leave some definition questions unanswered, or at least exclude some pertinent facets of the definiendum from the output shown to the user. It is possible, nonetheless, that some spurious sentences can be filtered out in posterior steps of the answering process, namely ranking. This last kind lies in the middle ground between the two causative factors of topic shift, since the extra terms can be interpreted as the context in which the (full) matching of the definiendum occurs. Generally speaking, the second factor of topic shift encompasses a variety of texts, including opinions. Relaxing the matching of the definiendum can match some structures that are commonly used for expressing opinions (e.g., “I think/believe that definiendum is the ..."). This class of shift situates the explanation in a private/personal context, making it very subjective, and ergo untrustworthy. Another representative case of context shift stems from some matches of the definiendum when contained in prepositional phrases.

A straightforward example of this second sort of factor is as follows:

Previously, section 2.5.1 briefly touched on the subject of topic shifts. In particular, that section focused on the disagreement between the grammatical number of the subject of the sentence and the grammatical number of the definiendum. This discrepancy, by all means, can be utilised as one latent feature; that is to say, it can sometimes help to filter out some misleading hits.

3.4 Strategies to Match the Definiendum when Aligning Definition Patterns

In recent years, the problem of aligning the definiendum entered by the user with an occurrence within a candidate descriptive sentence has not been the focus of attention of definition QA systems. As mentioned earlier, the reason for this is that most systems allay this problem by relying largely on the redundancy supplied by knowledge bases, the Internet and/or the target collection of documents. The growing need, however, to return more diverse output to the user stimulates the demand for more efficient matching techniques.

Predominantly, aligning methods capitalise on shallow syntactic information, namely chunking. For instance, [Hildebrandt et al., 2004] made sure that the noun phrase preceding the verb phrase embodying the definition verb (e.g., “is" and “was" as well as “became") bears the definiendum (see section 3.4). In the same spirit, [Xu et al., 2005] checked the first two noun phrases of the first sentence of a paragraph. In order for a sentence to be conceived as a candidate answer, they imposed the restriction that both noun phrases must be joined by one of the following two prepositions: “by" and “for". This extension allowed [Xu et al., 2005] to

capture sentences defining more complex definienda, such as “Perl for ISAPI":

Certainly, the imposition of constraints is aimed specifically at boosting the accuracy of the definition pattern matching at the expense of worsening the recall, and consequently diminishing the diversity of the output. Nevertheless, the degree of diversity required is dependent on the type of application. While a web user may expect a rich variety of information, a user in front of a mobile phone may be waiting for a concise, but quick and precise, response. One kind of information that is very likely to be missed when placing the

restriction of [Xu et al., 2005] is historical events, such as the following:

As a matter of fact, every time a noun phrase locating a historical event related to the definiendum in the timeline appears at the beginning of the sentence, that sentence will fail to meet this restriction, thereby increasing the probability of missing the descriptive information about the corresponding historical event. The reader can extend this line of reasoning to domain-specific definitions, like those starting “In algebra, ...." and “In statistics, ....". Assuredly, the redundancy supplied by some massive collections lessens this adverse effect. Conversely, verifying only the antecedent noun phrase can result in the assimilation of definitions loosely or closely related to other definienda, or of non-definitions (e.g., opinions and advertisements), into the final output. The

next two surrogates are delineative of the former ramification:

Undoubtedly, the first snippet suggests that this phenomenon can come to pass whenever a location is being defined. By the same token, manifold events are also directly connected with a particular location during a limited window of time, and accordingly, they can also materialise a shift in the focus or the topic of the definition. Still, this can also happen when tackling other sorts of definienda, including institutions and diseases.
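A minimal sketch of the antecedent-noun-phrase check discussed at the start of this section is given below. It assumes sentences have already been chunked into (label, text) pairs by an external chunker; the chunk labels, the definition-verb list and the example chunks are illustrative assumptions rather than the original implementation.

```python
# Minimal sketch: checking whether the noun phrase immediately preceding the
# verb phrase that carries a definition verb bears the definiendum, in the
# spirit of the antecedent-noun-phrase check discussed above. Sentences are
# assumed to be pre-chunked into (label, text) pairs elsewhere.

DEFINITION_VERBS = {"is", "are", "was", "were", "became"}

def antecedent_np_bears_definiendum(chunks, definiendum):
    target = definiendum.lower()
    for i, (label, text) in enumerate(chunks):
        has_def_verb = label == "VP" and any(
            tok in DEFINITION_VERBS for tok in text.lower().split())
        if has_def_verb and i > 0 and chunks[i - 1][0] == "NP":
            return target in chunks[i - 1][1].lower()
    return False

# Illustrative, pre-chunked sentence (not taken from the thesis examples):
chunks = [("NP", "The Hubble Space Telescope"), ("VP", "is"),
          ("NP", "a space telescope"), ("PP", "in low Earth orbit")]
print(antecedent_np_bears_definiendum(chunks, "Hubble Space Telescope"))  # True
```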

In general, as noted by [Han et al., 2006], a sentence can be a definition and embody the definiendum, while its descriptive content is not necessarily about the definiendum. For this reason, [Han et al., 2006] made use of language models to estimate the probability that a given sentence talks about the topic or context of the definiendum; in this way they attempted to boost the accuracy of the syntactic definition patterns presented in [Han et al., 2005]. The underlying idea behind their strategy is that definition sentences talk about the definiendum when they are in one of the contexts of its potential senses. Essentially, these language models are built by weighing evidence taken from (see greater details in section 6.4 on page 146): (a) top-ranked web pages downloaded by submitting the definiendum, (b) the respective entry in knowledge bases, and (c) top-ranked documents extracted from the target collection. They concluded that knowledge bases render the most authoritative source of evidence, whereas web pages render the noisiest. Simply put, their results show that the evidence extracted from knowledge bases was useful for discarding some definitions relating to other concepts, because the extracted contexts act as filters of sentences belonging to contexts irrelevant to the various plausible senses of the definiendum. It is, however, hitherto unknown how this method performs when filtering definitions originating from contexts similar to the definiendum. By similarity, it is meant coming from overlapping contexts like the ones sketched by

the definiendum “Farouk of Egypt" and the next web snippet:

There are two additional facets that need to be balanced when coming to the decision of which strategy to follow: language portability and the computational resources demanded. Both approaches make use of linguistic processing, which means both systems must allocate computational resources to parsing or chunking sentences. The technique of [Han et al., 2006] additionally demands the computational resources required for downloading the respective articles from the KBs and for building the language models thereof, as well as the resources needed for rating the candidate sentences. In terms of language portability, both methodologies require linguistic tools dependent on the target language. Ergo, the portability of both strategies is predicated on the portability of the linguistic tools and the inherent properties of the target language. An attractive aspect of the procedure of [Han et al., 2006] is that it can still be applied to other languages by matching language-specific definition patterns at the surface level (see for example tables 3.1 and 3.2), that is, without requiring a linguistic tool that outputs more precise syntactic dependencies in the target language. This is relevant because the performance of these linguistic tools varies widely from one language to another. Therefore, in this envisioned scenario, diverting computational resources from parsing and chunking to building language models pays off in terms of better portability, and presumably in an enhancement with respect to an approach based exclusively on noun phrase analysis. But, as [Han et al., 2006] pointed out, this method will still be, nonetheless, heavily dependent on the coverage of the knowledge bases in the target language, especially on the number of senses. Certainly, this coverage radically changes from one language to another [Figueroa, 2009] (see section 6.5 on page 149). While it is true that the strategy adopted by [Han et al., 2006] has interesting features, things need to be put into perspective. Matching patterns, and thus the definiendum, is just one early step in the answering process. Many misleading definitions can still be discarded in posterior steps such as ranking and summarisation. For instance, some approaches, such

as [Figueroa and Atkinson, 2009] (see section 6.5 on page 149), bias the ranking of answer candidates in favour of predominant potential senses. This way they expunge some of the spurious definitions at the expense of a reduction in the diversity of the output. Independently of how these relevant contexts are determined, by means of knowledge bases or web redundancy, both approaches are inclined to discard putative answers that do not significantly match these pre-determined pertinent contexts. This approach, nevertheless, has the advantage that these prominent contexts are inferred from the same set of answer candidates, cushioning the problem of the narrow coverage supplied by the knowledge bases for some definienda. On the other hand, it is worth noting that since this method also accounts for linguistic information, namely dependency trees, it is hence less portable to other languages. The second repercussion of slackening the pattern alignment is the assimilation of opinions and advertisements into the final output. Some opinions and advertisements are also very likely to observe some widely used definition patterns. In any event, opinion sentences matching these regularities frequently yield evidence that helps to recognise their opinionated nature. This evidence is in the form of some constructs that can normally be located at the beginning of the sentence. Some of these structures include: “But

I don’t believe/think" and “But I must say that". The following web snippets exemplify this:

Without a shadow of a doubt, a battery of these expressions can be exploited to filter out some opinions. At any rate, this problem is harder than it seems, because an opinion can still convey factual information, independently of the usage of phrases such as “But I don’t believe/think".

In this case, these clues indicate the author's disbelief in the actual fact. For instance,

Interestingly enough, this obstacle becomes more problematic when opinions presented as actual facts are kept in mind. This is the case of advertisements, lies and some opinions.

For example,

All in all, an efficient ranking methodology should be able to eliminate most of these misleading sentences. Another solution, introduced by [Figueroa and Neumann, 2007], resides in capitalising on the Jaccard Measure for distinguishing more reliable descriptive sentences. The Jaccard Measure, J, of two terms wi and wj is the ratio between the number of distinct unigrams that they share and the total number of distinct unigrams:

Pattern                                                                   Threshold
δ′ [is|are|has been|have been|was|were] [a|the|an]                           0.33
[δ′|η′], [a|an|the] [η′|δ′] [,|.]                                            0.25
δ′ [become|became|becomes] η′                                                0.25
δ′ [|,] [which|that|who] η′                                                  0.25
δ′ [was born] η′                                                             0.5
[δ′|η′], or [η′|δ′]                                                          0.25
[δ′|η′] [|,] [|also|is|are] [called|named|nicknamed|known as] [η′|δ′]        0.25
[δ′|η′] ([η′|δ′])                                                            0.25

Table 3.1: Some surface patterns for English (source [Figueroa and Neumann, 2007]).

J(wi, wj) = |wi ∩ wj| / |wi ∪ wj|                                            (3.1)

Accordingly, they compared the subject of the candidate sentence with the definiendum. For example, consider the definiendum δ∗ = “John Kennedy", which might also be expressed as δ′∗₁ = “John Fitzgerald Kennedy" or δ′∗₂ = “Former US President Kennedy". The values of J(δ∗, δ′∗₁) and J(δ∗, δ′∗₂) are 2/3 and 1/5, respectively. This methodology filters trustworthy descriptive knowledge by means of a pattern-specific threshold, avoiding additional purpose-built hand-crafted insertion/deletion rules and ad-hoc linguistic processing.

Table 3.1 shows eight surface definition patterns and their respective thresholds. These values were determined by empirically testing various thresholds from 0.2 to 0.7, and manually counting the corresponding number of non-descriptive or spurious selected sentences. Of course, some sentences embracing useful nuggets will be eliminated, but these discarded nuggets can also be found in other retrieved phrases, e.g., “Former US President Kennedy" in “John Fitzgerald Kennedy was a former US President." In short, this approach trusts implicitly in the redundancy of the Internet for discovering numerous paraphrases.
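A minimal sketch of this word-overlap filter follows: the Jaccard ratio of equation (3.1), computed over distinct unigrams, is compared against a pattern-specific threshold such as those in table 3.1. The helper names and the threshold used in the example are illustrative.

```python
# Minimal sketch: the unigram Jaccard ratio of equation (3.1), used to decide
# whether the subject of a candidate sentence is a trustworthy rendering of
# the definiendum, given a pattern-specific threshold (cf. table 3.1).

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def passes(definiendum: str, subject: str, threshold: float) -> bool:
    return jaccard(definiendum, subject) >= threshold

print(jaccard("John Kennedy", "John Fitzgerald Kennedy"))           # 0.666...
print(jaccard("John Kennedy", "Former US President Kennedy"))       # 0.2
# The copular pattern of table 3.1 uses threshold 0.33:
print(passes("John Kennedy", "Former US President Kennedy", 0.33))  # False
```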

Other languages: Spanish

Unquestionably, the experimental thresholds utilised by [Figueroa and Neumann, 2007] do not supply pinpoint accuracy, failing in some of the same respects as other approaches. Nonetheless, the interesting facets of this strategy are its speed, its potential for easy language portability, and its independence from KB contexts. In theory, whenever a definition pattern exists in the target language, porting this procedure involves experimentally setting its respective threshold. In practical terms, there is no drastic difference between porting to another language and adding a new definition pattern to a previously existing set (e.g., for English). In order to examine language portability, consider Spanish as the target language. To begin with, let us examine the following example, which was fetched when searching for “Hugo

Chávez es la":


Pattern                                                                               Threshold
δ′ [es|son|fueron|fue|ha sido|han sido] [la|lo|el|un|una|uno|unos|unas|las|los] η′       0.33
δ′ [,|;] [un|una|uno|la|lo|el|los|las] η′ [,|;|.]                                        0.25
δ′ [ha llegado a ser|llego a ser|se transformo|se ha transformado] η′                    0.25
δ′ [,|] [el cual|la cual|los cuales|quien|que] η′                                        0.25
δ′ [nacio|fue fundado|fue fundada] η′                                                    0.4

Table 3.2: Surface patterns for identifying definitions in Spanish (source [Figueroa and Neumann, 2007]).

This illustrative web snippet communicates information about a reform advocated by Hugo Chávez, but not about him. In this working example, the Jaccard Measure between “Hugo Chávez" and “Una de las reformas más importantes de Hugo Chávez" is 2/8 = 0.25. Therefore, according to the definition patterns for Spanish sketched in table 3.2 and their respective thresholds, this technique filters out this working sentence.

The special advantage of this word-overlap methodology is that it can be applied to different languages indistinctly, which is vitally important in designing multilingual definition QA systems. However, applying this technique to a new language inevitably implies computing new experimental thresholds. Still, there are two additional difficulties that arise when applying this strategy to Spanish: (a) the discarded sentences can embrace descriptive information that is not present in the group of sentences deemed dependable, and (b) sentences in Spanish do not necessarily carry an explicit subject. In the case of English, the former is tangibly alleviated by the amount of redundancy provided by the Internet (see section 2.7 on page 50). For the purpose of explaining the latter, let us consider the first four sentences taken from the Wikipedia article regarding “Genovevo

Rivas Guillén":

In this paragraph, sentences 2-4 omit all explicit references to “Genovevo Rivas", but they nevertheless put into words factual information about him. Sentence 4 especially serves to highlight the case of the passive “se" construction, which is chiefly used in the third person, increasing its probability of being utilised for defining concepts. The absence of references forces definition QA systems to process the entire paragraph, as a means of determining to whom each sentence refers. While it is arguable that, in the case of biographical sources, the title and the position of the sentences are good features for solving this problem, it is also true that numerous other classes of documents do not observe these patterns, and they still verbalise descriptive information. Taking into account all sorts of documents is particularly important for languages for which only a small amount of redundancy is available.

For instance, consider the following blog entry:


In this blog entry, the definiendum is the direct object of the first sentence, and all the posterior sentences talk about this object without referencing it explicitly. Conversely, descriptive sentences in English usually convey an explicit subject, making it easier to disambiguate whom or what they are talking about. In the particular case of definitions, the subject can refer to the definiendum by means of pronouns (e.g., he, it and her), orthographical variations, aliases, synonyms or constructions such as “the album", “the 1935 film", “the place", “the French writer" and “the president". All things considered, recognising implicit subject pronouns is particularly important for maximising the chances of identifying descriptive information from as many assorted documents as possible. This linguistic phenomenon reaffirms the need for a higher level of redundancy, and, from an alternative standpoint, it stresses the necessity for deeper linguistic processing at the paragraph level.

Enriching the Jaccard Ratio with Definition Knowledge

It is clear that redundancy helps definition QA systems reduce the number of descriptions that are missed specifically due to pattern mismatching. However, it is not a magic bullet that completely remedies this difficulty. Let us take a closer look at some of the plausible root causes of this obstacle. As aforementioned, noun phrases are sometimes inserted at the beginning of the sentence. These noun phrases are frequently directed at verbalising and contextualising definitions (e.g., the location in the timeline of an event).

Take the working example yielded by the following web snippet:

In fact, the phrase “At 47 years old" will induce a misalignment regardless of whether a strategy premised on exact chunk matching or on Jaccard ratios is used, and thus the loss of the description “oldest active player in the NFL". To be more precise, the Jaccard score for this sentence is 2/9 = 0.22, which makes it ineligible to be an answer candidate.

A possible way of attenuating this effect is by slackening the threshold restriction whenever there is evidence suggesting that it is trustworthy to do so. This reliable evidence can be discovered in the form of definition markers. These markers are phrases that are very likely to appear at the beginning of descriptive sentences (ahead of the definiendum), especially within those sentences matching a pre-determined array of definition patterns. The dependability of definition markers can be stipulated in terms of whether or not: (a) they preserve the topic established by the definiendum; (b) they do not embody pronouns; and (c) they are likely to start descriptive sentences.

More exactly, definition markers can be collected by traversing sentences matching definition patterns across Wikipedia articles, and subsequently extracting the sequences of words spanning from the beginning of the description until the first comma. The benefit of this surface-motivated approach versus using chunking is that the former makes it possible to process a large-scale corpus faster, while a frequency count along with a threshold can cut off most spurious findings.

Definition Markers            Definition Markers          Definition Markers
About CD years later,         Fundamentally,              So far,
Also in that year,            In Greek Mythology,         The founder,
At CD years old,              In ancient times,           The stadium,
During the CDth century,      More precisely,             Traditionally,
During the war,               On MONTHCD,                 Under the law,
For many people,              Several years later,        Until the CDth Century,

Table 3.3: Highly frequent definition markers discovered across WIKIPEDIA abstracts.

As a natural advantage, it is also plausible to employ this procedure to recognise markers in languages other than English (e.g., Spanish).

Table 3.3 underlines some highly frequent definition markers harvested from Wikipedia abstracts. In this selection, definition markers carrying pronouns or words starting with a capital letter (excluding the first word) were not taken into consideration, and numbers were replaced with the placeholder “CD". Then, whenever any of the gathered markers is found at the beginning of a candidate sentence, the sequence of words or the respective chunk can be removed, and the definition QA system next proceeds to test the set of definition patterns. In the illustrative surrogate, the Jaccard ratio increases from 2/9 = 0.22 to 2/5 = 0.4. This increment turns this sentence into a qualifier for the succeeding steps of the answering process. Lastly, it is worth pointing out that these definition markers can also serve as extra evidence of descriptive content when coupled with definition patterns. In other words, they can be utilised as a salient feature for scoring answer candidates.

Another way of diminishing the reliance on redundancy, while at the same time boosting the chances of distinguishing genuine descriptions, is detecting common definition phrases that are very likely to antecede the definiendum. To aptly illustrate, consider the following

web snippet on “Jagdish Bhagwati":

This working surrogate does not reach the succeeding steps of the answering process, regardless of whether strict pattern matching or the Jaccard ratio is used. In particular, this sentence finishes with a Jaccard score of 2/7 = 0.29. One of the reasons for its disqualification is that this sentence carries descriptive content anteceding the definiendum. More specifically, the mismatch resulted from the initial definition phrase: “Economic professor". In like manner, definition phrases can be collected by inspecting the first definition line supplied by Wikipedia abstracts. Lines matching the first definition pattern in table 3.1 are highly likely to convey these definition phrases, which can be extracted by trimming the respective description at the first comma or verb. Here, as a means of achieving reliable counts, part-of-speech tagging and a frequency threshold can be utilised. Correspondingly, table 3.4 shows some recurrent definition phrases across Wikipedia first definition sentences. Consequently, definition phrases can be exploited in a similar way to the definition markers (see the sketch after table 3.4). In the illustrative sentence, the new value of the Jaccard score increases to 2/5 = 0.4, hence qualifying for the posterior steps of the answering process.

Definition Phrases       Definition Phrases           Definition Phrases     Definition Phrases
American actor           U.S. Representative          fourth album           second single
American author          census town                  high school            second son
American lawyer          census-designated place      live album             secondary school
American politician      compilation album            multi-use stadium      small town
American writer          county seat                  plant pathogen         small village
Economic professor       debut album                  political party        state highway
English footballer       eldest son                   public high school     studio album
French commune           fictional character          radio station          television station
Italian painter          football club                railway station        trade union
Norwegian politician     founding member              second album           train station

Table 3.4: Highly frequent definition phrases extracted from WIKIPEDIA first lines.
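The sketch below illustrates the marker/phrase-stripping step referred to above: a leading definition marker (cf. table 3.3) or definition phrase (cf. table 3.4) is removed before the Jaccard ratio is recomputed. The prefix list, the digit approximation of “CD", and the example subject are illustrative assumptions and do not reproduce the exact numbers of the running examples.

```python
# Minimal sketch: stripping a leading definition marker (cf. table 3.3) or
# definition phrase (cf. table 3.4) before recomputing the Jaccard ratio of
# the candidate subject against the definiendum. The CD placeholder is
# approximated with a digit pattern; prefixes and examples are illustrative.
import re

PREFIXES = ["At CD years old,", "During the CDth century,", "Economic professor"]
PREFIX_RES = [re.compile("^" + re.escape(p).replace("CD", r"\d+"), re.I)
              for p in PREFIXES]

def strip_prefix(subject: str) -> str:
    for rx in PREFIX_RES:
        subject = rx.sub("", subject, count=1).lstrip()
    return subject

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

definiendum = "Jagdish Bhagwati"
subject = "Economic professor Jagdish Bhagwati"      # illustrative subject
print(round(jaccard(definiendum, subject), 2))                 # 0.5
print(round(jaccard(definiendum, strip_prefix(subject)), 2))   # 1.0
```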

3.5 Co-reference & the Next Sentence

Most definition QA systems aim at finding as much descriptive information as possible. In achieving this, they are also compelled to analyse sentences that do not explicitly bear the definiendum. The first and natural approach is to interpret as the most promising sentence candidates: (1) those close to an explicit mention of the definiendum, chiefly the subsequent sentences, and (2) sentences containing pronouns. For instance, [Xu et al., 2003] made use of tools for resolving co-references, and accordingly, they took into consideration sentences carrying noun phrases that either match the definiendum directly (via string comparison) or indirectly (through co-reference). However, depending on the algorithm, the number of answer candidates, and the length and/or grammaticality of each sentence, this type of linguistic processing can demand a considerable amount of computational resources. Some definition QA systems, therefore, focus primarily on immediately succeeding sentences, principally sentences in which the head word is an anaphora. More crucially, [Han et al., 2004, 2005] observed that an anaphora refers to the definiendum if the anaphora is used as a subject and the definiendum is also used as a subject in the previous sentence. In particular, [Chen et al., 2006, Figueroa and Neumann, 2007] assumed that if any successive sentence starts with a pronoun, this pronoun is deemed to refer to the

definiendum. This procedure assists in dealing with web snippets such as:

This method will then overwrite the pronoun “She" with “Adderley", whenever the definiendum typed by the user is “Adderley". A more aggressive strategy can also build on the replacements adopted by [Keselj and Cox, 2004, Abou-Assaleh et al., 2005], that is, additionally resolving possessive pronouns, helping to link the definiendum with interesting entities or events (see some examples in tables 3.6 and 3.7). This sort of substitution can connect the definiendum with some of its plausible hyponymic relations, such as author-books and singer-songs [Figueroa and Neumann, 2008]. This class of relation can be of interest to the user when reading a definition about the corresponding hypernym (an example can be seen in table 3.7). The replacements are accordingly detailed in table 3.5. On the one hand, this approach aids in recognising more descriptive information, in particular some descriptions low in frequency, ergo providing wider coverage and a more diverse output to the user.

Co-reference                                                                     Replacement
/ it /, / he /, / she /, / they /, /^It /, /^He /, /^She /, /^They /             the definiendum
/ its /, / his /, / her /, / their /, / theirs /, /^Its /, /^His /, /^Her /,     the definiendum followed by “ 's "
/^Their /, /^Theirs /

Table 3.5: Substitutions utilised for shallow co-reference resolution (source [Keselj and Cox, 2004, Abou-Assaleh et al., 2005]).
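A minimal sketch of these shallow substitutions is given below: a sentence-initial personal pronoun in the next sentence is overwritten with the definiendum, and a sentence-initial possessive pronoun with the definiendum plus "'s". Only the sentence-initial case is handled here; the full rule set of the cited systems (table 3.5) also covers pronouns in other positions.

```python
# Minimal sketch: shallow co-reference resolution in the spirit of table 3.5.
# A sentence-initial personal pronoun is overwritten with the definiendum,
# a possessive pronoun with "<definiendum>'s".
import re

PERSONAL = r"^(It|He|She|They)\b"
POSSESSIVE = r"^(Its|His|Her|Their|Theirs)\b"

def resolve_next_sentence(definiendum: str, sentence: str) -> str:
    sentence = re.sub(POSSESSIVE, definiendum + "'s", sentence)
    return re.sub(PERSONAL, definiendum, sentence)

print(resolve_next_sentence("Adderley", "She joined the band in 1975."))
# Adderley joined the band in 1975.
print(resolve_next_sentence("Adderley", "Her first album appeared in 1976."))
# Adderley's first album appeared in 1976.
```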

On the other hand, this kind of technique entails a higher risk of diminishing accuracy by identifying some misleading nuggets that have a low frequency, and which are hence hard to validate. Certainly, some systems can rely on the posterior steps of the answering process (i.e., ranking) to lessen the impact of misleading reference resolutions. In the specific case of web snippets, truncations make the degree of uncertainty higher, making the accurate resolution of pronouns harder, even for human readers.

Broadly speaking, there are two key aspects that must be kept in mind when resolving co-references. In the first place, definition QA systems do not necessarily need to resolve all co-references existing in a paragraph or a document; actually, only co-references within descriptive contexts are needed. Resolving only these co-references can effect a reduction in the amount of noise in the posterior steps of the answering process. Doing this, however, inherently involves a prior recognition of these descriptive contexts, which is not a straightforward task. In the second place, few pronouns referring to other entities or concepts are indispensable; in essence, only those correlating with the definiendum or with one of its references within descriptive contexts are necessary. In fact, the same descriptive information can sometimes be discovered in another document with explicit references, but the appropriate analogies are difficult to draw.

A practical way of easily finding propitious descriptive contexts embodying pronouns is by inspecting definitions across Wikipedia abstracts. Here, the main focus of attention is pronouns within the first chunk of descriptive phrases. The reason to pay attention only to this chunk is two-fold: (a) such a co-reference is more likely to point to an entity or concept mentioned in the preceding sentence, while at the same time, (b) it allows extending the method used by [Chen et al., 2006, Figueroa and Neumann, 2007] from accounting solely for pronouns located at the first word (subject) to pronouns within first chunks. These templates were manually checked, thereby ensuring that these contexts are likely to refer to a definiendum stipulated in the previous sentence. It is worth noting that these templates can be utilised along with replacing an initial pronoun with the definiendum. Tables 3.6 and 3.7 exemplify some interesting findings. In these tables, CD, PRP and PRP2 denote a number, a pronoun and a possessive pronoun, respectively. The placeholder NNP stands for a sequence of words tagged as proper nouns. At answering time, whenever speed is a determining factor, these syntactic categories can be discriminated on the grounds of regular expressions and a list of lexical items; in this case, NNP will correspond to a sequence of words, each starting with a capital letter. In short, the scant linguistic knowledge needed to match these templates brings about an advantage in speed. There is an extra, but contributing, factor behind these templates: they are normally utilised for conveying various kinds of descriptions. For instance, templates bearing the placeholder CD are more likely to express temporal information.
These templates can be viewed as a sister methodology or a potential extension of the approach proposed by [Paşca, 2008]; that is, they can be useful for presenting a timeline of events related to a particular definiendum to the user.

Template                 Example
ABOUT CD PRP             James HOGAN was born in the Maury Co. area about 1822, the son of John HOGAN and Elizabeth PAYNE. About 1845 he married Elizabeth FRY, who was born about 1825 in TN.
ACTUALLY PRP             Grimbergen is an old rural town, originated in the 8th century. Actually it counts 32,746 inhabitants and is situated at exit 7 of the mainway Ring round Brussels.
AFTER PRP2 FATHER        Jonas Lie was born in Oslo, the son of a Norwegian civil engineer and an American mother. After his father died he went to Paris in 1892 to live with his uncle, the noted...
ALTHOUGH PRP             Cecrops was a mythical king of ancient Athens. Although he was not the first king Cecrops was viewed by the Athenians as their city's ancestor, indeed Athens was sometimes...
AT AGE CD PRP            Robert Taylor was born in Los Alamos, New Mexico and, following the end of WWII, his family moved to rural South Texas. At age 14 he decided on a career as a veterinarian and had ...
BEFORE PRP2 DEATH        Frances Hamerstrom was an Aldo Leopold graduate student at the University of Wisconsin. Before her death (1998) Dr. Hamerstrom was a distinguished ornithologist and writer.
BEFORE PRP               Brian Aldiss was born in East Dereham, Norfolk, England, son of a department store manager. Before he became a full time professional writer, he served in Burma and Indonesia ...
BETWEEN CD AND CD PRP    Lars Eriksson is a former soccer goalkeeper from Sweden. Between 1988 and 1995 he played 17 matches for the Sweden national football team, but was often used as bench cover for ...
BOTH OF PRP2 PARENTS     Dr. Krakowiak was born in Poland, but has been in the US since she was 11 years old. Both of her parents are chemists and she has "inherited" the passion for science from them.
CD YEARS LATER PRP       One of them was Grzegorz Gajdus, who was 13, clocked 3:50:22 and finished 1024th. 23 years later he was to become a national record holder in the marathon.
CURRENTLY PRP            Ahmose is a former charter member of Vatra Gitana and a student of Europamoon of Wilmington. Currently she is the leader of kenansville's tribal troupe The Barefoot Gypsies.

Table 3.6: Examples of template co-reference resolution. (Part I)

However, as a means of making this sort of strategy efficient, this would require a special index of the collection that makes it possible to substantially boost the recall of temporally anchored definition snippets or sentences. Other types of templates link the definiendum with assorted entities such as parents (e.g., “BOTH OF PRP2 PARENTS"), events (e.g., “DURING NNP PRP" and “AFTER PRP2 MARRIAGE"), clarifications (e.g., “ACTUALLY PRP" and “DESPITE PRP2 NAME"), and causation (e.g., “BECAUSE PRP2 LOCATION" and “DUE TO PRP2 POPULARITY"). These templates were extracted under the assumption that they are uttered in a descriptive context. Hence, a higher accuracy is achieved when the system already knows that the previous sentence is a definition.
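The speed-oriented variant described above can be sketched with a few lines of code: templates such as “IN CD PRP" (tables 3.6 and 3.7) are turned into regular expressions, with CD, PRP, PRP2 and NNP expanded into cheap surface approximations. The placeholder expansions are assumptions consistent with the description above; the example sentence is the Attila entry of table 3.6.

```python
# Minimal sketch: turning templates such as "IN CD PRP" or "AT AGE CD PRP"
# (tables 3.6 and 3.7) into regular expressions. CD, PRP, PRP2 and NNP are
# expanded with cheap surface approximations, as suggested above for the case
# where speed matters more than precise tagging.
import re

PLACEHOLDERS = {
    "CD":   r"\d+",
    "PRP":  r"(?i:he|she|it|they)",
    "PRP2": r"(?i:his|her|its|their)",
    "NNP":  r"[A-Z]\w+(?:\s+[A-Z]\w+)*",
}

def template_to_regex(template: str) -> re.Pattern:
    parts = [PLACEHOLDERS.get(tok, f"(?i:{re.escape(tok)})")
             for tok in template.split()]
    return re.compile(r"^\s*" + r"\s+".join(parts) + r"\b")

templates = ["IN CD PRP", "AT AGE CD PRP", "DURING NNP PRP"]
sentence = "In 433 he became king of the Huns."
for t in templates:
    if template_to_regex(t).search(sentence):
        print("matched template:", t)    # matched template: IN CD PRP
```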

3.6 Conclusions

To sum up, this chapter dealt at length with some of the phenomena emerging when matching words carried by the definiendum with document extracts obtained from the target collection.

Template                 Example
DESPITE PRP2 NAME        The African goose is a remarkably massive bird which has a heavy body and a thick neck. Despite its name, the African goose is a relative of the Chinese goose; both of them have ...
DURING NNP PRP           RAF Chilbolton was a World War II airfield in England located 4 miles SE of Andover in Hampshire. During World War II it was used by the Royal Air Force and the Army Air ...
FOR CD YEARS PRP         Brent Wilson is a Penn State emeritus professor of art education. For 33 years he has studied children's graphic narratives in diverse cultures and has published the results of ...
FROM CD TO CD PRP        General Ruben Fulgencio Batista was a Cuban military officer, politician and military leader from 1933 to 1940. From 1940 to 1944 he was the president of Cuba.
FROM THERE PRP           Andrew Coyle is a writer/director who has trained as an actor at the National Theater. From there he studied film at the Royal Melbourne Institute of ...
IN CD PRP                Attila was born near Budapest in Central Europe to the royal family of Huns. In 433 he became king of the Huns and began the process of turning his tribesmen into a powerful ...
IN PRP2                  Annie Dillard is a Christian poet and author who lived on a creek in Pennsylvania for a year. In her book Pilgrim at Tinker Creek, she wrote about her experience of living in the ...
OFTEN PRP                Bresegard bei Picher is a municipality in the German state of Mecklenburg-Vorpommern. Often it is referred to simply as Bresegard.
SINCE CD PRP             Coca Cola GM is the top football division in Greenland and it was created in 1958. Since 1971 it is organised by the Football Association of Greenland.
THEN PRP                 Chris Roberts was a fighter pilot, a flying instructor and a test pilot. He served a tour with the Arrows. Then he was a test pilot at Dunsfold on Harriers, finally becoming a pilot ...
TODAY PRP                Abdul Latif Abdul Hamid was born in 1954 and graduated in Cinema from the Moscow Higher Institute in 1984. Today he works at the Damascus General Film Institute.
UPON PRP2 ARRIVAL        Yves Brayer was born in Versailles in 1907, but spent most of his childhood in Bourges. Upon his arrival in Paris in 1924, he set out immediately...

Table 3.7: Examples of template co-reference resolution. (Part II)

It then dissects some of the reasons why the topic(s) specified by the definiendum do not necessarily coincide with the topic(s) verbalised by the document excerpts retrieved by the IR engine. It next presents and contrasts some strategies utilised by current definition QA systems to cope with these two issues. In general, both problems are critical to the performance of definition QA systems, and, ultimately, current evaluations have not settled on ways to quantitatively compare different systems with respect to these facets.

In addition, this chapter discusses some methodologies employed by definition QA systems for anaphora resolution, and it provides insight into the use of evidence mined from Wikipedia for aiding in this task.

Generally speaking, these challenges have to do with recognising and selecting the most promising answer candidates that will be processed in the succeeding steps of the answering pipeline. Since this is an early and intermediate phase in the entire process, definition QA systems are unlikely to divert computational resources from other vital stages to these tasks, entrusting crucial succeeding tasks and redundancy with improving the outcome of these two challenges. To be more specific, definition QA systems can efficiently extract descriptive sentences about the definiendum by resolving anaphora focused primarily on the definiendum, that is, without full anaphora resolution for all anaphors in all documents. At any rate, it is unclear whether or not the usage of full co-reference resolution would result in a better overall performance at the expense of processing time.

Chapter 4 Heuristics for Definition Question Answering used by TREC systems

“1) Numbers are tools, not rules. 2) Numbers are symbols for things; the number and the thing are not the same. 3) Skill in manipulating numbers is a talent, not evidence of divine guidance. 4) Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners. 5) The product of an arithmetical computation is the answer to an equation; it is not the solution to a problem. 6) Arithmetical proofs of theorems that do not have arithmetical bases prove nothing." (Ashley-Perry Statistical Axioms)

“Definition of Statistics: The science of producing unreliable facts from reliable figures." (Evan Esar).

4.1 Introduction

In the last decade, a myriad of techniques have been developed as a means of discovering answers to definition questions in, predominantly, English. Most of these approaches have been developed in the context of the Question Answering (QA) track of the Text REtrieval Conference (TREC) challenge. In this track, QA systems are encouraged to seek answers across a collection of news documents known as the AQUAINT corpus. Accordingly, this chapter builds on some notable strategies designed by these systems:

• Section 4.2 dissects the relevant aspects regarding the utilisation of definition patterns for discerning answers to definition questions in natural language texts.

• Section 4.3 focuses on the exploitation of Knowledge Bases (KB) for definition QA.

• Section 4.4 investigates the use of superlatives and numerals as indicators of promising answer candidates to definition questions.

• Section 4.5 lays out attributes that help to differentiate descriptions from other sorts of text.

• Section 4.6 goes over the benefit provided by triplets consisting of the definiendum, named entities and some prominent and characterising nouns and verbs.


• Section 4.7 deals at length with the amalgamation of positive and negative evidence.

• Section 4.8 presents the centroid vector.

• Section 4.9 fleshes out the soft matching approach to aligning definition patterns.

• Section 4.10 details the ranking of definition patterns in conformity with the ground truth given by TREC.

• Section 4.11 specifies a strategy predicated on trigram Language Models (LM).

• Section 4.12 elaborates on the influence of web frequency counts on the rating of putative answers.

• Section 4.13 discusses a technique for extracting definition patterns premised on Part-of-Speech (POS) tags and named entities.

• Section 4.14 stresses the pertinence of the head of noun and verb phrases for distinguishing answer candidates.

• Section 4.15 underscores the importance of predicates and the layout of KB web pages in the ranking of putative answers.

• Section 4.16 outlines the construction of the profile of the definiendum, and the extraction of propositions.

• Section 4.17 digs deeper into the use of the BLEU metric and the utilisation of relational databases as a source of answers.

Lastly, section 4.18 summarises the key aspects laid out in this chapter and brings it to a close.

4.2 Definition Patterns

In their early work on definition QA, [H. Joho and M. Sanderson, 2000] carried out experiments designed to assess the performance of a lightweight definition QA system, which selected the top five and ten sentences rated in concert with a combination of the following three criteria:

1. There are some lexico-syntactic regularities that frequently signify descriptive content. These constructs are typified in the form of patterns. To be more precise, the ranking function of [H. Joho and M. Sanderson, 2000] checks whether the candidate sentence matches any of the following patterns at the surface level:

(a) (such | such) as ⇒ “... diseases such as mad cow"
(b) (and | or) other ⇒ “mad cow and other diseases ..."
(c) especially ⇒ “diseases especially mad cow..."
(d) including ⇒ “diseases including mad cow..."

(e) () or () ⇒ “TB (Tuberculosis) ..."
(f) (is | was | are | were) (a | an | the) ⇒ “Gordon Brown is a politician..."
(g) , (a | an | the) ⇒ “Gordon Brown, the politician,..."
(h) , which (is | was | are | were) , ⇒ “That’s the Gordon Brown, which is pro-nuclear, pro-airport expansion, pro-Iraq war and pro-unlimited capitalism."
(i) , , (is | was | are | were) ⇒ “Gordon Brown, British Prime Minister, is preparing to use a speech to the US ..."

As a matter of fact, [H. Joho and M. Sanderson, 2000] associated each of these clues with an experimental accuracy, since they noticed that some lexico-syntactic constructs are better than others at discovering descriptions. This assessment was conducted on a small training set of sentences embodying the definiendum, and table 4.1 accordingly highlights the experimental accuracies that they obtained. Interestingly, all rules are relatively infrequent in comparison with the number of identified descriptions that do not observe one of their pre-determined regularities. Some clauses seem to be much more accurate than others, but they are, at the same time, less frequent in natural language texts. Thus, accounting solely for the most precise rules can bring about a detriment to the diversity of the final output, and to the number of answered questions. In conclusion, this difference in accuracy and frequency stresses the necessity for increasing the precision of all patterns, because less frequent and more specific lexico-syntactic clauses are likely to convey similar descriptions. All in all, the figures in table 4.1 were employed to assign higher scores to answer candidates matching more reliable rules.

Pattern       Not Relevant   Relevant   Total   Accuracy
No Pattern    6424           872        7296    12.0%
especially    0              0          0       0.0%
, ,           89             63         152     41.4%
is a          23             18         41      43.9%
including     20             17         37      45.9%
or other      1              1          2       50.0%
such as       59             59         118     50.0%
acronym       14             23         37      62.2%
and other     9              23         32      71.9%

Table 4.1: Accuracy of definition patterns (source [H. Joho and M. Sanderson, 2000]).
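The following is a minimal sketch, under our own simplifying assumptions, of how such surface-level clues can be matched and weighted. The regular expressions, the helper name and the reuse of the table 4.1 accuracies as scores are illustrative choices, not a reproduction of the original system.

import re

# Each clue is approximated by a regular expression paired with the
# experimental accuracy reported in table 4.1; the accuracy of the most
# reliable matching clue is later used as the KPW factor of the ranking.
PATTERNS = [
    (re.compile(r"\band other\b", re.I), 0.719),
    (re.compile(r"\(\s*\w[\w .-]*\s*\)"), 0.622),               # acronym in parentheses
    (re.compile(r"\bsuch as\b", re.I), 0.500),
    (re.compile(r"\bor other\b", re.I), 0.500),
    (re.compile(r"\bincluding\b", re.I), 0.459),
    (re.compile(r"\b(is|was|are|were)\s+(a|an|the)\b", re.I), 0.439),
    (re.compile(r",\s*[^,]+\s*,"), 0.414),                      # appositive between commas
    (re.compile(r"\bespecially\b", re.I), 0.0),
]

def key_phrase_accuracy(sentence: str) -> float:
    """Return the accuracy of the most reliable matched clue (0 if none)."""
    return max((acc for rx, acc in PATTERNS if rx.search(sentence)), default=0.0)

print(key_phrase_accuracy("Gordon Brown is a politician."))     # 0.439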

2. In effect, [H. Joho and M. Sanderson, 2000] noticed that if a definiendum is outlined in one document, it is then very likely to be described in other documents. Under this assumption, they reasonably determined that some of these descriptions will share some words, hence creating redundancy that can be exploited to recognise trustworthy descriptive knowledge. First, they examined each document that contains the definiendum, and took only the first sentence where it was found. Second, the case of the words in these sentences was normalised, stop words were removed, and a stemmer

was applied. Third, they kept the twenty most frequent terms. Each candidate answer was later assigned a score based on the number of these top twenty words it contains. Ergo, those candidates embracing more of these common terms were given a higher score.

3. The ordinal position of the sentence embracing the definiendum was incorporated into the ranking, so that earlier sentences obtain a higher score.

One of the aims of [H. Joho and M. Sanderson, 2000] was to study the effectiveness of applying their rules at the word level. In practical terms, an efficient procedure of this nature would, in all likelihood, be preferable to a method predicated on parsing because of its speed, its simplicity, and its potential application to wide domains and large collections. Eventually, in order to rate a candidate sentence Ai, these three criteria were synthesised as follows:

$$Score(A_i) = 2000 \cdot KPW + 1 \cdot WC + 75 \cdot (500 - SN)$$

In this ranking function, KPW is the key phrase accuracy, WC denotes the co-occurring word count, and SN is the sentence number (a schematic implementation of this combination is sketched after table 4.2). This approach takes advantage of the second and third criteria as a fallback when no pattern matches a candidate sentence. Experimentally, they tested their ranking methodology on a collection of documents that is part of the TREC data, comprising 475 MB of articles distilled from the LA Times between 1989 and 1990.

Interestingly enough, a reinterpretation of this scoring function leads to thinking that the inception of a relevant concept, that is its first mention, in a piece of news usually goes together with a brief description. Additionally, it can also be conceived that, at some point in the news article, the more a concept has been previously mentioned, the less likely it is that it will be described. This conjecture, of course, does not hold when the concept is the topic of the news article. Simply put, crucial people, events, locations and things taking part in a piece of news are very likely to be defined the first time they are mentioned in the article, so that the news reader can get a clear picture of the reported event. As a logical consequence, these introductory sentences can be rendered as good answer candidates, when the introduced object matches the definiendum. However, every rose has its thorn: this type of feature can make the definition QA system less portable to other kinds of collections.

In their experiments, fifty definienda were utilised as a test set. In substance, each sentence was assumed to be pertinent whenever it aided in understanding more about the respective definiendum. Above all, table 4.2 emphasises one interesting finding: the performance of their definition QA system declined as the size of the collection was reduced, due to a diminution in the chance of aligning a definition pattern.

Precision at k   100%   75%    50%    25%    10%
1                0.75   0.78   0.69   0.63   0.62
5                0.51   0.51   0.46   0.38   0.32
10               0.42   0.40   0.35   0.28   0.24

Table 4.2: Precision at k versus collection size (source [H. Joho and M. Sanderson, 2000]).
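Below is a toy rendering of the combined score; the helper inputs (the best-matching pattern accuracy, the co-occurring word count and the sentence position) are assumed to be computed beforehand, for example as in the sketch following table 4.1.

def score_sentence(kpw: float, wc: int, sn: int) -> float:
    """Combine the three criteria exactly as in the formula above.

    kpw: accuracy of the best-matching key phrase (0 if none matched),
    wc:  number of the top-twenty co-occurring stems found in the sentence,
    sn:  ordinal position of the sentence within its document.
    """
    return 2000 * kpw + 1 * wc + 75 * (500 - sn)

# e.g. a copular sentence (pattern accuracy 0.439) appearing first in its
# document and sharing three of the top-twenty terms:
print(score_sentence(0.439, 3, 1))   # 2000*0.439 + 3 + 75*499 = 38306.0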

For the best performing method, 90% of the queries had a correct answer in the top five, whereas 94% had one in the top ten. Further, [H. Joho and M. Sanderson, 2000] dissected their results by juxtaposing the performance accomplished by each separate ranking criterion, and the three merged. Table 4.3 draws attention to the fact that the integration of ranking criteria outperforms each individual score, and definition patterns proved to be a great contributor to the performance of their system. Furthermore, [H. Joho and M. Sanderson, 2001] tested this procedure on sentences containing the definiendum taken from the top 600 web documents returned by Google. As a result of this extra experiment, their definition QA system sharply boosted its performance; more precisely, it got an answer ranked in the top five for all 96 testing queries. Given these outcomes, they speculated that their system was able to collect more reliable cross-document term occurrence statistics, and that the size of the corpus raised the probability of discovering an answer candidate that matches a purpose-built definition pattern.

Precision at k   Combined   Key Phrases   Word Co-occurrence   Sentence Number
1                0.76       0.75          0.37                 0.25
5                0.57       0.51          0.35                 0.27
10               0.46       0.42          0.35                 0.27
15               0.42       0.36          0.33                 0.24
20               0.38       0.32          0.32                 0.23
30               0.32       0.26          0.28                 0.22
100              0.17       0.15          0.16                 0.15

Table 4.3: Precision at k of each rating strategy (source [H. Joho and M. Sanderson, 2000]).

Contrary to [H. Joho and M. Sanderson, 2000], [Hildebrandt et al., 2004] benefited from an array of clauses that operated at the word and part-of-speech level. They grouped words in consonance with their Part-of-Speech (POS) tags, thereby roughly identifying the boundaries of noun phrases. They applied eleven rules to the entire AQUAINT corpus as a means of building a searchable database of definitions (see also section 2.1 on page 25). The following is the list1 of patterns presented by [Hildebrandt et al., 2004]:

(a) NPd be NPn ⇒ “A rabona is a move that involves a player shooting or crossing a ball after bringing his kicking foot from behind his non-kicking foot."

(b) NPd become NPn ⇒ “Psammetichus I became independent only after the end of Assurbanipal’s reign (625 BCE), and ruled until 610 BCE to see Assyria disappear, and Babylonia rise to power in Asia."

(c) NPd v NPn ⇒ “A Corner of the Universe won a Newbery Honor in 2003."

(d) NPd, NPn // NPn, NPd ⇒ “Charles Higham, British archaeologist."

(e) NPn NPd ⇒ “MLB outfielder Tom Brown"

(f) NPd (NPn) ⇒ “ACM (Advanced Cruise Missile)"

(g) NPd, (also) known as NPn // NPn, (also) known as NPd ⇒ “ARP spoofing, also known as ARP poisoning, is a technique used to attack an Ethernet network which may allow an attacker to sniff data frames on a switched local area network (LAN) or stop the traffic altogether (known as a denial of service attack)."

1In this list, // signals an alternative use of the construct. This normally means an inversion or a swap between the definiendum and its description.

(h) NPn, (also) called NPd ⇒ “Giovanni Battista Cima, also called Cima da Conegliano (c. 1459 - c. 1517) was an Italian Renaissance painter."

(i) NPd, or NPn ⇒ “myopia, or nearsightedness...."

(j) NPn (such as|like) NPd ⇒ “...former Confederates such as Thomas Jefferson Foster were denied ..."

(k) NPd (which|that) V Pn ⇒ “Creeping elegance, which is related to creeping featurism and second-system effect is the tendency of programmers to disproportionately emphasize elegance in software at the expense of other requirements such as functionality, shipping schedule, and usability."

In this pattern representation, NPd and NPn stand for the noun phrase pertaining to the definiendum and the description, respectively. Some remarks on these regularities2 are: (1) construct (a) enforces a determiner at the beginning of the sentence, this way discarding some spurious matches, (2) the verb “v" in rule (c) represents a commonly-used biographical verb gathered from biographies of people, (3) pattern (e) checks whether or not a noun denoting an occupation (e.g., “actor", “skier" and “writer") is the head of NPn. Table 4.4 underscores the accuracy reported by [Hildebrandt et al., 2004] for each pattern. This accuracy was obtained by means of a test set of 160 definition questions.

Pattern   Accuracy
(a)       0.3537
(b)       0.2500
(c)       0.2609
(d)       0.3040 / 0.6000
(e)       0.6935
(f)       0.3491
(g)       0.8571
(h)       0.5000 / 0.8000
(i)       0.6774
(j)       0.6460
(k)       0.6667

Table 4.4: Accuracy of definition patterns (source [Hildebrandt et al., 2004]).
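As a rough illustration of this kind of word-level matching, the sketch below approximates rule (a), the copular pattern with an enforced determiner, with a single regular expression. It is a simplification for exposition only and not the rule engine of [Hildebrandt et al., 2004], which operates on POS tags and noun-phrase boundaries.

import re

# Rule (a) "NPd be NPn", requiring a determiner at the start of the sentence
# to discard some spurious matches; NPd is approximated by up to four tokens.
COPULA = re.compile(
    r"^(An|A|The)\s+(?P<npd>[\w'-]+(?:\s+[\w'-]+){0,3}?)\s+"
    r"(is|was|are|were)\s+(?P<npn>.+)$"
)

def match_copula(sentence: str):
    """Return (definiendum NP, description NP) if the copular rule fires."""
    m = COPULA.match(sentence.strip().rstrip("."))
    return (m.group("npd"), m.group("npn")) if m else None

print(match_copula("A rabona is a move that involves a player shooting a ball"))
# ('rabona', 'a move that involves a player shooting a ball')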

In a related manner, various additional lexico-syntactic clauses can be found across the literature on definition QA. Take, for instance, some of the patterns taken into consideration by [Xu et al., 2005, Cui et al., 2007] in conjunction with the regularities previously sketched in table 3.1 on page 65:

(a) is one ⇒ “Hemochromatosis is one of the most common genetic disorders in the United States."

(b) , (a | an) ⇒ “... in , a cosmopolitan city..."

2An extensive study of this rule matching approach can be found in [Fernandes, 2004].

(c) (is|are) (usually|generally|normally)* (being used to|used to|referred to|employed to|defined as|formalized as|described as|concerned with|called) ⇒ “Lymphangiectasis is defined as a congenital or acquired dilation of the lymphatic vessels."

(d) (refer to|refers to|satisfies|satisfy) ⇒ “Spinal Stenosis refers to a narrowing of the canal surrounding the spinal cord."

4.3 Knowledge Bases

Originally, [Wu et al., 2004] implemented a definition module geared towards discovering definitions across the AQUAINT corpus. In detail, this component selected putative answers from this collection of documents, and rated them thereafter in accordance with their similarities to an array of trustworthy definitions. This dependable descriptive knowledge was gathered from several Knowledge Bases (KB) such as www.encyclopedia.com and WordNet glosses. Since they noticed that KBs differ from each other in several aspects, including coverage, reliability and relevance, this ranking technique makes use of experimental weights wd for counterbalancing these intrinsic differences. Consequently, in their model, the score of an answer candidate Ai is determined by the weighted sum of its similarity to its corresponding definition extracted from each knowledge base Dd:

$$Score(A_i) = \sum_{\forall d} w_d \cdot sim(A_i, D_d) \qquad (4.1)$$

This similarity score is given by the Term Frequency-Inverse Document Frequency (TF-IDF) with Dd and Ai treated as bags of words. Here, they employed the TF-IDF formula outlined in [Salton and Buckley, 1988]. Incidentally, the empirical weights wd were normalised and assigned in agreement with the authoritativeness of the KB [Zhang et al., 2005], and the output of their system comprised the top-ranked sentences. In order to test the influence of these experimental weights, they attempted two slightly different configurations in the context of the TREC 2004 challenge. These configurations finished with two quite dissimilar average F(3)-Scores: 0.404 and 0.389. This definition QA module nevertheless produced the second best run in this track, in which the best response across all participants accomplished an average F(3)-Score of 0.460 [Voorhees, 2004].

According to [Wu et al., 2004], definitions from KBs share some common pieces of descriptive information, while other pieces can be irrelevant. For this reason, they selected an “Essential Definition Set" by means of a purpose-built summarisation algorithm. Answers were then chosen by their resemblance to each definition in this essential set in combination with a threshold. At each selection, the putative answer with the highest similarity is picked until there is no further selection possible (no similarity surpasses the threshold or no answer candidate is left) or a maximum character length is reached. As a result, the performance of their system declined from 0.404/0.389 to 0.367.

In the next TREC assessment, [Wu et al., 2005a] extended and enhanced this system by making allowances for several new attributes. First, they classified the definiendum into three pre-defined sorts: person, organisation and thing. This classification is performed via heuristic rules, and the obtained type is utilised for deciding which KBs will be mined. Later, the ranking score of putative answers conforms to equation 4.1. As a fallback strategy, they rated answer candidates in consonance with a vector compounded of words along with their respective frequencies. Therefore, [Wu et al., 2005a] implicitly postulated that words highly correlated with the definiendum in the target corpus are more fundamental

to characterise it, analogously to [H. Joho and M. Sanderson, 2000] (see section 4.2). Eventually, this improvement helped to obtain an average F(3)-Score of 0.231 in the TREC 2005 challenge.

On a different note, [Cui et al., 2004c] observed that definitional patterns can filter out statistically highly-ranked sentences that do not express a definition. They also noticed that these regularities can bring into the answer set definition sentences that are written in styles characteristic of definitions, but are not statistically significant. In light of this observation, [Wu et al., 2005a] harvested definiendum-nugget pairs from the two previous TREC definition QA challenges, in order to generate a set of rules for each of their definiendum classes. Subsequently, they formed a training set encompassing the answer sentences pertaining to these pairs, grouped by definiendum type. Afterwards, they extracted windows of five words3 centred on the definiendum. They next rated these regularities in tandem with their frequency in the training set. Some top-ranked constructs acquired for the type person are listed in table 4.5.

Structure Pattern   Weight
, the               0.094
, a                 0.042
, who               0.030
, was a             0.012
known as , is       0.012

Table 4.5: Top-ranked definition patterns for the type “Person" (source [Wu et al., 2005a]).
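To make the ranking of equation 4.1 above concrete, the sketch below combines per-KB similarities with normalised weights. The weights, the plain bag-of-words cosine standing in for the TF-IDF similarity of [Salton and Buckley, 1988], and the sample KB texts are all illustrative assumptions rather than the configuration of [Wu et al., 2004].

import math
from collections import Counter

# Illustrative, hand-picked KB weights (wd); the originals were empirical and
# normalised according to the authoritativeness of each KB.
KB_WEIGHTS = {"wikipedia": 0.5, "encyclopedia.com": 0.3, "wordnet": 0.2}

def sim(a: str, b: str) -> float:
    """Plain bag-of-words cosine similarity (a stand-in for TF-IDF)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def score(candidate: str, kb_definitions: dict) -> float:
    """Equation 4.1: weighted sum of similarities to each KB definition."""
    return sum(KB_WEIGHTS[d] * sim(candidate, text) for d, text in kb_definitions.items())

kb = {"wikipedia": "Gordon Brown is a British politician who served as Prime Minister",
      "encyclopedia.com": "British Labour politician and Prime Minister from 2007 to 2010",
      "wordnet": "a politician"}
print(score("Gordon Brown is a British politician", kb))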

Thus, [Wu et al., 2005a] made use of hard or strict pattern matching for rating an answer candidate Ai, and both scores were accordingly weighted as follows:

$$Score'(A_i) = 0.7 \cdot Score(A_i) + 0.3 \cdot PatternWeight(A_i)$$

Here, 0.7 and 0.3 are normalised empirical weights, and their definition QA system returned the top-ranked sentences. This enhancement enabled their system to finish with an average F(3)-Score of 0.232 in the TREC 2005 challenge. This result corresponded to the second best response in this track, where the best run achieved an average F(3)-Score of 0.248 [Voorhees and Dang, 2005]. In the succeeding TREC definition track, [Zhou et al., 2006] reformulated their ranking strategy to the following procedure:

$$InitScore(A_i) = 0.8 \cdot DefiniendumScore(A_i) + 0.2 \cdot DocumentScore(A_i) \qquad (4.2)$$

This ranking function weights the evidence regarding the definiendum as more instrumental than the information taken from documents. First of all, the evidence concerning the definiendum is derived as follows:

$$DefiniendumScore(A_i) = 0.3 \cdot \frac{c(w)}{n_w} + 0.3 \cdot \frac{c(p)}{n_p} + 0.4 \cdot \frac{c(e)}{n_e}$$

In this mathematical relation, nw, np and ne are the number of words, phrases and named entities within the putative answer (Ai), whereas c(w), c(p) and c(e) stand for the number of the

3Indeed, the patterns in table 4.5 show that they can be shorter and not necessarily centred on the definiendum; this can be interpreted as the result of some posterior processing, namely trimming.

words, phrases and named entities that are in both Ai and the definiendum. Secondly, the evidence coming from the documents is gathered as follows:

$$DocumentScore(A_i) = Maxdocw(A_i) \cdot \left(2 - \frac{2 \cdot docn(A_i)}{1 + docn(A_i)^2}\right)$$

Specifically, docn(Ai) is the number of returned documents (e.g., by the Information Retrieval (IR) engine) subsuming the answer candidate Ai, while Maxdocw(Ai) stands for the maximum score of these documents.

As a means of tackling data sparseness and the underspecification of the definiendum, [Zhou et al., 2006] sifted an array of related (expansion) terms T (i.e., words, phrases and entities) from the set of candidate sentences A. To put it more precisely, [Zhou et al., 2006] computed the Relativity(ti) between a plausible expansion term ti ∈ T and the definiendum in sympathy with the following equation:

$$Relativity(t_i) = \sum_{\forall A_i \in A} E(t_i, A_i) \cdot InitScore(A_i)$$

Where E(ti, Ai) is equal to one whenever ti ∈ Ai, and zero otherwise. Next, [Zhou et al., 2006] chose the top fifteen terms as expansion words, phrases and entities. Subsequently, the relative term score RWScore(Ai) of a candidate sentence Ai is stipulated as below:

$$RWScore(A_i) = 0.3 \cdot \frac{\sum_{\forall w_i \in rw} Relativity(w_i)}{n_w} + 0.3 \cdot \frac{\sum_{\forall p_p \in rp} Relativity(p_p)}{n_p} + 0.4 \cdot \frac{\sum_{\forall e_e \in re} Relativity(e_e)}{n_e} \qquad (4.3)$$

The similarities between the definiendum and the relative words, phrases and entities are denoted by Relativity(wi), Relativity(pp) and Relativity(ee), respectively. The respective weight values suggest that named entities are more significant to this query expansion technique than words and phrases. The parameters nw, np and ne cohere with the number of words, phrases and named entities, respectively, within the answer candidate Ai, and rw, rp and re denote the sets of relative (expansion) words, phrases and entities. The final value is a linear combination4 of the web score (equation 4.1), the initial score (equation 4.2) and the related terms score (equation 4.3).

After ranking, redundant sentences are removed and their definition QA system outputs the twenty highest ranked sentences. They tested this new technique in the context of TREC 2006, reaching an average F(3)-Score of 0.222, in contrast to the system utilised in the previous year, which finished in this challenge with an average F(3)-Score of 0.159; an integration of both systems obtained an average score of 0.223. All runs ranked amongst the top systems; in particular, their best response was the second best across all participants in this challenge [Dang et al., 2006].
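A minimal sketch of the initial score in equation 4.2 and its definiendum component follows; the shared-item counts are assumed to be produced by some upstream tagging and matching step, which is not shown here.

def definiendum_score(shared_w: int, shared_p: int, shared_e: int,
                      n_w: int, n_p: int, n_e: int) -> float:
    """Fraction of the candidate's words, phrases and entities shared with the
    definiendum, weighted 0.3/0.3/0.4 as in the formula above."""
    frac = lambda c, n: c / n if n else 0.0
    return 0.3 * frac(shared_w, n_w) + 0.3 * frac(shared_p, n_p) + 0.4 * frac(shared_e, n_e)

def init_score(def_score: float, doc_score: float) -> float:
    """Equation 4.2: definiendum evidence outweighs document evidence."""
    return 0.8 * def_score + 0.2 * doc_score

# a candidate with 20 words (3 shared), 5 phrases (1 shared), 2 entities (1 shared)
print(init_score(definiendum_score(3, 1, 1, 20, 5, 2), doc_score=0.6))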

4.4 Definition Markers: Indicators of Potential Descriptive Knowledge

Inherently, the methodology introduced by [Kosseim et al., 2006] rates a candidate sentence Ai as the sum of the weights of the interesting terms Tj it contains.

4Apparently, the complete specification of this combination of factors is not meticulously detailed by [Zhou et al., 2006].

$$Score(A_i) = \sum_{\forall T_j \in A_i} Weight(T_j) \qquad (4.4)$$

These interesting marking terms are mined from Wikipedia articles related to the definiendum. Words with a frequency higher than one are deemed to be specific and crucial to the topic. In essence, [Razmara and Kosseim, 2007] considered as interest markers only named entities coinciding with locations, dates, person names and organisations (e.g., Titanic ⇒ White Star Line and J.F. Kennedy ⇒ Lee Harvey Oswald). They discarded all entities of a given type whenever they obtained more than twenty instances of it, because they assume this to be an indicator of an article biased towards a particular point of view. These interesting terms are exploited for fetching documents from the target collection, and rated in tandem with their frequency as follows:

$$Weight(T_j) = \log(Frequency(T_j)) \qquad (4.5)$$

Additionally, [Razmara et al., 2007] distinguished universal markers, which are important terms regardless of the class of definiendum. These words were determined empirically by examining TREC data from previous challenges. Particularly, they took into consideration two different kinds of universal markers: superlatives and numerals.

In truth, superlative adjectives and adverbs are critical because of the observation that people are interested in knowing about the best, the first, and the most wonderful more than discovering normal or average facts [Razmara et al., 2007]. Numerals are also very likely to supply useful descriptive information. Good examples are the sentences “Lufthansa flights to 200 different cities around the world." and “Jesus started his ministry at age 30". As a matter of fact, [Razmara and Kosseim, 2007] verified this hypothesis by finding the percentage of superlatives in “vital", “okay" and “uninteresting" sentences across the TREC 2004 challenge data. They found that superlatives and numbers are more likely to belong to sentences perceived as vital by the assessors.

             Number of Words   Superlatives (%)   Numerals (%)
vital        49,102            0.52               2.46
okay         56,729            0.44               2.26
irrelevant   2,002,525         0.26               1.68

Table 4.6: Proportion of superlatives and numerals versus sort of sentence (adapted from [Razmara and Kosseim, 2007]).
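The following sketch illustrates equations 4.4 and 4.5 together with the 20% boost for superlatives and numerals described next. The marker extraction from a single Wikipedia text, the minimum frequency of two, and the multiplicative reading of the boost are simplifying assumptions, not the authors' exact implementation.

import math
from collections import Counter

def marker_weights(wiki_text: str, min_freq: int = 2) -> dict:
    """Equation 4.5: interest markers (here, any token occurring more than once
    in the related Wikipedia article) weighted by the log of their frequency."""
    counts = Counter(wiki_text.lower().split())
    return {t: math.log(c) for t, c in counts.items() if c >= min_freq}

def score(sentence: str, weights: dict, n_superlatives: int, n_numerals: int) -> float:
    """Equation 4.4 plus a 20% raise per superlative or numeral instance."""
    base = sum(weights.get(t, 0.0) for t in sentence.lower().split())
    return base * (1.2 ** (n_superlatives + n_numerals))

w = marker_weights("titanic titanic white star line white star line southampton")
print(score("the titanic was operated by the white star line", w, 0, 0))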

As a means of accounting for this, the score of sentences embodying numerals and superlatives was raised by 20% for each instance they bear. However, numerals that are part of a date expression are excluded, as they are already included in the interesting terms acquired from the Wikipedia entry. In the context of the TREC 2006 track, slightly different configurations of this strategy achieved average F(3)-Score values from 0.180 to 0.199, while accomplishing average F(3)-Score values from 0.275 to 0.281 in the TREC 2007 challenge. In both TREC tracks, this definition QA system submitted the third best response. In TREC 2006, the first and the second runs scored 0.250 and 0.233 [Dang et al., 2006], respectively, whereas they scored 0.329 and 0.299 in TREC 2007 [Dang et al., 2007].

In sympathy with superlatives and numerals, [Razmara and Kosseim, 2007] also explored interesting terms corresponding to three different classes of definiendum: thing, person and organisation. Like superlatives and numerals, they inspected TREC 2004 data for this purpose. They scored each stemmed keyword Ki in consonance with the following formula:

$$Score(K_i) = Frequency(K_i) \cdot Distrib(K_i)^2 \qquad (4.6)$$

where Frequency(Ki) is the frequency of the keyword Ki and Distrib(Ki) stands for the number of definienda whose sources embrace Ki. The underlying idea here is to prefer keywords that are referred to by a higher number of definienda, to the detriment of those cues that appear frequently but only for a few definienda; the former are therefore viewed as more critical. In order to compare terms carried by uninteresting sentences versus keywords within interesting sentences, [Razmara and Kosseim, 2007] used the ratio of both Score(Ki) values. In accordance with this ratio, table 4.7 highlights the fifteen highest rated keywords per definiendum type. This table also supports the importance of stressing superlatives. Consequently, [Razmara and Kosseim, 2007] favoured sentences embodying these words by increasing their score by 20% per matching term.

Class          Markers
all types      found, die, associ, life, begin, publish, first, public, servic, group, death, see, countri, old, most
thing          kind, fall, public, found, countri, offici, field, program, develop, director, begin, discov, particl, power, figur
person         born, servic, serv, become, film, general, old, movi, chairman, place, receiv, begin, win, life, intern
organisation   chang, publish, establish, first, leader, associ, larg, found, releas, project, group, lead, organ, begin, provid

Table 4.7: Top interest-marking keywords versus class of definiendum (adapted from [Razmara and Kosseim, 2007]).

The contribution of each sort of marker was assessed by means of the definition question set produced by the TREC 2005 challenge. In their evaluation, [Razmara and Kosseim, 2007] tested the performance of their system using all types of markers, and leaving out one of them at a time.

Markers                     F(β)-Score
including all types         0.265
excluding superlatives      0.255
disregarding numerals       0.257
ignoring marking keywords   0.266

Table 4.8: Results obtained when dealing with the TREC 2005 definition questions (adapted from [Razmara and Kosseim, 2007]).

The outcomes in table 4.8 show that numeral and superlative markers improved the performance, while keyword markers are inclined to diminish it. The reason for this degradation is that the TREC 2004 set is too small for training (only 64 questions). Distinctly, the TREC 2005 challenge enriched the definiendum types with events, whereas TREC 2004 did not contemplate this kind of definiendum. A breakdown of the achieved average F(3)-Score per class is as follows: person 0.300, thing 0.277, organisation 0.268 and event 0.210. Since the TREC 2005 question set was composed of nineteen questions per type, with the exception of only eighteen questions targeted at events, the steep decline in the case of events reaffirms the data sparseness of the TREC 2004 training material.

Along the same lines, [Kaisser et al., 2006] investigated the nuggets subsumed in the TREC gold standards. To begin with, [Kaisser et al., 2006] created a word frequency list from all nuggets judged as vital in the TREC 2004 and 2005 ground truths. The assumption here is that frequently occurring words serve as importance indicators in answering definition questions. The following table shows the twenty most frequent words:

Rank   Word   Frequency     Rank   Word      Frequency
1      of     1,262         11     from      215
2      in     1,255         12     first     198
3      the    895           13     largest   187
4      to     755           14     million   181
5      and    498           15     at        175
6      for    37            16     on        174
7      a      411           17     with      164
8      is     341           18     as        157
9      was    265           19     has       153
10     by     254           20     most      139

Table 4.9: The twenty most frequent terms across the TREC gold standard (source [Kaisser et al., 2006]).

A key finding of [Kaisser et al., 2006] points to the fact that all but one of the instances of “most" assessed as “vital" are part of a superlative construction: most + adjective/adverb. More crucially, [Kaisser et al., 2006] also discovered that, on average, at least half of the TREC definienda have at least one superlative nugget. In essence, 32 out of 69 superlative nuggets were judged as “vital" by more than nine assessors. Principally, their findings are in line with and corroborate the observations of [Razmara and Kosseim, 2007].

Further, [Scheible, 2007] conducted extra studies on the nuggets containing superlatives. These studies suggest that distinct sorts of superlatives can be discriminated on the grounds of the relationship between the definiendum and the superlative target:

1. both match, e.g., “AARP" ↔ “Largest seniors organization".

2. or the target is part of the definiendum, closely related to, or part of the comparison set, e.g., “Florence Nightingale" ↔ “Medal Nightingale highest international nurses award".

3. Lastly, the definiendum is unrelated or only loosely related to the target of the superlative, e.g., “Kurds" ↔ “Irbil largest city controlled by Kurds".

Furthermore, [Kaisser et al., 2006] discovered that 46 out of the 69 TREC 2004/05 superlative nuggets fall in the first group, whereas fifteen and eight fall in the second and third categories, respectively. The distribution of judgements given by the assessors showed that 87% of the 46 nuggets with respect to the first class were interpreted as “vital", while only 59% of the fifteen and 37% of the eight in relation to the second and third groups, respectively. In TREC 2006, sixty nuggets embodying superlatives were perceived as “vital" or “okay". What is more, [Kaisser et al., 2006] noticed that these text fragments show a similar distribution: 91% of the first class superlatives were judged as “vital", but solely 54% and 20% of the second and third categories, respectively.

One must bear in mind, however, that superlatives are not the magic bullet that will solve the definition ranking problem. In reality, the sole presence of a superlative does not make a sentence a definition. Take, for instance, the sentence “definiendum is the most arrogant man in the world." Additional properties must be, therefore, taken into consideration when rating sentences carrying superlatives.

4.5 Negative Evidence: Markers of Potential Non-Descriptions

An interesting aspect of the ranking strategy adopted by [Kil et al., 2005] is the notion of negative evidence. They conceived negative evidence as those elements that make an answer candidate unlikely to be a genuine answer, that is to say, strong indicators of content different from descriptive knowledge. The deliberate intention behind making allowances for negative evidence is to diminish the ranking of candidate sentences embodying these attributes, or alternatively, to simply discard these putative answers.

More specifically, [Kil et al., 2005] considered the presence of a first or second person pronoun in the subject of the sentence, namely “I", “We" or “You", as negative evidence, because these candidate sentences are highly likely to only render a subjective opinion. It is unclear, however, how they validated this, and one should be aware of definienda embracing these pronouns when checking. For instance, royal names such as “King Henry I" or names of books or songs including Michael Jackson’s “I Wanna Be Where You Are". Nonetheless, some factual descriptive content can still be put into words in combination with subjective opinions. The

next three web snippets about “US President Barack Obama" exemplify this ambivalence:


Therefore, depending on the amount of redundancy and the degree of diversity required in the final output, it is necessary to employ some linguistic processing and alias resolution technique in order to prevent the definition QA system from discarding crucial information all the time. In this methodology, all answer candidates start with the same initial score, and the values are thereafter boosted or decreased in accordance with the evidence each candidate offers. The score is augmented whenever sentences fully bear the definiendum, otherwise it is abated. Additionally, the ranking value was reinforced whenever the answer candidate observed some widespread syntactic definition patterns (see greater details on definition

patterns in section 4.2).


On the other hand, [Kil et al., 2005] worsened the ranking score of sentences that are too long or too short with respect to the allowance of one hundred non-whitespace characters

imposed by the TREC assessment (see section 1.7). Further, they also downrated the ranking of answer candidates which contain unrelated words including “say", “ask", “report", “If", “Unless", interrogatives, and subjective pronouns. Also, [Kil et al., 2005] attenuated the score of sentences that give the impression of being just an enumeration of nouns such as names and places. These misleading candidates were discerned by comparing the number of non-trivial words or proper nouns with the number of other sorts of words. In TREC 2005, exploiting negative evidence for ranking helped their strategy to accomplish an average F(3)-Score between 0.179 and 0.196, reaching the ninth place. The median of this track was 0.156, while the best run scored 0.248.

Following the same trend, [Schlaefer et al., 2006] discarded document surrogates that are part of enumerations of proper names. Specifically, they noticed that these enumerations are very likely to be a list of stock prices when the definiendum is an organisation, or a list of tracks whenever a song or a band is being defined. This filter benefited from the observation that all such words are part of a proper name and thus capitalised; they consequently discarded snippets in which more than half of the non-stop-words were capitalised. This ad-hoc filter, however, might be good for the TREC challenge, in which this kind of descriptive information is usually not included in the ground truth, and including it hence makes the output noisier and larger, causing a diminishment in the final F(3)-Score. This class of nugget, nevertheless, can still be useful for some users. In particular, a list of albums or tracks is commonly found in the biographical content (e.g., Wikipedia) of a musical artist. Furthermore, when the type of definiendum is person, [Schlaefer et al., 2006] eliminated sentences that contain both the first and the last name of the definiendum in two distinct named entities. This methodology has difficulties when dealing with long definienda, that is, the ones composed of several phrases. These kinds of definienda have been outlined by [Xu et al., 2005] (see section 3.4 on page 61). Like [Kil et al., 2005], they also reduced the rating score of sentences that do not carry the definiendum as a whole, because these sentences can be very noisy, but they can still supply interesting nuggets. In TREC 2006, [Schlaefer et al., 2006] obtained an average F(3)-Score between 0.143 and 0.150, capturing tenth place. The median for this track was 0.125, while the best run reaped an average F(3)-Score of 0.250.

Also, [Schlaefer et al., 2007] additionally enhanced their system by filtering out (a) statements of persons in indirect speech, and (b) direct speech formulations citing the statement of a person. They observed that the latter can still render descriptions, but in general, they are more likely to render opinions than facts. Their definition QA system capitalised on a refinement of the scoring methodology proposed by [Kaisser et al., 2006] (see details in section 4.12), achieving an average F(3)-Score between 0.156 and 0.189 in the TREC 2007 challenge. The median of this track was 0.118, while the best run finished with an average F(3)-Score of 0.329.
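A small sketch of the proper-name enumeration filter just described follows; the stop-word list and the examples are illustrative, and the original system operated on document surrogates rather than raw strings.

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "by", "is", "was"}

def looks_like_enumeration(snippet: str) -> bool:
    """Discard candidates in which more than half of the non-stop-words are capitalised."""
    content = [t for t in snippet.split() if t.lower() not in STOP_WORDS]
    if not content:
        return False
    capitalised = sum(1 for t in content if t[0].isupper())
    return capitalised > len(content) / 2

print(looks_like_enumeration("Thriller Bad Dangerous HIStory Invincible"))                # True
print(looks_like_enumeration("Michael Jackson was an American singer, songwriter and "
                             "dancer who became a global figure in popular culture."))    # False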

4.6 Triplets Redundancy

Fundamentally, the goal of [Roussinov et al., 2004, 2005] is to build a QA system based largely on redundancy. In the event of definition questions, [Roussinov et al., 2005] took advantage of the redundancy that is brought forth by multiple co-occurrences of the definiendum with named entities at the sentence level. They hypothesised that this redundancy could potentially capture the most interesting attributes of the definiendum. Mainly, they gather statistics about word triplet co-occurrences embodied in the top fifty documents retrieved from the target collection. Afterwards, they extract and rate text surrogates pertaining to the most frequent word triplets.

Firstly, their definition QA system discerns all nouns and verbs along with named entities across these top fifty fetched documents. As to named entities, they made allowances for the following classes: date, location, organisation, person, and time. They created a list comprising the ten most frequent nouns and verbs in addition to all detected named entities. Secondly, they preserved only those tuples of named entities in which at least one of the two elements was a substring of, or equal to, the entire definiendum (e.g., for “George Walker Bush," it was enough to contain “Bush."). For each definiendum, they acquired a list of triplets, where each triplet has two named entities (or frequent nouns) and a verb or a noun that appeared between both. For each triplet on the list, they counted the number of occurrences across the top fifty documents for the definiendum, and eventually, triplets with a frequency higher than one were selected. The table below presents sample triplets extracted for the targets “OPEC" and the “1998 Nagano Olympic Games".

Frequency Count   Triplet
21                price/[HFNN] - barrel/NN - OPEC/[ORGANIZATION]
11                OPEC/[ORGANIZATION] - world/NN - oil/[HFNN]
7                 City/[HFNN] - host/VB - Games/[HFNN]
5                 Nagano/[LOCATION] - get/VB - games/[HFNN]

Table 4.10: Sample of word triplets regarding “OPEC" and the “1998 Nagano Olympic Games" (source [Roussinov et al., 2005]). In these illustrative samples, HFNN signals a high-frequency noun.
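The sketch below gives one simplified way to gather such triplet statistics; the input format (sentences as lists of token/label pairs), the label inventory and the connector selection are assumptions made for illustration and do not reproduce the original pipeline.

from collections import Counter
from itertools import combinations

ANCHOR_LABELS = {"ORGANIZATION", "PERSON", "LOCATION", "DATE", "TIME", "HFNN"}

def count_triplets(annotated_sentences, definiendum: str) -> Counter:
    """Count (anchor, connector, anchor) triplets in which at least one anchor
    is a substring of (or equal to) the definiendum."""
    related = lambda tok: tok.lower() in definiendum.lower()
    triplets = Counter()
    for sentence in annotated_sentences:
        anchors = [(i, tok) for i, (tok, lab) in enumerate(sentence) if lab in ANCHOR_LABELS]
        for (i, a), (j, b) in combinations(anchors, 2):
            if not (related(a) or related(b)):
                continue
            for tok, lab in sentence[i + 1:j]:
                if lab in {"NN", "VB"}:      # nouns and verbs appearing between both anchors
                    triplets[(a, tok, b)] += 1
    return triplets

sent = [("OPEC", "ORGANIZATION"), ("sets", "VB"), ("the", "DT"), ("world", "NN"),
        ("oil", "HFNN"), ("price", "HFNN"), ("per", "IN"), ("barrel", "NN")]
print(count_triplets([sent], "OPEC").most_common(2))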

On the whole, [Roussinov et al., 2005] theorised that this technique achieved a high recall because critical information is typically repeated across the collection of documents. Overall, their system accomplished an average F(3)-Score of 0.171, which was above the median (0.156) of the TREC 2005 track. It is worth mentioning that the best run reached an average F(3)-Score of 0.248 [Voorhees and Dang, 2005]. Given this result, one can conclude that redundancy is a key and promising feature for definition question answering. Contrarily, [Roussinov et al., 2005] observed that their technique is very likely to pick text snippets embracing similar descriptions (normally paraphrases), instigating an increase in redundancy in the output to the user. For instance, their system singled out two

sentences conveying essentially the same description of the definiendum.

4.7 Combining Positive and Negative Evidence

An interesting facet of the approach introduced by [Yang et al., 2003] is that their definition QA system takes sentences from documents related to the definiendum, and clusters them afterwards into two categories: positive and negative. Members of the former encompassed sentences carrying any part of the definiendum and their context, specifically one preceding and one succeeding sentence, whereas members of the latter are the remaining sentences within the document. They thereafter discriminate new sentences on the grounds of these two groups.

One fundamental aspect to bear in mind about this classification method is its underlying assumption of relevance. On the one hand, this technique interprets only sentences that are close to the definiendum as members of the positive group. This can work when tackling

news documents, because news articles are normally not biographical in their entirety5, and central descriptions are more likely to be expressed in contexts that include the definiendum (see also the discussion in section 4.2). However, these localised contexts do not necessarily need to supply descriptive content, and they can consequently cause noise in the positive class. On the other hand, the negative set can still have definitions, because some sentences can still elucidate interesting facets of the definiendum by means of closely related events or entities. A textbook case is the racing driver “Jim Clark", who died in a car accident: the description of this accident can delineate some critical facts about his death (e.g., death date, death place). Furthermore, depending on the learning strategy, the selection procedure of the negative category can also cause noise, because descriptions about other concepts might be subsumed in this set, namely concepts of the same type (e.g., persons, locations and organisations) whose descriptions could possibly share similarities as they are involved in the same context (e.g., “President" and “Former President" or “Vice President").

5In fact, some news articles can be biographical; in particular, when a famous person passes away.

Another aspect to keep in mind is the language dependency of this criterion. Some languages, including Spanish, are less likely to explicitly convey the definiendum or a co-reference, making it harder to make this categorisation. On the whole, under the principle of large numbers, this method supports capturing the most crucial aspects of the definiendum, while being heavily reliant on the quality and the coverage of the recognised localised contexts surrounding the definiendum.

For the purpose of scoring sentences, [Yang et al., 2003] merged evidence extracted from the Internet and the target collection (i.e., the AQUAINT corpus). A candidate sentence Ai is ranked in a TF-IDF-like fashion in sympathy with the relevance of its words. The relevance of a word w is measured by counting the number of sentences containing the word in both classes. On this account, they deemed words as more essential when they have a high frequency in the positive examples while rarely occurring in the negative training material. The following is the formula utilised by [Yang et al., 2003]:

$$RankCorpus(A_i) = \sum_{w \in A_i} \log\left(1 + SentenceCountPos(w)\right) \cdot \log\left(1 + \frac{NumberOfNegativeSentences}{SentenceCountNeg(w)}\right)$$

An aspect that makes this function less attractive is its reliance on the coverage and the quality of the contexts supplied by the corpus from which it is calculated. In many cases, the corpus offers limited coverage to draw accurate inferences about the pertinence of each particular word, especially when the definition QA system is attempting to distinguish descriptions across the same corpus where these estimates are deduced. For this reason, in order to tackle the sparseness of the AQUAINT corpus, [Yang et al., 2003] sifted through additional evidence from the Internet. As a means of downloading documents related to the definiendum, they submitted queries composed of terms within sentences belonging to the positive set. These terms were chosen in concert with the next equation:

$$WeightExpansion(w) = \frac{SentenceCount(w)}{NumberOfSentences} \cdot \log\left(1 + \frac{Correlation(definiendum, w)}{Freq(w) + Freq(definiendum)}\right)$$

The first factor implies the relevance of the word w to the positive class, in other words, how representative of this group it is, and the second factor signifies the strength of the semantic relationship between w and the definiendum in congruence with the positive examples. These two criteria serve as predictors of descriptive content about the definiendum, and they are, therefore, applied for ameliorating the recall of definitions within the web snippets returned by the search engine. A candidate sentence is then weighed in consonance with the evidence discovered on the Internet as follows:

$$RankWeb(A_i) = \sum_{w \in A_i} \log\left(1 + WebSentenceCount(w)\right) \cdot \log\left(1 + \frac{NumberOfPositiveSentences}{SentenceCountPos(w)}\right)$$

Eventually, candidate answers emanating from the AQUAINT corpus are scored by linearly fusing both rank values:

$$Score(A_i) = \lambda \cdot RankCorpus(A_i) + (1 - \lambda) \cdot RankWeb(A_i) \qquad (4.7)$$

In this rating function, λ is an empirical factor that balances the lexical knowledge discovered on the Internet and the evidence found in the target corpus. In this procedure, the selection strategy of the query expansion terms plays an essential role in focusing the retrieval on the right sense of the definiendum, which also highlights its sensitivity to the coverage given by the corpus and the Web.

Later, higher-scoring sentences are ranked at the top, and in order to reduce redundancy, a variation of the Maximal Marginal Relevance [Carbonell and Goldstein, 1998, Goldstein et al., 2000] was employed to select a subset of the sentences on this list. Slightly different configurations of their systems achieved average F(5)-Score values from 0.432 to 0.473 in the TREC 2003 challenge. These differences mainly reflect slight variations in the summarisation technique. This definition QA system finished with the second best run in this challenge (the best system scored an average F(5)-Score value of 0.555) [Voorhees, 2003].
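A condensed sketch of this corpus/web fusion is given below; the count dictionaries, the guard against zero counts and the value of λ are illustrative assumptions, since the thesis only reports that λ is an empirical factor.

import math

def rank_corpus(words, pos_count, neg_count, n_neg_sentences):
    """Reward words frequent in positive sentences and rare in negative ones."""
    return sum(math.log(1 + pos_count.get(w, 0)) *
               math.log(1 + n_neg_sentences / max(neg_count.get(w, 0), 1))  # guard: assume >= 1
               for w in words)

def rank_web(words, web_count, pos_count, n_pos_sentences):
    """Reward words frequent in the web snippets and representative of the positive set."""
    return sum(math.log(1 + web_count.get(w, 0)) *
               math.log(1 + n_pos_sentences / max(pos_count.get(w, 0), 1))  # guard: assume >= 1
               for w in words)

def score(words, pos, neg, web, n_pos, n_neg, lam=0.6):
    """Equation 4.7: linear fusion of corpus and web evidence."""
    return lam * rank_corpus(words, pos, neg, n_neg) + (1 - lam) * rank_web(words, web, pos, n_pos)

candidate = ["driver", "champion", "scotland"]
print(score(candidate, pos={"driver": 12, "champion": 7}, neg={"driver": 1},
            web={"driver": 40, "champion": 15, "scotland": 3}, n_pos=60, n_neg=500))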

4.8 Centroid Vector

In TREC 2004, [Cui et al., 2004b] also amalgamated evidence taken from the Web and the AQUAINT corpus. From the latter, they fetched the top 800 documents retrieved from the collection by querying the definiendum, and from the former, they took descriptive content from six KBs by means of pre-defined wrappers. Table 4.11 details the coverage provided by each of these KBs for the 64/65 TREC 2004 definition questions (one question was dropped).

Resource Name        Topic Coverage (65 max.)
biography.com        19
S9                   15
Wikipedia            63
bartleby.com         37
Google glossaries    25
WordNet glossaries   13

Table 4.11: Web KBs and their respective TREC 2004 coverage (source [Cui et al., 2004b]).

As for ranking, [Cui et al., 2004b] exploited the centroid vector (see also [Chen et al., 2006]), previously used by [Xu et al., 2003]. This centroid vector is usually made up of words selected in agreement with the mutual information measure. More precisely, equation 4.8 shows how this weight is specified for a word w:

$$Weight_{Centroid}(w) = \frac{\log(Correlation(w, definiendum) + 1) \cdot IDF(w)}{\log(SentenceCount(w) + 1) + \log(SentenceCount(definiendum) + 1)} \qquad (4.8)$$

In the above formula, IDF stands for Inverse Document Frequency, and [Cui et al., 2004a,b] made use of the statistics produced by the Web Term Document Frequency and Rank site6 to approximate the IDF values. Posteriorly, words surpassing the average weight score plus a standard deviation are selected as centroid words. This strategy additionally augments the weight of words that also appear within the definitions found across KBs (see for instance section 4.9). The resulting centroid vector is eventually employed for rating candidate sentences by computing the cosine similarity to this vector. The underlying idea behind this similarity metric is modelling the context of the definiendum predicated on the Distributional Hypothesis [Harris, 1954, Firth, 1957]. This means they rank putative answers in conformity with the degree to which their respective words characterise the definiendum, where the degree of characterisation of each term is grounded on the mutual information measure. Indeed, the augmentation of the weights corresponding to centroid words that overlap definitions across KBs tends to enhance the performance, because these overlapping words tend to yield a better characterisation and, by the same token, to boost the likelihood of recognising descriptions in the target corpus that are very likely to define the most pertinent facets of the definiendum. It should not be forgotten, however, that this augmentation approach relies largely on the procedure applied for finding the right articles in these six KBs, and on the coverage supplied by these six authoritative resources (see table 4.11). In a special manner, it counts on the fact that there is no great disparity between the senses in the target collection and the senses encountered in the KBs.

At any rate, taking into consideration only metrics grounded solely on word correlations for rating candidate sentences does not ensure pinpoint accuracy. Some misleading and spurious candidate sentences can still obtain a high ranking score, thus being included in the final output. Let us consider the following illustrative case regarding “British Prime Minister

Gordon Brown”:

Even though this sentence carries words such as “British", “Prime" and “Minister", which are very likely to be found across KB articles regarding this definiendum, it does not convey descriptive information about “Gordon Brown". Ergo, [Cui et al., 2004b] capitalised on definition patterns for filtering out some unreliable hits. They made use of two kinds of rules: hard and soft definition patterns (explained in sections 4.2 and 4.9, respectively) [Cui et al., 2004a]. Their array of hard patterns consisted of manual rules that are well-known in definition QA systems, such as copulas and appositives (see lists of these patterns in sections 3.4 and 4.2). This hard matching aids in detecting definitions that are missed by their soft matching strategy or centroid vector. Their rating strategy reached average F(3)-Scores between 0.379 and 0.460 in the TREC 2004 challenge, where the variation in values depends on the pre-determined length allowance of the final output [Cui et al., 2004b]. This definition QA system culminated with the best run in this track and all its responses were amongst the best systems [Voorhees, 2004].
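The following bare-bones sketch shows how centroid weights in the spirit of equation 4.8 can be turned into a ranking by cosine similarity; the co-occurrence statistics, IDF values and selection threshold are assumed to be computed elsewhere, and the numbers used here are purely illustrative.

import math
from collections import Counter

def centroid_weight(correlation: float, idf: float, sc_word: int, sc_definiendum: int) -> float:
    """Equation 4.8: mutual-information-style weight of a candidate centroid word."""
    return (math.log(correlation + 1) * idf) / (math.log(sc_word + 1) + math.log(sc_definiendum + 1))

def cosine_to_centroid(centroid: dict, sentence_tokens) -> float:
    """Cosine similarity between the centroid vector and a candidate sentence."""
    cand = Counter(t for t in sentence_tokens if t in centroid)
    dot = sum(centroid[t] * c for t, c in cand.items())
    na = math.sqrt(sum(v * v for v in centroid.values()))
    nb = math.sqrt(sum(c * c for c in cand.values()))
    return dot / (na * nb) if na and nb else 0.0

centroid = {"politician": centroid_weight(35, 2.1, 120, 300),
            "minister": centroid_weight(50, 2.8, 90, 300)}
print(cosine_to_centroid(centroid, "gordon brown is a british politician and prime minister".split()))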

6 http://elib.cs.berkeley.edu/docfreq/

4.9 Soft Pattern Matching

The technique presented in the previous section integrates the application of two distinct kinds of rule matching approaches: hard and soft. According to [Cui et al., 2007], hard patterns, that is, manually constructed lexico-syntactic rules, are too rigid to cover all plausible ways of conveying descriptive content, since sentences exhibit a variety of lexico-syntactic clauses when delineating concepts, in particular regularities bearing the same meaning. That is to say, the implementation of definition QA systems predicated on hard patterns normally requires the manual codification of each lexico-syntactic rule. In all respects, this is an undesirable task, as it inherently demands considerable and sustained human effort when collecting, coding and extending the array of clues.

Under the observation that strict rules fail to match definition sentences due to some inserted and/or deleted tokens such as adverbs or adjectives, soft patterns treat definition lexico-syntactic constructs as sequences of lexical and syntactic tokens [Cui et al., 2007]. Simply stated, pattern matching can therefore be conceived as the probabilistic generation of these test sequences premised on training sequences. According to [Cui et al., 2004a, 2005, 2007], soft patterns outperform hard patterns because they model these types of language variations probabilistically. Still, [Cui et al., 2004b] also realised that soft patterns can miss some definitions detected by aligning hard patterns.

In praxis, [Cui et al., 2004a] abstracted or generalised local contextual regularities from a set of training sentences bearing the definiendum. This rule induction process accounts for contexts enriched with POS tags and chunking information. The initial step of this process consists in overwriting some noun phrases and syntactic categories with placeholders. Table 4.12 details these selective replacements. This selective substitution increases the chance of matching new sentences, generating representative patterns and countering overfitting [Cui et al., 2007].

Token                            Substitution
Any word in the definiendum      <definiendum>
Centroid words                   Syntactic class
Noun phrases                     NP
Adjectives and adverbs           deleted
is, am, are, was, were           BE$
a, an, the                       DT$
all numeric values               CD$

Table 4.12: Selective substitutions (source [Cui et al., 2004a]).

Likewise, [Cui et al., 2004a, 2007] recognised chunks in the training sentences, and afterwards replaced noun phrases with a placeholder. This specific overwriting is aimed essentially at similar scenarios that usually do not share the same instance of a noun phrase. Another fundamental aspect of this substitution strategy is replacing sequences of the same placeholder with a single instance.

The subsequent step is then deducing the soft patterns. For this purpose, contexts are modelled as windows of 2L + 1 words centred on the placeholder corresponding to the definiendum. These text fragments are aligned and counted from the array of training sentences. In [Cui et al., 2004a, 2007], the value of L was set to two, that is, these training windows encompassed two terms to the left and to the right of the definiendum. The outcome of this process is a vector representing soft definition patterns. In this vector representation, a

pattern Pa is denoted as follows:

\[ P_a : \langle slot_{-L}, \ldots, slot_{-1}, \langle definiendum \rangle, slot_{1}, \ldots, slot_{L} \rangle \]

where each slot_l is a vector of pairs of tokens and their respective probabilities:

\[ slot_l : \langle (token_{l1}, weight_{l1}), \ldots, (token_{lm}, weight_{lm}) \rangle \]

As to the stipulation of a token, [Cui et al., 2004a] interpreted any word, punctuation mark or syntactic tag in a slot as a token (see table 4.12). Thus, the weights weight_lm are stipulated as the conditional probability of a token occurring in a slot:

\[ weight_{lm} = Pr(token_{lm} \mid slot_l) = \frac{Freq(token_{lm})}{\sum_{s=1}^{m} Freq(token_{ls})} \]

As a means to neatly illustrate this method, let us consider the following six definitions:

1.

2.

3.

4.

5.

6.

Appropriately, the next six pieces of text underline the five-word contextual fragments along with their respective selective substitutions obtained from the working example, in which it is assumed that replacing noun phrases is preferred to overwriting cardinals. For the sake of simplicity, centroid words were also omitted in the examples. Some dates and the lemmas of words, including wrote and/or awarded, could eventually be part of a plausible

centroid vector7.

1.

2.

3.

4.

5.

7To the best of our knowledge, the order or hierarchy of these substitutions is unclear. A noun phrase, for instance, can consist solely of a year, which can be seen as NP$ or CD$. To exemplify, take the case of “1859” in the next definition: “A Tale of Two Cities (1859) is the second historical novel by Charles Dickens.” Different orderings would bring about distinct abstractions, and thus diverse probability models.


6.

Definitely, the identification of tokens such as “c.1634” depends on the tokeniser utilised for this task. Ergo, the four slot vectors slot_l are as follows:

• slot_{-2}: <(NP, 5/6), (known, 1/6)>
• slot_{-1}: <(",", 5/6), (as, 1/6)>
• slot_{+1}: <(BE$, 4/6), (wrote, 1/6), (";", 1/6)>
• slot_{+2}: <(DT$, 4/6), (awarded, 1/6), (c.1634, 1/6)>
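The construction of these slot vectors can be sketched as follows, assuming that tokenisation, chunking and the selective substitutions of table 4.12 have already been applied, so that each training instance is a list of 2L+1 tokens with a <DEF> placeholder in the centre; this representation and all names are illustrative, not the original implementation of [Cui et al., 2004a].

from collections import Counter, defaultdict

L = 2  # two slots on each side of the definiendum placeholder

def build_slot_vectors(training_windows):
    """training_windows: lists of 2L+1 tokens after selective substitution,
    with the placeholder <DEF> at the centre position."""
    counts = defaultdict(Counter)                 # slot offset -> token -> frequency
    for window in training_windows:
        for offset in range(-L, L + 1):
            if offset == 0:
                continue                          # centre = the definiendum itself
            counts[offset][window[L + offset]] += 1
    slot_vectors = {}
    for offset, counter in counts.items():
        total = sum(counter.values())
        slot_vectors[offset] = {tok: freq / total for tok, freq in counter.items()}
    return slot_vectors

# Toy usage with already-substituted five-token windows:
windows = [
    ["NP", ",", "<DEF>", "BE$", "DT$"],
    ["NP", ",", "<DEF>", "BE$", "DT$"],
    ["NP", "as", "<DEF>", "wrote", "DT$"],
]
print(build_slot_vectors(windows)[1])   # slot +1: 2/3 for 'BE$', 1/3 for 'wrote'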

It is remarkable that [Cui et al., 2004a] used an empirical factor (0.1) to cushion the bias in favour of syntactic classes and punctuation. Then, the likeness of a candidate sentence Ai to the soft pattern vector Pa is estimated in two steps. In the first place, they calculate the similarity assuming that all slots are independent:

\[ Pr(A_i \mid P_a) = \prod_{l=-L}^{L} Pr(token_{lm} \mid slot_l) \]

According to [Cui et al., 2004a], one advantage of using this Naïve Bayes rule is that the score can still be computed even when some slots are unmatched. Here are some illustrative examples (A1 and A2) of ranked sentences in accordance with the working models:

1.

This test sentence provides one soft pattern instance, matching BE$ in slot_{+1} and DT$ in slot_{+2}:

\[ Pr(A_1 \mid P_a) = \tfrac{4}{6} \cdot \tfrac{4}{6} = \tfrac{16}{36} \approx 0.44 \]

2.

This test sentence yields the next two soft pattern instances: “NP , <definiendum> joined NP”, giving Pr(A_2 | P_a) = 5/6 · 5/6 = 25/36 ≈ 0.69, and “; , <definiendum> BE$ DT$”, giving Pr(A_2 | P_a) = 5/6 · 4/6 · 4/6 = 80/216 ≈ 0.37.

In the second place, in order to filter out unlikely sequences of tokens, and consequently to ameliorate precision, [Cui et al., 2004a] modelled how likely a sequence of tokens occurs in congruence with the underlying soft pattern. This model comprises probabilities independently specified for the right and left contexts of the definiendum. The right context model is given by:

\[ Pr(seq_{right} \mid P_a) = P(token_1) \cdot P(token_2 \mid token_1) \cdots P(token_L \mid token_{L-1}) \]

where P(token_l | token_{l-1}) is the ratio of the frequency of the bigram <token_{l-1} token_l> to the frequency of the unigram token_{l-1}. The probabilities of the left sequences are calculated analogously. The unigram probabilities P(token_1) and P(token_{-1}) are worked out by counting the occurrences of the token to the right and to the left of the definiendum, respectively. Both contexts are weighed as follows:

\[ Weight_{PaSeq}(A_i, P_a) = 0.3 \cdot Pr(seq_{left} \mid P_a) + 0.7 \cdot Pr(seq_{right} \mid P_a) \]

The weight 0.7 empirically indicates that the right context is more influential than the left context. This finding, however, might be specific to the English language. Eventually, the weight of a pattern P_a is given by:

\[ Weight_{pattern}(A_i, P_a) = \frac{Pr(A_i \mid P_a) \cdot Weight_{PaSeq}(A_i, P_a)}{fragment\_length} \]

In the formula above, fragment_length is a normalising factor, and these pattern weights are inferred from an array of training sentences (cf. [Cui et al., 2004a]). The final score of a candidate sentence A_i balances both its centroid-based and its soft pattern matching weights:

\[ Score(A_i, P_a) = (1 - \delta) \cdot Weight_{centroid}(A_i) + \delta \cdot Weight_{pattern}(A_i, P_a) \]

The experimental parameter δ favours either the centroid or the soft pattern weight. In their experiments, [Cui et al., 2004a] set the value of this parameter to 0.6, thereby biasing the score in favour of pattern rules. Their outcome showed that the amalgamation of statistics and soft patterns is much more effective than using only methods based on word statistics (e.g., word correlations). As a natural consequence, a combination of semantic and syntactic information is required to properly score candidate definitions. Further, the empirical value of δ underlines the pertinence of syntactic structures when ranking.
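Putting the pieces together, a minimal sketch of the overall soft-pattern score — the slot-wise Naïve Bayes product, the weighted left/right sequence probabilities and the final interpolation with the centroid weight — could look as follows; all probability tables are assumed to be precomputed, and the function and parameter names are illustrative rather than taken from [Cui et al., 2004a].

def slot_match_probability(test_window, slots):
    """Naive Bayes product over the matched slots; unmatched slots are simply
    skipped, so partial matches still receive a score."""
    prob = 1.0
    for offset, token in test_window.items():        # offset -> token, offset != 0
        p = slots.get(offset, {}).get(token, 0.0)
        if p > 0.0:
            prob *= p
    return prob

def sequence_probability(tokens, unigram, bigram):
    """One-sided sequence probability: P(t1) * prod P(ti | ti-1)."""
    if not tokens:
        return 1.0
    prob = unigram.get(tokens[0], 0.0)
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= bigram.get((prev, cur), 0.0)
    return prob

def pattern_weight(test_window, left_seq, right_seq, slots,
                   uni_left, bi_left, uni_right, bi_right, fragment_length):
    """Combine the slot product with the weighted left/right sequence models."""
    seq_weight = (0.3 * sequence_probability(left_seq, uni_left, bi_left)
                  + 0.7 * sequence_probability(right_seq, uni_right, bi_right))
    return slot_match_probability(test_window, slots) * seq_weight / fragment_length

def final_score(centroid_weight, pattern_score, delta=0.6):
    """Interpolation of the centroid-based and soft-pattern weights."""
    return (1 - delta) * centroid_weight + delta * pattern_score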

Using Knowledge Bases to enhance the ranking score

Since semantic evidence plays a pivotal role in rating answer candidates, especially the role of correlated words that typify the definiendum, [Cui et al., 2004c] enriched the centroid vector with information emanating from KBs such as WordNet, Wikipedia and biography.com (see a comparison amongst the resources most amply exploited by some definition QA systems in table 2.4 on page 31). In short, the idea behind this enrichment is boosting the rating of putative answers whenever they match the contexts yielded by these three authoritative resources. However, these three KBs occasionally supply narrow coverage or no coverage at all, and the contribution of these three resources must therefore be weighted against the evidence of other documents. In [Cui et al., 2004c], the augmentation of the weights of words embodied in texts originating from these three KBs is given by:

\[
Weight'_{Centroid}(w) =
\begin{cases}
Weight_{Centroid}(w) \cdot (1 + \log(1 + SF(w))) & \text{if } w \text{ occurs in web snippets;} \\
Weight_{Centroid}(w) \cdot (1 + \gamma) & \text{if } w \text{ occurs in KBs.}
\end{cases}
\]

It is worth recalling here that these centroid words are identified across the candidate sentences, which are distilled from the target collection, namely the AQUAINT corpus, and Weight_Centroid(w) denotes centroid weights in sympathy with equation 4.8. In this formula, SF(w) gives the number of snippets that contain the word w, while γ is a constant factor. As a means of setting the value of this constant, [Cui et al., 2004c] tried values from 0.2 to 1.0 in order to optimise the performance of their system, and its value was eventually set to 0.6 on the grounds of these preliminary experiments. As a finding, [Cui et al., 2004c] observed that the centroid vector enhances its recall as a result of adding the information coming from the KBs; conversely, the use of web snippets reflected only a minute improvement.
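A minimal sketch of this augmentation rule follows, under the assumption that the caller already knows whether a centroid word was found in one of the KB articles and in how many web snippets it occurs; the precedence between the two cases when a word appears in both sources is not specified in [Cui et al., 2004c] and is an assumption here.

import math

def augment_centroid_weight(base_weight, in_kb, snippet_frequency, gamma=0.6):
    """Boost a centroid weight with KB and web-snippet evidence; the KB case is
    checked first here, since the precedence for words occurring in both
    sources is not specified in the original description."""
    if in_kb:
        return base_weight * (1 + gamma)
    if snippet_frequency > 0:
        return base_weight * (1 + math.log(1 + snippet_frequency))
    return base_weight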

Using syntactic information to enhance the ranking score

As a means to intermix more syntax with their soft pattern matching methodology, [Cui et al., 2005, 2007] took advantage of bigram Language Models (LM) for representing pattern instances. To be more specific, they made use of the linear interpolation of unigrams and bigrams for modelling the likelihood of bigrams. According to [Cui et al., 2007], the reason for this is two-fold: (1) to smooth probability distributions in order to generate more accurate statistics for unseen data, and (2) to incorporate the conditional probability of individual tokens appearing in specific slots. In this method, unigrams and bigrams are interpolated as follows:

\[ P(t_1 \ldots t_L) = P(t_1 \mid S_1) \cdot \prod_{i=2}^{L} \big( \lambda \cdot P(t_i \mid t_{i-1}) + (1 - \lambda) \cdot P(t_i \mid S_i) \big) \]

In this formula, P (ti | Si) stands for the conditional probability of token ti occupying the slot Si, and λ is a mixture weight that integrates both models: unigrams and bigrams. These models utilise conditional probabilities of unigrams located in each particular slot to represent unigram probabilities, due to the fact that [Cui et al., 2007] stipulated that token positions are instrumental. For instance, a comma always appears in the first slot right of the

target in an appositive expression [Cui et al., 2007]:

The incorporation of individual slot probabilities assists the bigram model in allowing partial matching, which is a characteristic of soft pattern matching [Cui et al., 2007]. Essentially, even when only some slots are matched, the bigram model can still yield a high matching score by merging the probabilities corresponding to the matching unigram slots. Since test instances frequently differ in length, the log-likelihood of P(t_1 ... t_L) is normalised by the number of tokens l of the respective test instance:

\[ P_{norm}(t_1 \ldots t_L) = \frac{\log(P(t_1 \mid S_1))}{l} + \frac{1}{l} \sum_{i=2}^{L} \log\big( \lambda \cdot P(t_i \mid t_{i-1}) + (1 - \lambda) \cdot P(t_i \mid S_i) \big) \]
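A sketch of this normalised, interpolated score, assuming the slot-conditional unigram probabilities and the bigram probabilities are available as lookup tables; the small probability floor is an assumption introduced here only to keep the logarithm defined for unseen events.

import math

FLOOR = 1e-12   # assumption: keeps the logarithm defined for unseen events

def normalised_bigram_score(tokens, slot_unigram, bigram, lam=0.3):
    """tokens: test instance t1..tL aligned with slots S1..SL;
    slot_unigram[(i, t)] = P(t | S_i); bigram[(prev, cur)] = P(cur | prev)."""
    l = len(tokens)
    score = math.log(max(slot_unigram.get((1, tokens[0]), 0.0), FLOOR)) / l
    for i in range(2, l + 1):
        uni = slot_unigram.get((i, tokens[i - 1]), 0.0)
        bi = bigram.get((tokens[i - 2], tokens[i - 1]), 0.0)
        score += math.log(max(lam * bi + (1 - lam) * uni, FLOOR)) / l
    return score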

Subsequently, [Cui et al., 2005, 2007] approximated unigram and bigram probabilities by their maximum likelihood (ML) estimates:

\[ P_{ML}(t_i \mid S_i) = \frac{|t_i(S_i)|}{\sum_{k} |t_k(S_i)|} \qquad P_{ML}(t_i \mid t_{i-1}) = \frac{|t_i(S_i)\, t_{i-1}(S_{i-1})|}{|t_{i-1}(S_{i-1})|} \]

where t_i(S_i) denotes that token t_i appears in slot S_i, and |t| coincides with the frequency of the token t. Given that considering token counts with respect to slot positions increases data sparseness, [Cui et al., 2005, 2007] made use of Laplace smoothing on the unigram probabilities:

\[ P(t_i \mid S_i) = \frac{|t_i(S_i)| + \delta}{\sum_{k} |t_k(S_i)| + \delta \cdot |N(t)|} \]

In this smoothing formula, |N(t)| coheres with the total number of unique tokens in the training data, and δ = 2 is a smoothing constant. Due to the fact that the tags normally applied in selective substitutions have a considerably higher frequency in comparison to individual lexical items, [Cui et al., 2007] made allowances for the frequencies of words and of general syntactic tags separately. When both are put together, the distribution would be strongly biased towards the syntactic categories, causing distortions of the word distributions. All things considered, the unigram probabilities are consequently estimated within each set separately. Next, [Cui et al., 2005, 2007] worked out the optimal value of λ = 0.3 by means of the Expectation Maximisation (EM) algorithm [Dempster et al., 1977]:

1. Initialise λ to a random value between 0 and 1, e.g., 0.5.

2. Update λ using:

\[ \lambda' = \frac{1}{|INS|} \sum_{j=1}^{|INS|} \frac{1}{l_j - 1} \sum_{i=2}^{l_j} \frac{\lambda \cdot P(t_i^{(j)} \mid t_{i-1}^{(j)})}{\lambda \cdot P(t_i^{(j)} \mid t_{i-1}^{(j)}) + (1 - \lambda) \cdot P(t_i^{(j)} \mid S_i^{(j)})} \tag{4.9} \]

3. Repeat step 2 until the algorithm converges.

In this equation, |INS| is the number of sentences subsumed in the training/development set.
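The EM update of equation 4.9 can be sketched as follows; each training instance is assumed to be represented by its per-position bigram and slot-unigram probabilities, and a fixed number of iterations stands in for the convergence test of the original procedure (illustrative representation).

def em_lambda(instances, lam=0.5, iterations=20):
    """instances: one list per sentence containing, for positions i = 2..l_j,
    the pair (P(t_i | t_{i-1}), P(t_i | S_i))."""
    for _ in range(iterations):
        total = 0.0
        for sent in instances:
            l_j = len(sent) + 1                      # positions 2..l_j contribute
            inner = 0.0
            for p_bigram, p_slot in sent:
                mix = lam * p_bigram + (1 - lam) * p_slot
                if mix > 0.0:
                    inner += (lam * p_bigram) / mix
            total += inner / (l_j - 1)
        lam = total / len(instances)                 # equation 4.9
    return lam

# Toy usage: two sentences with three scored positions each.
toy = [[(0.4, 0.2), (0.1, 0.3), (0.5, 0.5)],
       [(0.2, 0.2), (0.3, 0.1), (0.6, 0.4)]]
print(round(em_lambda(toy), 3))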

Making the syntactic matching more flexible

According to [Cui et al., 2005], the previous approach falls short of coping with gaps. Consider, for instance, the following training sentence outlining “Akira Sakamoto":

In effect, the construct “<definiendum> is known for" is more likely to be found than some of its derivations. More often than not, there are numerous plausible derivations of a particular regularity. Table 4.13 compares the frequencies provided by Google n-grams for some of the most frequent cases pertaining to “is known for". In substance, [Cui et al., 2005, 2007] noticed that the training material usually does not offer enough coverage to develop systems that can potentially match all possible derivations, in particular those originating from inserted and deleted tokens. It is important for definition QA systems to have the flexibility to recognise these sorts of variations, so that they can distinguish more potential answers, and thus present a more trustworthy and diverse output to the user. For this reason, [Cui et al., 2005, 2007] coped with this by means of Profile Hidden Markov Models (PHMM), which make it possible to account for deletions and insertions while matching.

In their strategy, PHMMs comprise a sequence of L match states M_i, each of them related to a slot in pattern instances. Each state M_i can emit a token t with a probability P(t | M_i). These tokens belong to the set of tokens within the training sentences. Each match state is additionally enriched with a deletion state D_i utilised for skipping the respective match state. Insertion states take place after a match or a deletion state, and they allow self-loops, so that multiple insertions can occur. The probability of a sequence of tokens t_1, ..., t_N generated by moving through the states S_0 (start state), ..., S_{L+1} (end state) is as follows:

\[ P(t_1, \ldots, t_N \mid S_0, \ldots, S_{L+1}, \mu) = P(S_{L+1} \mid S_L) \prod_{i=1}^{L} P(t_{n(i)} \mid S_i) \, P(S_i \mid S_{i-1}) \tag{4.10} \]

In this formula, µ stands for the model, and P(t_{n(i)} | S_i) is equalised to one for all deletion states.

Pattern derivation                  Frequency
is known for                        1018523
is especially known for             5614
is additionally known for           42
is internationally known for        13406
is also known for                   73843
is mainly known for                 3164
is best known for                   300966
is more known for                   2185
is better known for                 15303
is most known for                   8583
is chiefly known for                1354
is mostly known for                 5319
is nationally known for             5576
is wellknown for its                260
is particularly known for           5849
is widely known for                 17879
is primarily known for              6549
is especially well known for        2229
is well known for                   236287
is most well known for              6427
is particularly well known for      3641
is very well known for              4039

Table 4.13: Some derivations of the pattern “is known for".

For the purpose of ranking candidate sentences, [Cui et al., 2005, 2007] selected the most probable state path via equation 4.10 coupled with the Viterbi algorithm. In this way, they approximate the likelihood of the sequence given all possible state paths. Incidentally, the probability of the observed sequence P(t_1, ..., t_N | S_0, ..., S_{L+1}, µ) was estimated by the forward-backward algorithm [Manning and Schütze, 1999].

With respect to the transition and emission probabilities, these were approximated by means of the standard Baum-Welch algorithm [Manning and Schütze, 1999]. As an exception, the calculation of the sequence probability conforms to the path with the highest probability determined by the Viterbi procedure during the re-estimation process, and not to all possible paths as specified in the traditional Baum-Welch algorithm.

In PHMMs, probabilities can be automatically induced by making use of an iterative EM algorithm. This procedure can start with random or uniform likelihood estimates. However, the re-estimation process only guarantees local convergence, which in some cases can coincide with the global optimum. According to [Cui et al., 2007], definition patterns are diverse and sparse in terms of both lexical tokens and POS tags. Therefore, initialising the EM procedure with random or uniform approximations can bring about a suboptimal model that is unable to discriminate between different sequences. For this reason, together with the fact that [Cui et al., 2007] accounted for a small training set, they worked on the assumption that the most probable state path for a sequence should go through as many match states as possible.

Furthermore, insertion and deletion states add flexibility, but they can adversely impact the generalisation of the underlying definition patterns whenever they obtain a high probability. Consequently, [Cui et al., 2007] smoothed the emission probabilities for each match and insertion state through the maximum likelihood estimate of the emission probabilities. The initial state transition probability P(t | I_i) for a state was arbitrarily set to 1/n, where n is the number of transition links that lead from the state. This way, the likelihood of emitting a token from matching states is always higher than from insertion states.
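For a fixed state path, equation 4.10 reduces to a product of transition and emission probabilities. The following sketch scores a token sequence along one given path; the Viterbi search over paths and the Baum-Welch re-estimation are omitted, and the state-naming convention (deletion states prefixed with "D", end state "END") is an assumption of this illustration.

def phmm_path_probability(tokens, path, emission, transition):
    """Score a token sequence along one fixed state path (equation 4.10).
    path: state identifiers from the start state to the end state 'END';
    deletion states (prefixed 'D') and the end state emit nothing, and their
    emission probability is taken as one; emission[(state, token)] and
    transition[(prev, state)] are precomputed lookup tables."""
    prob = 1.0
    token_iter = iter(tokens)
    for prev, state in zip(path, path[1:]):
        prob *= transition.get((prev, state), 0.0)
        if state == "END" or state.startswith("D"):
            continue                            # non-emitting transition
        token = next(token_iter, None)
        if token is None:
            return 0.0                          # path expects more emissions
        prob *= emission.get((state, token), 0.0)
    if next(token_iter, None) is not None:
        return 0.0                              # unconsumed tokens left over
    return prob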

PHMM versus the Bigram Soft Pattern Model

Since both models conceive definition patterns as token sequences, there is an intrinsic relationship between them. More precisely, [Cui et al., 2007] noticed that the Bigram Soft Pattern Model can be seen as a PHMM with one state per token. The difference between them lies in the fact that the bigram model takes into consideration unigram probabilities, while PHMMs make use of emission likelihood estimates for representing the independent probability of a particular token occupying each particular position.

Further, [Cui et al., 2007] also pointed out that PHMMs are more robust in terms of model settings, but they need more training data, and have a more complex topology that aggregates token sequence probabilities into state transition probabilities. Furthermore, in linguistic terms, both models make allowance for some shallow syntactic regularities, namely the sequential order of tokens. Since the position of these tokens is usually close to the definiendum, it can be alleged that they are very likely to share some context and syntactic relation with the definiendum. According to [Cui et al., 2007], this shallow syntactic information is captured by the bigram likelihood estimates and the state transition probabilities in the bigram and PHMM models, respectively.

In their experimental settings, [Cui et al., 2005] made use of 761 sentences taken from the TREC-12 data set for training their soft pattern matching models, while they took advantage of the fifty questions submitted in the TREC-13 challenge for testing their models. As an outcome of their experiments, both models were found to ameliorate the performance of the original technique (cf. [Cui et al., 2005]). The Bigram Soft Patterns and PHMM Soft Patterns improved the F(3)-Score by 7.36% and 5.00%, respectively. Two interesting findings are: (a) the PHMM Soft Pattern model is more sensitive to the amount of training data than the Bigram Soft Pattern model, and (b) as the training material increased, the difference in performance between both models narrowed.

In addition, [Sun et al., 2005] tested the Bigram Pattern Model in the TREC 2005 challenge. Their system produced fourteen definition sentences as a final output, finishing with an average F(3)-Score of 0.195, which is better than the median average obtained by the 71 submitted runs (0.156). This methodology achieved a score of 0.211 when combined with hard pattern matching in the same challenge. This response positioned this system as the fifth best system (the best run scored 0.248) [Voorhees and Dang, 2005].

4.10 Ranking Patterns in Concert with the TREC Gold Standard

Another way of judging the relevance of definition patterns was adopted by [Wu et al., 2005b]. Their approach learnt and ranked regularities that are often employed to elucidate different facets of the definiendum, in congruence with training material comprising:

1. An array of definienda.

2. A list of sentences carrying answering nuggets is supplied for each definiendum.

3. Each nugget is labelled as “vital" or “okay".

In more detail, they capitalised on the TREC 2004 definition corpus, which consists of 64 definition queries. Their definition QA system parsed answer sentences, and walked through the trees in a bottom-up fashion. In this walking process, they detected the parts of the tree that match the answer nuggets, thereby preserving the syntactic structures corresponding to these matches. Subsequently, [Wu et al., 2005b] rated these preserved patterns in accordance with the labels (i.e., “vital" or “okay") assigned by the TREC 2004 assessors. For example, [Wu et al., 2005b] indicated that this procedure discovers constructs such as “NP VP", “NP NP" and “NP PP". These generalised regularities, nonetheless, can still overmatch, and thus the aligned sentences must be reexamined. As a means of doing this, [Wu et al., 2005b] enriched their syntactic rules with semantic attributes, such as comparative adjectives, digits, and topic-related verbs and phrases. Next, patterns are re-evaluated, and the constructs that are shown to be more likely to recognise “vital" and “okay" nuggets are preferred over those that are more inclined to match irrelevant text fragments. Numerous low-ranked rules are expunged

at this point. As an outcome of this process, they ended up with 34 rules such as:

VBD PP PP_t PP_d

NP JJS NN NN_t

NP JJS NN NNS_t

When ranking answer candidates, their definition QA system follows the same procedure. As a means of dealing with partial matches, their system rates putative answers in consonance with how well their semantic features align. The final value is hence the product of both the pattern and the matching score. In the TREC 2005 challenge, this method helped their definition QA system to occupy sixth place with an average F(3)-Score between 0.205 and 0.207. It is worth remarking here that the median in this track was 0.156 and the best run obtained an average F(3)-Score of 0.248. In their error analysis, [Wu et al., 2005b] mainly alleged that some deterioration in performance was caused by inexact answers resulting from inaccuracies of the Named Entity Recogniser (NER), namely comma-separated names of people.

In the TREC 2007 challenge, [Wu et al., 2007] extended this definition QA system by integrating the analysis of relative words. More precisely, they incorporated highly frequent terms within sixty-word windows around the definiendum in conjunction with delineative words collected from the Web. Analogously to [Wu et al., 2005b], this definition QA system parses and walks through the syntactic structure of the candidate sentences. At each level of the parse tree, the content is evaluated, and a score is assigned in sympathy with the next two equations8:

\[ S' = \sum \frac{S_{topic} + S_{digit} + S_{rep} + S_{adj}}{4} \]

\[ S = \frac{\alpha \cdot S'}{L_{pattern} + 1} + \frac{(1 - \alpha) \cdot 64}{L_{content}} \]

In these formulae, S' is the sum of the scores of every syntactic unit at each level of the parse tree. To make this point clearer, [Wu et al., 2005b] mentioned that the value of S' for the level “NP JJS NN NNS" would be the sum of the individual scores for “NP", “JJS", “NN" and “NNS". In these equations, L_pattern is the number of syntactic units, while L_content is the number of words in the text snippet, and α is an empirical weighting factor. The remaining parameters of these mathematical relations are not clearly specified in their work.

Eventually, [Wu et al., 2007] singled out the top thirty nuggets for the final output, achieving an average F(3)-Score between 0.216 and 0.242 in the TREC 2007 definition QA subtask, holding sixth place. It is worth noting here that the best response across all participants accomplished an average F(3)-Score of 0.329, while the median was 0.118.
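With the caveat that several parameters are left unspecified in [Wu et al., 2007], a literal reading of the two formulae can be sketched as follows; the per-unit sub-scores and the value of α are assumed inputs, and the constant 64 is taken from the formula as reported.

def unit_score(s_topic, s_digit, s_rep, s_adj):
    """Average of the four semantic sub-scores of one syntactic unit."""
    return (s_topic + s_digit + s_rep + s_adj) / 4.0

def pattern_content_score(unit_scores, l_pattern, l_content, alpha=0.5):
    """unit_scores: one (topic, digit, rep, adj) tuple per syntactic unit;
    l_pattern: number of syntactic units; l_content: number of words in the
    snippet; alpha is an empirical weighting factor whose value is unreported."""
    s_prime = sum(unit_score(*scores) for scores in unit_scores)
    return alpha * s_prime / (l_pattern + 1) + (1 - alpha) * 64.0 / l_content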

4.11 Combining Trigram Language Models with TF-IDF

In general terms, the definition QA system implemented by [Whittaker et al., 2005] is predicated on a variation of a speech summarisation approach proposed by [Kikuchi et al., 2003]. First, this method discards highly frequent words that are unlikely to belong to an answer nugget. Secondly, this system chooses the top 500 sentences carrying between 40 and 220 bytes.

8These formulae are formed of factors that are obscurely explained in [Wu et al., 2007]. It is very interesting, nevertheless, to keep in mind what features they use and how they are intermixed as a means of rating answer candidates.

These sentences are then rated in tandem with the number of topic words they bear. To be more precise, the larger the number of terms (w_i) related to the definiendum they embody, the higher their position in the rank. The scores of these words were based on Inverse Document Frequency (IDF) values acquired from the AQUAINT corpus. Thirdly, they selected up to 175 sentences in agreement with a combination of a linguistic score L(w_i) (trigram language models) and a significance score I(w_i) (measured by a TF-IDF score) as well as the following equation:

\[ S(A_i) = \frac{1}{N} \sum_{i=1}^{N} \big( L(w_i) + \alpha \cdot I(w_i) \big) \]

In this equation, N stands for the number of words in the putative answer A_i. In TREC 2005, this approach assisted in reaching an average F(3)-Score between 0.091 and 0.138, while it helped to obtain an average F(3)-Score between 0.60 and 0.64 in the TREC 2006 challenge [Whittaker et al., 2006]. The best run in TREC 2005 scored 0.248 [Voorhees and Dang, 2005], whereas the best run in TREC 2006 scored 0.250 [Dang et al., 2006]. In TREC 2007, a slightly different system finished with an average F(3)-Score between 0.110 and 0.118 (best: 0.329) [Dang et al., 2007]. The differences concerned the sentence retrieval and selection modules [Whittaker et al., 2007].
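Assuming a trigram language model score L(w) and a TF-IDF significance score I(w) are available for each word, the selection score can be sketched as follows (illustrative names).

def sentence_score(words, linguistic_score, significance_score, alpha=1.0):
    """Per-word average of the trigram-LM score L(w) and the weighted
    TF-IDF significance score I(w)."""
    n = len(words)
    if n == 0:
        return 0.0
    return sum(linguistic_score(w) + alpha * significance_score(w) for w in words) / n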

4.12 Web Frequency Counts

In TREC 2006, [Kaisser et al., 2006] outputted the best response (F(3)-Score = 0.250) by enhancing their system presented in [Kaisser and Becker, 2004]. The first step in their answering strategy is collecting term frequency counts from the Web. These counts are distilled from the top fifty web snippets returned by a search engine9. As in other strategies, stop-words are not considered when counting. As a means to illustrate, [Kaisser et al., 2006] listed the frequency counts of the words produced for the definiendum “Warren Moon":

Candidate sentences are later scored in conformity with the weights of their terms. In other words, they are rated by summing their respective term weights deduced from the Web, divided afterwards by their length in non-white space characters. They iteratively singled out the highest-scored sentences and removed them from the candidate set after their selection. At each iteration, the weights of the words belonging to the chosen sentence are divided by five. In this decay method, terms belonging to the definiendum are divided by two.

9Actually, [Kaisser et al., 2006] did not deal at great length with their search strategy.

Sentences are then re-rated and this process iterates until the length of all selected answers surpasses an experimental threshold.
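The iterative selection with weight decay can be sketched as follows; the term weights are assumed to be the web-derived frequency counts restricted to words of the candidate set, the tokenisation is naively whitespace-based, and the decay divisors (five, and two for definiendum terms) follow the description above (names are illustrative).

def select_answers(candidates, term_weights, definiendum_terms, max_length):
    """Iteratively pick the best-scoring sentence, decay the weights of its
    words, and stop once the selected output exceeds the length allowance."""
    weights = dict(term_weights)                       # work on a copy
    remaining = list(candidates)
    selected, total_len = [], 0
    while remaining and total_len <= max_length:
        def score(sentence):
            scored = [w for w in sentence.split() if w in weights]
            length = len(sentence.replace(" ", ""))    # non-whitespace characters
            return sum(weights[w] for w in scored) / max(length, 1)
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
        total_len += len(best.replace(" ", ""))
        for w in set(best.split()):
            if w in weights:
                weights[w] /= 2 if w in definiendum_terms else 5
    return selected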

          Run I    Run II   Run III
length    1400     850      650
result    0.250    0.229    0.203

Table 4.14: Length of the output versus final score (source [Kaisser et al., 2006]).

An interesting aspect of the third run is that their system took advantage of a re-ranking procedure grounded on importance indicators (e.g., superlatives). Intrinsically, this re-scoring method is in the same spirit as the works of [Kosseim et al., 2006, Razmara and Kosseim, 2007] (see section 4.4 for more details). As table 4.14 shows, the performance of this third run dropped. At any rate, this diminution can stem from shorter outputs (a lower threshold was enforced) in conjunction with the fact that the TREC evaluation is biased in favour of larger responses.

To a great extent, this technique relies largely on the Distributional Hypothesis [Harris, 1954, Firth, 1957]; this means it finds highly frequent terms within the context of the definiendum and makes use of these words for rating answer candidates accordingly. Certainly, the degree to which these terms typify the definiendum depends heavily on the amount of redundancy downloaded from the Internet. This methodology is thus sensitive to this factor when determining trustworthy web frequency counts. Nevertheless, this strategy mitigates this drawback by solely utilising the words within putative answers taken from the AQUAINT corpus when ranking. Regularly, this ensures that some misleading terms extracted from the Web will be useless, because they will be unlikely to appear within the array of candidate sentences. Put differently, the answer candidates also act as a filter of some spurious words gathered from the Web. Inversely, banking exclusively on these web frequency counts can induce the loss of some good answers that are not covered by these web terms. This is conspicuous, though not in every respect, when there is a sharp dissonance between the predominant contexts -or senses- across web snippets and across the target collection.

In TREC 2007, [Schlaefer et al., 2007] achieved an average F(3)-Score between 0.156 and 0.189 (the median of this track was 0.118, and the best run reached a value of 0.329) by making use of this score computation strategy plus some enhancements:

(a) A parameter that signals how the decay factor of a word is devalued after contributing to the score of a selected answer.

(b) In order to lower the bias in favour of longer document surrogates, they normalised the score by utilising the logarithm of the number of terms it contains.

(c) They capitalised on an online dictionary for acquiring global frequency counts of words. The logarithm of these global counts was applied specifically for normalising the raw counts harvested from the Internet, or, to state it more precisely, for compensating the overweighting of common terms, while at the same time counteracting the underweighting of more specific terms.

Furthermore, it is worth stressing here that [Schlaefer et al., 2007] considered Wikipedia as their primary source of web terms. When this is unsuccessful, they get the top 100 snippets from Google for a pack of generated queries, download the documents, extract all words and count their frequencies.

4.13 Part-of-Speech and Entity Patterns

Basically, [Gaizauskas et al., 2004] continued the trend of TREC definition QA systems and learnt words that co-occur with the definiendum across three distinct resources: web pages, Wikipedia and Encyclopaedia Britannica articles. They linked each term to a weight in accordance with these co-occurrence frequencies. In a special manner, this list of words was extended by adding normalisations and morphological variations of these terms. To illustrate, [Gaizauskas et al., 2004] listed some of the terms regarding “Horus" and “Crips":

Definiendum    Terms
Horus          falcon-headed, god, solar, Egyptian, deity, ...
Crips          graffiti, art, gangs, gang, los, angeles, ...

Table 4.15: Sample terms (source [Gaizauskas et al., 2004]).

Subsequently, their definition QA system takes into consideration the next three factors10 for ranking a candidate sentence Ai:

1. A function Main(Ai) that returns one whenever the definiendum exactly matches Ai, 0.5 if Ai contains an alias, and zero otherwise.

2. The sum RelTerm(Ai) of the weights of the terms that overlap with the characterising words harvested from the three external sources.

3. If the answer candidate aligns any of the two kinds of definition patterns: (1) lexical rules, and (2) POS/named entity patterns, then:

(a) Another score DefP atterns(Ai) regards a boolean value symbolising whether or not the answer candidate observes a lexical definition pattern.

(b) The next ranking value POSP atterns(Ai) sums the weights of the POS/entity pat- terns that match the answer candidate.

With regard to the POS/entity patterns, they were inferred by exploiting data sets supplied by prior definition QA tracks, namely TREC 2003. This class of regularity was derived for each type of definiendum separately, and it has two forms: (a) <definiendum> X2 X3 X4, or (b) X1 X2 X3 <definiendum>. In these pattern structures, the slots Xi can be occupied by a date, POS tag, punctuation mark, or title, whereas <definiendum> can also be an alias of the concept being described. Indeed, this sort of syntactic construct shows a slight resemblance to the definiendum-type oriented regularities depicted in table 4.5. One distinction between both strategies is the overwriting of some tokens with their POS tags. In this respect, the strategy of [Gaizauskas et al., 2004] is akin to the sequences of tokens utilised for learning soft patterns (see section 4.9) and the syntactic patterns induced by the technique presented in section 4.10.

In the first place, sentences embodying the definiendum were fetched from the AQUAINT corpus and automatically marked with the definiendum, POS information, dates, and titles11. Only the top ten ranked sentences make it to the next step.

10 Actually, [Gaizauskas et al., 2004] also consider an additional score ExcludeTerms(Ai) aimed specifically at removing the facts presented in previously answered questions pertaining to the same definiendum. This section does not deal with this factor because it turns out to be a property too specific to this challenge, and not a determining factor for definition ranking.

11 For this purpose, they took advantage of GATE tools: http://gate.ac.uk

Definiendum    Descriptions
Horus          Osiris, the god of the underworld, his wife, Isis, the goddess of fertility, and their son, Horus, were worshiped by ancient Egyptians.
Crips          to the FlyPlayaWeb site will see the words “C'z Up," a greeting used by the Crips, an infamous street gang

Table 4.16: Some outputted examples provided by this system (source [Gaizauskas et al., 2004]).

In the second place, co-references are resolved, and instances of both sequences introduced earlier are gathered. In the third place, the score associated with each sequence was given by the reciprocal of its ranking position. Therefore, regularities found in the most crucial sentences obtain a score of 1.0, whereas those found in the least important sentences get a score of 0.1. Lastly, the scores for each pattern were summed for each instance.

As a means of singling out answers for producing the final output, sentences are sorted in descending order by (in order): Main(Ai), RelTerm(Ai), DefPatterns(Ai) and POSPatterns(Ai), and in ascending order by ExcludeTerms(Ai). Sentences are then aggregated until they meet a length allowance. In TREC 2004, their definition QA system finished with an average F(3)-Score between 0.317 and 0.321, securing fourth place. The best system in this track reaped an average F(3)-Score of 0.460. Table 4.16 sketches some nuggets found by this rating procedure.
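The final ordering amounts to a multi-key sort followed by length-bounded aggregation, which can be sketched as follows; each candidate is assumed to carry its precomputed factor scores (illustrative representation).

def order_candidates(candidates):
    """candidates: dicts with the keys 'main', 'rel_term', 'def_patterns',
    'pos_patterns' (sorted descending) and 'exclude_terms' (sorted ascending)."""
    return sorted(
        candidates,
        key=lambda c: (-c["main"], -c["rel_term"], -c["def_patterns"],
                       -c["pos_patterns"], c["exclude_terms"]),
    )

def assemble_output(ordered, length_allowance):
    """Aggregate sentences until the length allowance is met."""
    output, used = [], 0
    for cand in ordered:
        if used + len(cand["text"]) > length_allowance:
            break
        output.append(cand["text"])
        used += len(cand["text"])
    return output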

4.14 Phrases, Head Words and Local Terms Statistics

Notably, [Han et al., 2004] interpreted noun and verb phrases as answer candidates instead of sentences. They picked these putative answers by checking whether or not the syntactic trees of their respective sentences observe some regularities. These regularities encompassed noun phrases that: (a) directly modify the definiendum, and (b) are used as a complement in copulas. Further, they extracted verb phrases: (a) whose nominative or possessive relative pronoun directly modifies the definiendum, (b) that are particle phrases, and (c) whose head is not a stop verb. Interestingly enough, this conception of stop verbs goes hand in hand with the verbs perceived as negative evidence by [Kil et al., 2005] (see section 4.5). Furthermore, for the purpose of tackling misparsing, [Han et al., 2004] implemented some POS-based heuristics, such as attaching isolated determiners, adjectives and prepositions when they are left outside the boundaries of the succeeding phrase. They also trimmed incomplete noun phrases (i.e., those ending with a conjunction or a relative pronoun).

According to [Han et al., 2004], the head word is essentially the gist of each answer candidate. To understand this more clearly, “player" is the most important part of the noun phrase “a tennis player". For this reason, one of the factors rdd(Ai) they weighted when ranking a putative answer Ai was the frequency of its head word as the head word of all sentences fetched from the corpus. This frequency is then divided by the number of answer candidates of the corresponding type (noun or verb phrase).

A second factor loc(Ai) they took into consideration when scoring putative answers was term statistics across the passages retrieved from the AQUAINT collection:

\[ loc(A_i) = \frac{\sum_{t_i \in A_i} \frac{sf(t_i)}{max\_sf}}{|A_i|} \]

In this equation, sf(t_i) denotes the number of retrieved sentences carrying the term t_i, max_sf is a normalising factor standing for the highest value of sf(t_i), and |A_i| is the number of content words in A_i.

In addition, [Han et al., 2004] calculated a specific weight for definienda of type person. In so doing, they capitalised on KBs. The underlying idea behind this is their observation that frequent words across a restricted set of KB articles about people will also be more salient in other biographical definitions regarding people. Ergo, they defined the probability of a term t with respect to an array of training biographical articles about people:

\[ P_{person}(t) = \frac{Frequency\_in\_Knowledge\_Bases(t)}{\sum_{\forall t} Frequency\_in\_Knowledge\_Bases(t)} \]

The key and attractive aspect of this method is that, homologously to P_person(t), they also stipulate P_text(t) as the likelihood of finding the term t in general texts, and the biographical weight weight(t) of a word t then becomes the ratio of P_person(t) to P_text(t). This term probability ratio assigns higher weights to words that are embodied frequently in the encyclopedia while, at the same time, appearing only a few times in general texts. In TREC 2004, they made use of two different biographical factors for rating an answer candidate A_i. Both synthesise the previously presented weight and probabilities differently:

\[
bio_1(A_i) =
\begin{cases}
\dfrac{\sum_{\forall t_i \in A_i} \log_2(P_{person}(t_i) \cdot weight(t_i) + 1)}{|A_i|} & \text{if } A_i \text{ is a noun phrase;} \\[10pt]
\dfrac{\sum_{\forall t_i \in A_i} \log_{10}(P_{person}(t_i) + 1)}{|A_i|} & \text{if } A_i \text{ is a verb phrase.}
\end{cases}
\]

\[ bio_2(A_i) = \frac{\sum_{\forall t_i \in A_i} P_{person}(t_i)}{|A_i|} \]

These factors are eventually linearly interpolated as follows:

\[ Score(A_i) = \alpha \cdot rdd(A_i) + \beta \cdot loc(A_i) + \gamma \cdot bio(A_i) \]

where the empirical parameters α, β and γ must add up to one. Redundant answers were expunged by verifying the overlap between their words and the semantic classes of their heads. They eliminated the lower-ranked answer whenever the comparison with a higher-scored answer indicated that their term overlap was equal to or greater than 70%, or that they share the same synset in WordNet together with a word overlap equal to or greater than 30%. In TREC 2004, [Han et al., 2004] generated three runs: the first utilised bio_1(A_i), achieving an average F(3)-Score of 0.246; the second took advantage of bio_2(A_i), accomplishing an average F(3)-Score of 0.229; and the third did not benefit from the biographical term weight. This last configuration reached an average F(3)-Score of 0.247, occupying seventh place. The best response across all participants finished with an average F(3)-Score of 0.460.
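Assuming the head-word redundancy factor and the per-term statistics have been computed beforehand, the local and biographical factors and their interpolation can be sketched as follows; the concrete values of α, β and γ are not reported and are placeholders here, and all names are illustrative.

import math

def loc_score(terms, sentence_frequency, max_sf):
    """Local term statistics: average of sf(t)/max_sf over the content words."""
    if not terms:
        return 0.0
    return sum(sentence_frequency.get(t, 0) / max_sf for t in terms) / len(terms)

def bio1_noun_phrase(terms, p_person, bio_weight):
    """Biographical factor for noun-phrase candidates."""
    if not terms:
        return 0.0
    total = sum(math.log2(p_person.get(t, 0.0) * bio_weight.get(t, 0.0) + 1)
                for t in terms)
    return total / len(terms)

def combined_score(rdd, loc, bio, alpha=0.4, beta=0.3, gamma=0.3):
    """Linear interpolation of the three factors; the weights must add up to one."""
    return alpha * rdd + beta * loc + gamma * bio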

4.15 Predicates as Nuggets and Page Layout as Ranking Attribute

Primarily, [Ahn et al., 2004] noticed that most descriptive nuggets can be verbalised with simple predicates, i.e., normally the verb with all its arguments and modifiers. Therefore, their definition QA system additionally balances the structure of predicates when rating answer candidates. For starters, predicates are identified by means of Minipar12 [Lin, 1994]. Secondly, each nugget t_i starts with an initial score I_initial(t_i) assigned by the retrieval engine in sympathy with the rank of the document where the nugget t_i was found. Precisely, they gathered these nuggets from the top twenty documents fetched by the retrieval engine from the target (AQUAINT) collection.

Thirdly, similarly to the target collection, [Ahn et al., 2004] harvested descriptive information from a KB article related to the definiendum. Most remarkably, these mined nuggets were rated in congruence with heuristics predicated on the layout of the document. As for features, they considered their order of occurrence in the document. Put differently, like [H. Joho and M. Sanderson, 2000] (see section 4.2), [Ahn et al., 2004] made allowances for the proximity to the beginning of the article, as more pertinent nuggets are typically expressed immediately. As well as that, they realised that data in tables is usually critical.

As a means to better the accuracy of the initial score given to each nugget emanating from the target collection, [Ahn et al., 2004] made use of two sorts of sentence-level similarity measures: lexical and semantic. The former is determined conforming to the degree of word overlap with the sentences collected from the KB article. In this comparison, they utilised a stemmed version of the sentences, and they also removed stop-words; this way they strengthened the morphological resemblance of the sentences. Thus, [Ahn et al., 2004] applied the Jaccard measure [Jaccard, 1912], which aided in calculating the degree of overlap and in normalising over the length of the sentences. As to the semantic similarity between sentences, two sorts of metrics were employed: (a) the total WordNet distance of words appearing within the sentences [Boni and Manandhar, 2003], and (b) the similarity scores between pairs of words derived from proximities and co-occurrences in large corpora [Lin and Pantel, 2001], summing the total proximity measure for the words in the two segments. Then, the refined estimates of the answer candidates I_refined(t_i) are given by the following equation:

12 Minipar is available at webdocs.cs.ualberta.ca/∼lindek/minipar.htm.

\[ I_{refined}(t_i) = I_{initial}(t_i) \cdot \max_j \big( I(r_j) \cdot sim(t_i, r_j) \big) \]

In this formula, r_j stands for an authoritative nugget from the KB, the similarity between two nuggets is denoted as sim(t_i, r_j), and the importance of the dependable nugget as I(r_j). Eventually, [Ahn et al., 2004] returned the top-ranked nuggets. On a final note, as an outcome of their experiments, they found out that their definition QA system missed many nuggets because they were not subsumed in the top twenty documents retrieved from the collection. Another source of error was the normalised word overlap. Because of its sparsity, the decision about a match was made chiefly on the basis of only one overlapping word, despite the fact that they stemmed sentences prior to comparison.
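A sketch of the refinement step, using the word-overlap (Jaccard) similarity against the KB nuggets as the sim function; the semantic similarity variants are omitted, and all names are illustrative.

def jaccard(tokens_a, tokens_b):
    """Word-overlap similarity between two stemmed, stop-word-free token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def refined_score(initial_score, candidate_tokens, kb_nuggets):
    """kb_nuggets: list of (importance, tokens) pairs mined from the KB article."""
    best = max((imp * jaccard(candidate_tokens, toks) for imp, toks in kb_nuggets),
               default=0.0)
    return initial_score * best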

4.16 Propositions, Relations and Definiendum Profile

To begin with, the definition QA system devised by [Xu et al., 2003] differentiates two classes of definition queries: who and what. They then retrieved the top 1,000 documents from the collection, and afterwards separated those sentences that explicitly or implicitly (via co-references) bear the definiendum. Subsequently, [Xu et al., 2003] exploited the following five strategies for extracting putative answers:

1. The first source of answer candidates regards appositive and copula constructions (for examples, see patterns (f) and (g) on the first listing, and rules (a) and (d) on the second listing in section 4.2). These constructs were identified from parse trees by means of simple rules.

2. They capitalised on a rough approximation of predicate-argument structures (called propositions), which were discerned from the parse tree. They generated a list of propositions that were very likely to be used for delineating an entity (e.g., “<definiendum> was born on ..."). Propositions that matched one of some pre-defined structures were seen as special, while others were interpreted as ordinary. Special propositions focus chiefly on a particular type of definiendum (e.g., person).

3. In a spirit similar to [Harabagiu et al., 2003, Blair-Goldensohn et al., 2003], they implemented more than forty handcrafted structured rules that typically signal descriptions. Basically, these patterns are derived from the parse trees and are aimed essentially at the same structures listed in section 4.2. At any rate, the reader can still check the details on their implementation provided by [Xu et al., 2003] on his/her own.

4. They normalised propositions in agreement with relations that can be found in an ontology. The deliberate intention is grouping propositions that describe the same relationship.

5. As a fallback mechanism, they utilised sentences containing the definiendum.

Subsequently, [Xu et al., 2003] rated the obtained answer candidates in concert with two factors: their class and their likeness to the profile of the definiendum. To be more precise, putative answers are put in the following order: appositives and copulas (1) at the top, then structured patterns (3), next special propositions (2), later relations (4), and eventually ordinary propositions (2) and sentences (5). The answer candidates of each kind are then ranked in congruence with their similarity to the profile. This likeness is given in terms of the TF-IDF score and the bag-of-words representation.

In regard to the profile of the definiendum, [Xu et al., 2003] outlined three complementary ways of constructing it. In the first place, they looked for articles about the definiendum across KBs (see table 2.4 in section 2.3). Additionally, they took advantage of Google for mining the Web for descriptive knowledge (see section 2.4 for details). The discovered descriptions were utilised thereafter for forming a centroid vector comprising words and their respective frequencies. This vector was accordingly used as the profile of the definiendum.

Two complementary options were considered as a result of the fact that their definition QA system could not download relevant articles from the KBs for all definienda: (a) in the case of who queries, or in other words, in the case that the definiendum is a person, they learnt the centroid vector from a collection of 17,000 short biographies harvested from www.s9.com, and (b) in the case of what questions, they learnt the centroid from all putative answers. The assumption here is that the words co-occurring most with the definiendum are its most typifying words. This is, of course, an assumption that permeates most of the techniques presented in this chapter.

There is, nevertheless, one key extra aspect of this fallback strategy which must be cogitated as an alternative way of tackling definition queries. The idea of building the centroid vector with biographies connected with several persons, and rating answer candidates in sympathy with this vector afterwards, is interesting. The underlying idea is that biographies of people share some common features and attributes, or to put it another way, kinds of nuggets (e.g., birthday, birthplace, relevant achievements, and names of parents). Correspondingly, these commonalities can be exploited to recognise descriptive content regarding definienda of people for which KBs do not supply coverage. Another positive advantage of this idea is that it can assist in coping with the data sparseness that characterises strategies that learn features (terms) from KBs and project them into the candidate sentences afterwards. In particular, this can help to distinguish those nuggets conveyed with words that do not appear in articles about the definiendum, but overlap with several articles about people.

The disadvantage is, however, that this method pre-assumes a kind of definiendum before constructing the centroid vector; that is, it needs an extra step that makes this distinction at the beginning of the answering process. Depending on the output of this initial phase, the system could be dealing with potential senses different from the ones existing in the target collection. This is a critical issue because it is well-known that definienda are very likely to bear several senses.
For instance, many places or companies are named after people, small companies can also be named after places, and some movies, books and songs can easily bear the same name (e.g., “Ben-Hur"). In the specific case of the TREC 2003 challenge, due to its size, the target (AQUAINT) corpus does not yield the level of ambiguity that can be found in more massive collections, such as the Internet, which can be aimed at a larger number of wide-ranging topics. Consequently, this sort of abstraction should be used in conjunction with some efficient potential-sense detection strategies when tackling larger collections.

            Run I   Run II   Run III
m           0       0        5
n           5       20       10
F(5)-Score  0.521   0.520    0.555

Table 4.17: Results obtained by [Xu et al., 2003] in TREC 2003.

In TREC 2003, [Xu et al., 2003] generated three runs by adjusting some parameters in their definition QA system. Table 4.17 presents the achievements of this system. The length allowance of this system was 4,000 bytes; m coheres with the number of answer candidates accepted regardless of their type (only in terms of the profile); n stands for the number of sentences and ordinary propositions in the final output. Table 4.18 highlights, more interestingly, a breakdown of the response that achieved the best average F(5)-Score across all participants in the TREC 2003 challenge. Overall, the system shows a good performance, regardless of the class of definiendum.

Type    Number of Questions    F(5)-Score
Who     30                     0.577
What    20                     0.522
Total   50                     0.555

Table 4.18: Breakdown of the results achieved by the best run in TREC 2003 (source [Xu et al., 2003]).

4.17 The Definition Database and the BLEU Metric

In TREC 2005, [Katz et al., 2005] extracted answer nuggets from their pre-compiled definition repository. This database was built from the AQUAINT corpus by means of the strategy adopted by [Fernandes, 2004, Hildebrandt et al., 2004] (see section 2.2 on page 26 for more details). This array of putative answers was also extended by searching for short text fragments across the target collection that overlap with the definition of the definiendum in the Merriam-Webster dictionary. They also considered the top-ranked sentences fetched from the corpus as answer candidates.

For the purpose of rating answer candidates, [Katz et al., 2005] weighted the outcome of two scores: topic accuracy, and the product of the IDF and the term frequency of non-topic terms within putative answers. The latter factor follows the idea of local statistics exploited by manifold definition QA systems including [H. Joho and M. Sanderson, 2000, 2001, Han et al., 2004] (see sections 4.2 and 4.14). Above all, topic accuracy is stipulated as an F-measure based on word-based precision and recall of the topic. Their definition QA system singled out at most the top 24 sentences. This method assisted them in reaping an average F(3)-Score of 0.1557.

The novel aspect of their second and third runs is not the harvesting of the first paragraph of the best Wikipedia article about the definiendum, but the fact that they computed the likeness between a putative answer and this paragraph in congruence with the BLEU metric [Papineni et al., 2002]. This enrichment helped their system to reach an average F(3)-Score of 0.1606. Eventually, the third response extends the second by accounting for anaphora resolution. This last run accomplished an average F(3)-Score of 0.1602.

As a means of taking part in the TREC 2007 QA subtask, [Katz et al., 2007] sought to substantially enhance their definition QA system [Katz et al., 2005, 2006]. An interesting aspect of this system is that it makes allowances for passages and sentences gathered from the collection as answer candidates. Like their old system, this one also interprets hits from their definition repository as putative answers. In achieving this, [Katz et al., 2007] enlarged their definition database by approximately 1.25 million definitional snippets pre-extracted from the AQUAINT2 corpus.

As for ranking answer candidates, [Katz et al., 2007] benefited from the next four13 attributes:

1. Ftopic captures the overlap between the definiendum (or synonyms) and the words within the response in accordance with F(P, R, β = 2):

   Q = the set of unique terms in the definiendum (or an alias).
   R = the words in the best named entity and keyword matches in the response.
   M = the exact matching set between R and Q.

The precision P and recall R are then given by:

\[ P = \frac{\sum_{w \in M} IDF(w)}{\sum_{w \in R} IDF(w)} \qquad R = \frac{\sum_{w \in M} IDF(w)}{\sum_{w \in Q} IDF(w)} \]

At large, β = 2 signifies that recall is more critical than precision. As a matter of fact, this strategy can be conceived as a plausible alternative to the Jaccard Measure explicated in section 3.4 on page 61. In practical terms, it separately models this matching from two standpoints: the definiendum and the sentence, contrary to the Jaccard Measure, which computes a sole global score. Both measures remark the need for considering the accuracy of matching the definiendum as an indicator of the likelihood or fitness of an answer candidate to be a genuine answer. On the other hand, this score assigns a weight to each word, whereas the Jaccard Measure interprets each word as equally weighted. Admittedly, the contribution of some terms is nullified by the enrichment elucidated in section 3.4 on page 67, and the Jaccard Measure is also directed at verifying whether or not there is a shift in the topic of a sentence that observes some regularities. Lastly, it is unclear why [Katz et al., 2005]

13In effect, they incorporated an extra attribute in order to cope with specialities of the TREC challenge.

amalgamated both recall and precision into one score, since they could have capitalised on the three factors separately as features.

2. Finform approximates the “informativeness" of a particular response R by contrasting its combined IDF score with the corpus average IDF, IDFavg:

\[ F_{inform} = \frac{\sum_{w \in R} IDF(w)}{IDF_{avg} \cdot |w \in R|} \]

The idea behind this ingredient is detecting responses (e.g., sentences, chunks or paragraphs) embracing words that have some special distribution in the corpus. By special, it is meant terms that differ from the behaviour of average words within the collection. This factor bears some resemblance to the general text probabilities put into use by [Han et al., 2004] (see section 4.14 on page 105). Both Ftopic and Finform are sketched computationally at the end of this section.

3. Fsource is a list of fifteen plausible sources, e.g., appositive pattern, and whether it is a paragraph or sentence.

4. Fprojection quantifies the n-gram distributional similarity grounded on the BLEU metric [Papineni et al., 2002], between a response and an array of sentences extracted from articles regarding the definiendum across KBs (see table 2.4 on page 31).

These properties are fused into a score function that returns a boolean value. This function was trained on the basis of the TREC 2006 data and four distinct binary classifiers: Support Vector Machine (SVM), logistic regression, radial basis function, and decision trees. Their outcomes were synthesised by a logistic function ranging between zero and one. They additionally used their TREC 2006 system to break ties.

In TREC 2007, they submitted two runs. Both responses harvested Wikipedia and Google Timeline as KBs. The difference between both runs is that the second takes advantage of the new scoring function, while the first does not. The first response finished with an average F(3)-Score of 0.198, and the second run with 0.235, securing seventh place. In this track, the best run scored 0.329 and the median of all systems was 0.118.

In addition, [Katz et al., 2007] carried out experiments to test several classifiers and KBs. As a finding, they realised that using Wikipedia alone yields a larger performance improvement than making use solely of Google Timeline. They noticed that Google Timeline occasionally mixes references to various items with the same name, whereas Wikipedia articles are either right on target or very noisy, which makes it possible for the projection phase to filter out all inconsistencies.
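As announced above, a small sketch of the IDF-weighted topic match and the informativeness factor; the construction of the response word set R is assumed to have been carried out beforehand, and all names are illustrative.

def f_topic(q_terms, r_terms, idf, beta=2.0):
    """IDF-weighted precision and recall of the definiendum terms,
    combined as F(P, R, beta); beta = 2 favours recall."""
    q, r = set(q_terms), set(r_terms)
    m = q & r                                       # exact matching set M
    idf_m = sum(idf.get(w, 0.0) for w in m)
    precision = idf_m / max(sum(idf.get(w, 0.0) for w in r), 1e-12)
    recall = idf_m / max(sum(idf.get(w, 0.0) for w in q), 1e-12)
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

def f_inform(r_terms, idf, idf_avg):
    """Average IDF of the response words relative to the corpus-wide average."""
    if not r_terms:
        return 0.0
    return sum(idf.get(w, 0.0) for w in r_terms) / (idf_avg * len(r_terms))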

4.18 Conclusions

To recapitulate, this chapter describes the most interesting aspects of distinct ranking functions implemented by numerous definition QA systems in the TREC challenge. One of the strategies that transpires most systems is benefiting from definition patterns, which are regularities commonly found across sentences conveying descriptive content. These constructs are typically a couple of tokens preceding and/or succeeding the definiendum. Definition patterns can be either independent of or dependent on the kind of definiendum, and an instance of a pattern can vary slightly. These variations imply the insertion and/or deletion of tokens, and they can be modelled by means of soft patterns. Categorically, definition patterns can overmatch, since these clauses are not solely used for definitions, but also for expressing opinions, advertising, and writing general texts. Broadly speaking, more accurate rules have low frequency, or are used for outlining a specific class of nugget, whereas less precise cues supply a wider diversity of descriptions and can hence have a high frequency.

In order to boost the accuracy of the recognition of descriptive phrases observing, and also not observing, definition patterns, QA systems take advantage of linguistic processing such as POS tagging, syntactic parsing and morphological analysis. For instance, several systems profit from selective substitutions as a method for abstracting and deducing as well as aligning clues. At any rate, none of these tools has been the panacea for this classification problem, so far.

Another prevalent technique is predicated on learning specific regularities across articles about the definiendum gathered from KBs. These regularities normally include lexical items and entities. More often than not, this kind of definition QA system is capable of detecting descriptive phrases that have a significant word overlap with the respective KB articles, while at the same time tending to miss many essential nuggets that have little or no term overlap with these articles. In addition, these systems single out many misleading sentences for the final output. Two ways of coping with this obstacle include: (a) enriching this projection with syntactic information, and (b) growing the number of KBs exploited by this class of QA system. However, this sort of strategy starts from the wrong assumption that the context/senses are ruled by the KBs, not by the target corpus or sentence, and benefiting from descriptions obtained from a larger number of KBs has not shown to be the final solution.

As for features, definition QA systems account for diverse attributes. In the first place, the first mention of an entity (potential definiendum) in a piece of news is likely to be accompanied by an introductory description. In the second place, superlative adverbs and adjectives are instrumental, but not unmistakable, signs of essential characteristics. In the third place, there are also negative indicators of definitions: some verbs and enumerations of named entities. In the fourth place, the layout of pages, the ranking assigned by the IR engine and the length of the answer candidate are utilised as features. Lastly, lexical items are the most widely used property.

With respect to machine learning approaches, several methodologies have been tried, including SVM and loglinear models. However, the vital issue is the acquisition of negative samples.
On the one hand, massive and reliable positive samples are distilled from KBs; on the other hand, it is hard to obtain a large-scale and balanced negative set of training examples without involving manual annotations. Therefore, strategies that learn from positive examples, such as LMs, are naturally preferred.

By and large, definition QA systems acquire words highly correlated with the definiendum as the most prominent and discriminative features. Ergo, these systems require a considerable amount of contexts carrying the definiendum in order to infer these characterising terms. In so doing, systems take advantage of extra contexts fetched from the Internet. In many cases, simple frequency counts learnt from web snippets have shown to be a decisive factor in the enhancement of definition QA systems.

On a final note, systems typically rate sentences as putative answers in concert with an array of pre-defined features (i.e., lexical items). Conventionally, definition QA systems inspect the set of answer candidates as a means to derive some predominant values for some attributes, and accordingly rank putative answers afterwards. However, for the purpose of dealing with the inherent data sparseness eventuated from this array of sentences, systems boost the ranking score of candidates that bear some similarity to articles about the definiendum taken from KBs. This boosting can be done in terms of augmenting the weights of words collected from the set of answer candidates, or by straightforwardly computing the resemblance of each putative answer to the articles harvested from KBs. In a special manner, the most widespread ranking method is the centroid vector.

In closing, this chapter dealt at greater length with assorted approaches to discover answers to definition questions; chiefly, methods designed in the context of the TREC challenge, which are therefore directed at collections of news documents, namely the AQUAINT corpus. On the whole, these definition QA systems, in conjunction with the results accomplished in the different versions of this challenge, show that the phenomenon behind this task is still scientifically unexplained. For this reason, along with the fact that definition questions are vastly requested by Internet users, it remains an interesting research area.

Chapter 5 Extracting Answers to Multilingual Definition Questions from Web Snippets

“Thousands of people were producing new Web sites every day. We were just trying to take all that stuff and organize it to make it useful." (David Filo)

“The beginning of knowledge is the discovery of something we do not under- stand." (Frank Herbert)

5.1 Introduction

Generally speaking, the definition Question Answering (QA) systems summarised in chapter 4 are geared towards discovering answers across the AQUAINT corpus, which is the official corpus of the QA subtask of the Text REtrieval Conference (TREC) challenge. Conspicuously, the trend of these systems reflects the following prominent characteristics:

• The goal of these systems is discerning answers across news articles in English. Evi- dently, this inherently means a large reliance on the lay-out and language as well as syntactic structures of this kind of document.

• Under the framework of evaluation specified by TREC, it is an open question which modules (strategies) are necessary to revisit in order to develop definition QA systems capable of dealing with queries in several languages. The motivation and hope here is the creation of algorithms that can port, at least, to languages akin to English.

• By and large, most of the best procedures harvest articles about the definiendum across Knowledge Bases (KB) as a source of descriptive knowledge, which is projected into the candidate set of answers afterwards. The overall performance of systems that capitalise on this technique normally falls into a steep decline when these resources offer narrow coverage. It is thus encouraging to seek ways of widening this coverage and benefiting from other types of sources such as web snippets.

• An essential characteristic transpiring most of the best systems is the utilisation of definition patterns. There are assorted approaches to employing these rules; they differ in their strictness, level of abstraction and linguistic knowledge (e.g., parse trees, Part-of-Speech (POS) tagging).


All things considered, it is especially interesting to design methodologies that can cope with several languages under the assessment framework stipulated by TREC (see section 1.7 on page 15). To be more specific, it is interesting to study which key components of a definition QA system need to be language dependent in order to enhance its performance, and which modules can remain transparent to the language. A study in this direction would also disclose diverse advantages and disadvantages pertaining to each language, and expectedly, it would also help to begin to see the light at the end of the tunnel.

Another desirable property of a definition QA system is less reliance on the coverage given by Knowledge Bases (KB) to each particular definiendum. This is particularly true when contemplating target languages other than English, as this coverage narrows considerably for other languages. This excessive reliance can be alleviated, or hopefully eliminated, by understanding the linguistic phenomena that typify definitions in general. It is, thus, the hope to be able to learn these characteristics and apply them to recognise further descriptions afterwards.

This chapter introduces two different strategies for multilingual definition QA directed at web snippets. A practical use of a system operating on web snippets can be its exploitation as a supplementary or fallback procedure whenever a TREC system fails to find descriptions across KBs, or simply, it might be used to straightforwardly query the Web for descriptive information.

This chapter is organised as follows: the next section outlines methods that are predicated upon redundancy for distinguishing the most statistically relevant answers in English and Spanish, and section 5.3 expands on the utilisation of multilingual KBs for rating answer candidates in both languages, in a way that is independent of the definiendum entered by the user.

5.2 Multi-Linguality, Web Snippets and Redundancy

Mainly, TREC definition QA systems profit from manifold techniques when recognising answers across the whole AQUAINT corpus. To begin with, some systems account for a prior off-line processing of this target collection of documents [Fernandes, 2004, Hildebrandt et al., 2004, Katz et al., 2005, 2006, 2007]; once this pre-processing is computed, it helps to accelerate the answering process and normally to ameliorate the performance. In a sense, this is a plausible and reasonable alternative when coping with static collections. This pre-processing is, however, implausible when tackling collections that are constantly changing and growing, or whose size is too large to be entirely processed with Natural Language Processing (NLP) tools. For the most part, the variety of NLP tools exploited by definition QA systems fluctuates from Part-of-Speech (POS) tagging (see section 4.9 on page 93) to parsing (e.g., [Xu et al., 2003], introduced in section 4.16 on page 107). These strategies are usually devised to deal with a specific language, and their performance also depends on the tool, the language and the target corpus. This dependency, along with the speed required for applying NLP tools to each document and the uncertainty about the boost in performance that this might cause, encourages researchers to explore the limits of techniques that make allowances solely for surface information.

Certainly, the fact that one of the fundamental ranking ingredients exploited, for example, by the best system in TREC 2006 (see section 4.12 on page 102) is web frequency counts indicates that much can be gained in terms of performance without deep linguistic processing, in particular from counts learnt from web snippets. This type of strategy is premised on the principle known as the Distributional Hypothesis. This principle states that highly correlated words in the same context (e.g., sentence, shingle and paragraph) are very likely to be strongly semantically related [Harris, 1954, Firth, 1957], and ergo, in the case of definition QA systems, to characterise the definiendum.

This principle is also the building block of numerous techniques, including those which collect statistics from articles about the definiendum harvested from KBs: [Xu et al., 2003] (section 4.16 on page 107), [Han et al., 2004] (section 4.14 on page 105), [Wu et al., 2004, 2005a, Zhang et al., 2005, Zhou et al., 2006] (section 4.3 on page 81), and [Cui et al., 2004b,a, 2005, Sun et al., 2005] (section 4.8 on page 91). As [Han et al., 2004] stressed in their ranking procedure, the major advantage of utilising KBs instead of general texts is the reliability of the extracted highly frequent terms. These words are very likely to be semantically connected, and at the same time, highly probable to be descriptive. At any rate, the restricted coverage yielded by these resources is what makes them less attractive. More precisely, this narrowness or data sparseness hurts the performance [Zhang et al., 2005, Han et al., 2006], particularly given the sharp contrast in their coverage for various languages. On top of that, the potential dissonance between the senses/contexts outlined in KBs and the target array of answer candidates is a problematic issue for this sort of methodology.

On a different note, [H. Joho and M. Sanderson, 2000, 2001]
demonstrated that a marked enhancement in performance can be achieved when adding syntactic information at the lexical and surface level. They showed that the usage of some definition patterns in amalgamation with frequency counts of words correlated with the definiendum can bring about a good performance (see section 4.2 on page 76). They also experimentally revealed that the performance goes hand in hand with the size of the collection, because a larger collection sharply increases the likelihood of matching a pre-determined set of rules, thus detecting the most promising descriptive terms. This finding makes the use of definition patterns propitious when coping with massive collections such as the Internet. Another advantage of massive collections is their amount of redundancy, which makes it possible to find numerous paraphrases of the same underlying ideas, therefore increasing the chances of finding a rewriting that observes a purpose-built pattern, while at the same time attenuating the number of missed descriptions. In substance, [Roussinov et al., 2004, 2005] noticed that redundancy is very likely to produce a high recall of descriptions (see section 4.6 on page 88), but a low precision, because it is very likely that the definition QA system will get manifold paraphrases of the most essential descriptions.

Contrary to the trend of definition QA systems presented so far, [Figueroa and Neumann, 2007, Figueroa et al., 2009] went beyond KBs and designed a framework for discovering answers to definition queries within web snippets. In this way, they challenged the observation of [Cui et al., 2004c] about the fruitfulness of web snippets (see details in section 4.9 on page 96). This framework is essentially aimed at multi-linguality, and it is thus compelled to diminish its dependence on the coverage of KBs and on NLP processing. More precisely, this definition QA system (M-DEFWEBQA) deals specifically with English and Spanish as target languages. The intention is also to build algorithms that do not rely on the information supplied by KBs about the definiendum, as this is not necessarily trustworthy all the time; moreover, the objective is capitalising on web document surrogates to cushion the already mentioned problem of coverage. The general framework is sketched in figure 5.1.

As a means of reaching a large degree of language independence, this system makes use of poor linguistic knowledge, basically a stop-list and an array of definition lexico-syntactic regularities at the word level. In detail, the packs of patterns utilised for English and Spanish are listed in tables 3.1 (page 65) and 3.2 (page 66), respectively. In this framework, the DEFINITION MINER module takes advantage of the corresponding rules for generating the query rewritings explicated in sections 2.5 on page 36 (English) and 2.7 on page 50 (Spanish). In a nutshell, this query rewriting strategy assists in biasing the search engine in favour of web snippets that are likely to put into words descriptive information about the definiendum in the respective language.

This imposed bias essentially boosts the redundancy, strengthens the recall of sentences matching definition patterns, and consequently makes it possible to produce the final output to the user accounting chiefly for statistical processing, lessening the reliance on statistics taken exclusively from KBs. Later, the PATTERN MATCHER component checks which of the fetched sentences match the pre-determined set of constructs by means of the Jaccard Measure, as underlined in section 3.4 (page 61). Analogously to the technique outlined in section 4.12 (page 102), this selection process is predicated on term co-occurrences across web snippets, that is, across the array of answer candidates.
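To make the matching step more concrete, here is a minimal sketch in Python. The cue list, the tokenisation and the 0.5 threshold are illustrative assumptions; the actual system relies on the pattern batteries of tables 3.1 and 3.2 and on the Jaccard Measure of section 3.4, whose details differ.

```python
import re

# Hypothetical word-level definition cues; the actual English and Spanish
# batteries are those listed in tables 3.1 and 3.2.
CUES = [" is a ", " is an ", " was a ", ", which is "]

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def relaxed_match(sentence, definiendum, threshold=0.5):
    """Keep a sentence if it aligns some cue and one of its token windows
    overlaps sufficiently with the definiendum (Jaccard over tokens)."""
    low = sentence.lower()
    if not any(cue in low for cue in CUES):
        return False
    d_tokens = re.findall(r"\w+", definiendum.lower())
    s_tokens = re.findall(r"\w+", low)
    n = max(len(d_tokens), 1)
    windows = (s_tokens[i:i + n] for i in range(max(len(s_tokens) - n + 1, 1)))
    return any(jaccard(w, d_tokens) >= threshold for w in windows)
```

The relaxed (windowed) comparison is what later allows the definiendum to shift to closely related concepts, a behaviour discussed in the experiments below.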

Figure 5.1: Framework for answering definition questions in several languages.

A novel and interesting attribute of the approach adopted by [Figueroa and Neumann, 2007, Figueroa et al., 2009] is the SENSE DISAMBIGUATOR component, which also takes advantage of the obtained redundancy for clustering matched sentences in agreement with the potential senses of the definiendum. Retrieved sentences that align the array of definition patterns are grouped and selected by the next three components of the flow depicted in figure 5.1: CONTEXT MINER, SENSE DISAMBIGUATOR and DEFINITION RANKER.

5.2.1 Snippet Context Miner

There are many-to-many mappings between names and their concepts. The same name or word can refer to several meanings or entities. For instance, places and companies are named after famous people and owners/locations, respectively. On the other hand, distinct names can indicate the same meaning or entity. As a means of tackling this, a (snippet) context miner sifts semantically connected terms from sentences. To illustrate, take the next set A of

descriptive sentences (answer candidates) recognised by the DEFINITION MATCHER:

1.

2.


3.

In these sentences, “American actor Tom Hanks" is referred to as “Thomas Jeffrey Hanks" and “Tom Hanks", whereas “Tom Hanks" also signals an American seismologist. In this method, a sense is a meaning of a word or one possible reference to a real-world entity.

The context miner is directed at extracting the different senses of the definiendum by observing the correlation of its neighbours in the reliable semantic space supplied by Latent Semantic Analysis (LSA) [Deerwester et al., 1990]. This multi-dimensional semantic space is built from a term-sentence matrix M, which renders the definiendum as a pseudo-sentence and is weighted in conformity with the traditional Term Frequency-Inverse Document Frequency (TF-IDF) metrics. The SNIPPET CONTEXT MINER distinguishes all the possible different n-grams (W) in A and their frequencies. The size of W, and hence of M, is then reduced by removing n-grams which are substrings of other equally frequent terms. In addition, [Figueroa and Neumann, 2007, Figueroa et al., 2009] discarded HyperText Markup Language (HTML) tags and isolated punctuation marks, and term frequency counts were consolidated in congruence with their uppercase variation. This removal step allows the system to speed up the computation of the semantic vectors (M = UDV') when using Singular Value Decomposition (SVD) in LSA. It is also worth noting here that the absence of syntactic information in LSA is slightly mitigated by considering strong local syntactic dependencies (n-grams).

While running time is not a key issue in state-of-the-art QA systems (effectiveness metrics such as F(β)-Score are those often used to compare methods), [Figueroa et al., 2009] still ran several setting experiments in order to obtain a taste of the running times taken by this critical task. Overall, the SVD task cannot be pulled apart from the computation of semantic vectors by LSA; otherwise, [Figueroa et al., 2009] would need a completely different kind of corpus-based semantic analysis method. Nevertheless, the obtained running times were 0.07 seconds (standard deviation=0.12) for SVD and 1.13 seconds (standard deviation=0.86) for LSA. This considered an average of 186 dimensions for the dictionary (W) and 88 dimensions for the sentences premised on the semantic vectors built by LSA. All in all, setting tests showed that the running times for the SVD and dimensional reduction tasks are not meaningful for this definition QA system.

Accordingly, [Figueroa and Neumann, 2007, Figueroa et al., 2009] make use of D̂, the greatest three eigenvalues of D, and the corresponding three vectors Û and V̂ for constructing the semantic space as R = Û D̂^2 Û'. Then, [Figueroa and Neumann, 2007, Figueroa et al., 2009] preferred the dot product to the traditional cosine for measuring the semantic relatedness R(wi, wj) = ûi D̂^2 ûj' (ûi, ûj ∈ Û) of two terms wi, wj ∈ W. The major reasons are (a) it was noticed empirically that, because of the size of web snippets (texts shorter than 200 words), the cosine draws an unclear distinction of the semantic neighbourhood of the definiendum, bringing about spurious inferences [Wiemer-Hastings and Zipitria, 2001], and (b) the length of vectors was found to draw a clearer distinction of the semantic neighbourhood of the definiendum, as this biases R in favour of contextual terms, which LSA knows better [Deerwester et al., 1990].
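As a rough illustration of this construction, the sketch below builds the reduced semantic space with plain numpy. The toy matrix, the omission of TF-IDF weighting and the fixed k = 3 are simplifying assumptions.

```python
import numpy as np

def lsa_relatedness(M, k=3):
    """Reduced semantic space R = U_k D_k^2 U_k' from a term-sentence matrix M
    (terms x sentences, with the definiendum added as a pseudo-sentence).
    R[i, j] is the dot-product relatedness of terms i and j."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    Uk, Dk2 = U[:, :k], np.diag(s[:k] ** 2)
    return Uk @ Dk2 @ Uk.T

# Toy, unweighted 5-term x 4-sentence matrix (TF-IDF weighting omitted here)
M = np.array([[1., 0., 0., 1.],
              [0., 1., 1., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [1., 0., 1., 0.]])
R = lsa_relatedness(M, k=3)
print(R[0, 2])   # relatedness of term 0 and term 2
```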
In the multi-dimensional semantic space generated by LSA, the neighbourhood of a particular word wi provides its context [Deerwester et al., 1990, Kintsch, 1998], and consequently its correct meaning, for instance by pruning inappropriate senses [Kintsch, 1998]. Similarly, the definiendum is also a term defined by its neighbourhood in this semantic space. Hence the web context miner singles out an array of the words most closely connected to the definiendum, that is, terms that are likely to specify its meaning. Different experiments were set up in order to adjust the optimum size of the set of linked words from five to fifty. The best results in terms of semantic closeness and inter-cluster distance were observed for groups of forty words.

As an outcome of the relaxed pattern matching performed by the PATTERN MATCHER, the system also accounts for all n-grams that are substrings of the definiendum δ, as some internal n-grams δ+ ∈ W are more likely to occur within descriptive sentences (i.e., names or surnames are more frequent than their corresponding full names). In the previous example, “Hanks" has a higher frequency than “Tom Hanks". The SNIPPET CONTEXT MINER therefore considers an array W̄ compounded of the forty highest pairs {wi, Rmax(δ, wi)}, where Rmax(δ, wi) = max_{δ+∈W} R(δ+, wi). The miner eventually normalises the terms in W̄ based on the following computation:

R̂(δ, wi) = Rmax(δ, wi) / Σ_{wj ∈ W̄} Rmax(δ, wj)        ∀ wi ∈ W̄

5.2.2 Sense Disambiguator

One of the difficulties with targeting massive collections of documents is term ambiguity. In the case of definition QA systems, this obstacle translates into the need for clustering matched sentences in accordance with their respective potential senses. For the purpose of resolving this ambiguity problem so as to detect the correct sense, [Figueroa and Neumann, 2007, Figueroa et al., 2009] devised a sense disambiguation component.

This module is geared towards resolving λ distinct senses of the definiendum existing in A by discovering a set of uncorrelated words. Let Φ be the term-sentence matrix, where a cell Φis = 1 if the term wi ∈ W̄ occurs within the descriptive phrase As (zero, otherwise). The correlation amongst words is then given by the dot product Φ̂ = ΦΦ'. This dot product between two row vectors of Φ reflects the extent to which two terms have a similar pattern of occurrence across the array of sentences. To exemplify, for the words in W̄: w1 = “Academy", w2 = “actor" and w3 = “seismologist", the computed values of Φ and Φ̂ are shown below:

             A1 A2 A3                  w1 w2 w3
        w1 [  1  0  0 ]           w1 [  1  1  0 ]
    Φ = w2 [  1  1  0 ]       Φ̂ = w2 [  1  2  0 ]
        w3 [  0  0  1 ]           w3 [  0  0  1 ]

Let W^λ be the set of sense markers, that is, the set of terms in W̄ that signal a potential sense of the definiendum. This array W^λ is built iteratively by choosing the word with the highest correlation count with terms in W̄ − W^λ, regardless of the frequency. Initially, W^λ starts as empty, and candidates in W̄ − W^λ are forced to fulfil the constraint of no linkage or correlation with an already selected term in W^λ. The iterations finish when there is no candidate in W̄ that has a correlation count higher than zero.

In the example, “Academy" and “actor" co-occur in one sentence, and “seismologist" does not; the corresponding values of the correlation counts therefore become two for w1 and w2, and one for w3. As a consequence, the SENSE DISAMBIGUATOR adds w2 = “actor" to W^λ, because a random term is selected whenever it is necessary to break ties. In the next step, due to the fact that w2 was randomly picked, the correlation count is equal to one for the three words in the next cycle. Nonetheless, w3 = “seismologist" is chosen, because the other two terms are correlated with the element in W^λ. Eventually, W^λ includes “actor" and “seismologist".

Since words that indicate the same sense co-occur, term vectors build up an orthonormal basis in which each direction becomes a distinct potential sense. Thus, each sentence is associated with its corresponding sense marker and assigned to one cluster Cλ. Sentences that were not directly or indirectly (via correlation) connected with an element in W^λ are grouped in a special cluster C0. Accordingly, the outcomes for the working example are: C0 = ∅, C1 = {A1, A2} and C2 = {A3}.

Furthermore, the SENSE DISAMBIGUATOR will attempt to reassign each sentence in C0 by searching for the strongest correlation between its entities and the entities1 of a cluster Cλ. For the above example, the sentence “Thomas “Tom" Jeffrey Hanks was a school actor in the Skyline High School in Oakland, California." would be attached to C1. In a statement, this discrimination strategy is simple, fast, and works under the assumption that distinct potential senses, or at least the most dissonant and predominant ones, will be typified by a limited array of words that can form separate partitions in the semantic space.
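The greedy selection of sense markers can be summarised with the following sketch. The reading of “correlation count" as the number of still-available terms a candidate co-occurs with, and the deterministic tie-breaking, are assumptions made for this illustration.

```python
import numpy as np

def select_sense_markers(phi, terms):
    """Greedy selection of uncorrelated sense markers from a binary
    term-sentence matrix phi (rows ordered as in `terms`)."""
    corr = phi @ phi.T                      # term-term co-occurrence counts
    selected, remaining = [], list(range(len(terms)))
    while True:
        # a candidate must not correlate with any already selected marker
        candidates = [i for i in remaining
                      if all(corr[i, j] == 0 for j in selected)]
        # correlation count: how many still-available terms it links to
        counts = {i: int(np.count_nonzero(corr[i, remaining])) for i in candidates}
        best = max(counts, key=counts.get, default=None)
        if best is None or counts[best] == 0:
            break
        selected.append(best)
        remaining.remove(best)
    return [terms[i] for i in selected]

# Working example: Academy occurs in A1, actor in A1 and A2, seismologist in A3
phi = np.array([[1, 0, 0],
                [1, 1, 0],
                [0, 0, 1]])
print(select_sense_markers(phi, ["Academy", "actor", "seismologist"]))
# Ties are broken deterministically here, so "Academy" is picked where the
# thesis' random tie-breaking picked "actor"; either way two senses emerge.
```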

5.2.3 Definition Ranker

A DEFINITION RANKER produces an ordered sequence of extracted definitions. Let N(As) be a function that returns the normalised nuggets connected with As, and WN the array of terms of all normalised nuggets. Ergo, Pi is defined as the probability of finding a word wi ∈ WN, and it is arbitrarily set to zero for all stop-words; WN(As) is the group of terms in N(As). For our working example, the set of ranked words becomes:

• WN(A1)= [ACADEMY, AWARD, WINNING, ACTOR]

• WN(A2)= [ACTOR, BORN, IN, 1959, CALIFORNIA]

• WN(A3)= [AMERICAN, SEISMOLOGIST]

In total, this illustrative array of candidates encircles eleven distinct tokens, where the stop-word “in" occurs twice in the second example. The values of Pi for each wi are as follows:

[ACADEMY, 1/12], [AWARD, 1/12], [WINNING, 1/12], [ACTOR, 2/12], [BORN, 1/12], [IN, 0], [1959, 1/12], [CALIFORNIA, 1/12], [AMERICAN, 1/12], [SEISMOLOGIST, 1/12]

For each cluster Cλ, the definition ranker incrementally computes a set of its sentences Aλ that maximises the relative novelty premised on their coverage and content of the nuggets as:

max_{As ∈ Cλ} coverage(As) + content(As)

In this equation, coverage models the likelihood that novel terms within a normalised sentence N(As) belong to a description. As a means of only considering novel words, the iterative process accounts for a cache of terms, which keeps track of all words with respect to previously selected sentences in the cluster Cλ. This measure shares the same spirit with other scoring methods, including the strategy of the best system in TREC 2006 [Kaisser et al., 2006] and the definition QA system proposed by [Schlaefer et al., 2007] (for details, see section 4.12 on page 102). The crucial difference is due to the decay factors. [Kaisser et al., 2006] systematically lowered the contribution of a term to the score of an answer candidate; this decrease was in tandem with the number of times the word had already been included in selected answers, whereas [Figueroa and Neumann, 2007, Figueroa et al., 2009] suppressed the contribution of these terms altogether (a minimal sketch of this selection loop is given at the end of this subsection). Of course, the latter is a special case of the

1At this step, numbers and words that start with a capital letter were interpreted as entities. This way, the system was prevented from using ad-hoc linguistic tools.

methodology of [Kaisser et al., 2006] when utilising decay factors of 100%. The idea behind all these rating methods is that they are all in concert with the second criterion of [H. Joho and M. Sanderson, 2000, 2001] (see section 4.2 on page 76).

The decay factors play the role of redundancy controllers: the slower the contribution is diminished, the higher the degree of redundancy the output will contain. In this statement, redundancy is understood at the word level. In the case of the Web, a faster decay is therefore indispensable, because the corpus already supplies redundant descriptions in terms of paraphrases and almost duplicate sentences. Conversely, in a considerably smaller corpus, such as the AQUAINT (the target corpus of the system designed by [Kaisser et al., 2006]), a slower rate is needed, because a pair of answer candidates that share a substantial amount of words can still elucidate different facets of the definiendum. A faster decay rate would increase the probability of missing some of these novel facets whenever they are not included in another relatively novel putative answer.

On the other hand, content discriminates the degree to which N(As) expresses definition facets of the definiendum on the grounds of highly close semantic terms and entities. This is calculated by summing up the semantic relationship between terms within the corresponding nuggets and the essentiality of novel entities. Each novel entity e is weighed in congruence with its probability Pe^λ of being in the normalised nuggets of Cλ. Content stresses the relevance of entities when scoring candidate sentences to a definition question, due to the likelihood that they are signalling a pertinent relationship with the definiendum. Other techniques also postulate the importance of entities (and also the underlying relations), for example the best system in TREC 2003 (section 3.2 on page 58), as well as [Roussinov et al., 2004, 2005] (section 4.6 on page 88) and [Gaizauskas et al., 2004] (section 4.13 on page 104).

Eventually, [Figueroa and Neumann, 2007, Figueroa et al., 2009] iteratively incorporated a new answer into the final output until the ranking score was lower than an experimental threshold (0.1). On the whole, sentences are rated in consonance with the order they are inserted. This means higher ranked sentences are more diverse, less redundant, and more likely to embrace entities together with terms that describe aspects of the definiendum. In juxtaposition, other techniques eliminate a random candidate whenever a pair shares a pre-determined amount of words [Hildebrandt et al., 2004]. This sort of method incurs the risk of discarding

crucial descriptive knowledge expressed in quite similar constructions. Take as an example:

1.

2.
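As announced above, here is a minimal sketch of the coverage-plus-content selection loop. The data layout is assumed, and the content ingredient, which the thesis derives from LSA relatedness and entity probabilities, is reduced to a pluggable stub.

```python
def rank_cluster(cluster, word_prob, content_fn=lambda sentence: 0.0,
                 threshold=0.1):
    """Greedy selection of answers for one sense cluster.

    cluster    -- list of (sentence, nugget_words) pairs
    word_prob  -- P_i for every word in the normalised nuggets (stop-words 0)
    content_fn -- stand-in for the LSA/entity 'content' ingredient
    threshold  -- stop once the best score drops below 0.1
    """
    cache, output = set(), []     # the cache realises the 100% decay factor
    remaining = list(cluster)
    while remaining:
        def score(item):
            sentence, words = item
            coverage = sum(word_prob.get(w, 0.0) for w in set(words) - cache)
            return coverage + content_fn(sentence)
        best = max(remaining, key=score)
        if score(best) < threshold:
            break
        output.append(best[0])
        cache.update(best[1])     # later candidates gain nothing from these words
        remaining.remove(best)
    return output
```

The cache implements the 100% decay discussed above: once a word has contributed to a selected answer, it contributes nothing to the coverage of later candidates.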

5.2.4 Experiments and Results

For the purpose of assessing2 the performance of this multilingual definition QA system, two kinds of criteria were used: a BASELINE-oriented comparison which uses results from various TREC and Cross Language Evaluation Forum (CLEF) evaluations, and a comparison with other existing approaches to definition QA in TREC and CLEF. For the first assessment, five question sets were used, from (1) TREC 2001, (2) TREC 2003, (3) CLEF 2004, (4) CLEF 2005, and (5) CLEF 2006.

2Throughout this section, ± stands for standard deviation, and CLEF data-sets include all English translations from all languages.

Experiments - English

A BASELINE was implemented for which 300 surrogates were downloaded by submitting the quoted definiendum. The baseline splits snippets into sentences and accounts for the same battery of constructs exploited by the PATTERN MATCHER, in conjunction with a strict matching of the definiendum. In addition, a random sentence from a pair that shares more than 60% of its terms, as well as sentences that are a substring of another sentence, were expunged [Hildebrandt et al., 2004].

Corpus   TQ        Baseline                           M-DefWebQA
               NAQ   NS              Accuracy     NAQ   NS             Accuracy      AS (%)
(1)      133    81   7.35 ± 6.89     0.87 ± 0.2   133   18.98 ± 5.17   0.94 ± 0.07   16 ± 20
(2)       50    38   7.7 ± 7.0       0.74 ± 0.2    50   14.14 ± 5.3    0.78 ± 0.16    5 ± 9
(3)       86    67   5.47 ± 4.24     0.83 ± 0.19   78   13.91 ± 6.25   0.85 ± 0.14    5 ± 9
(4)      185   160   11.08 ± 13.28   0.84 ± 0.2   173   13.86 ± 7.24   0.89 ± 0.15    4 ± 11
(5)      152   102   5.43 ± 5.85     0.85 ± 0.22  136   13.13 ± 6.56   0.86 ± 0.16    8 ± 14

Table 5.1: Results overview (source [Figueroa and Neumann, 2007, Figueroa et al., 2009]).

The coverage of the BASELINE and of this definition QA system can be seen in table 5.1. In these figures, NAQ stands for the number of queries for which the answers embodied at least one nugget, and TQ is the total amount of questions within the question set. In light of the outcomes underscored in this table, [Figueroa and Neumann, 2007, Figueroa et al., 2009] concluded two things:

1. Contrary to the observation of [Cui et al., 2004c], empirical results support the fact that web snippets can be a fruitful source of descriptive information. Specifically, [Cui et al., 2004c] found nuggets for only 42 queries by using external dictionaries in combination with these surrogates (see details in section 4.9 on page 96), whereas M-DEFWEBQA found descriptive content for all questions in (2). To be more exact, [Cui et al., 2004c] claimed that web snippets only minutely enhance the performance. Although this conclusion relies on different assessments, the figures in table 5.4 yield corroborating evidence of this finding.

2. Further, M-DEFWEBQA discovered nuggets within surrogates for the 133 queries in (1), in contrast to [Miliaraki and Androutsopoulos, 2004], who found a top five ranked snippet that verbalises a definition solely for 116 questions within the top 50 retrieved documents.

Furthermore, this definition QA system extracted short sentences in terms of character length (125.7 ± 44.21 considering white spaces; BASELINE: 118.168 ± 50.2), whereas [Miliaraki and Androutsopoulos, 2004, Androutsopoulos and Galanis, 2005] handled fixed windows of 250 characters. This type of fixed window can trim descriptive content or include too much unnecessary text. Conversely, sentences found by this definition QA system are 109.74 ± 42.15 (BASELINE: 97.81 ± 41.8) characters long without considering white spaces, which is comparatively longer than the 100-character text fragments of [Hildebrandt et al., 2004]. This ratifies the competitiveness of sentences harvested from web snippets as answer units. Table 5.2 depicts a snapshot of the output of this definition QA system.

Overall, the system covered 94% of the queries, whereas the BASELINE did so for 74%. This difference may be due mainly to the query re-writing step and the flexible matching of the

Sense Markers   Outputted Answers
STRANGE         • In epilepsy, the normal pattern of neuronal activity becomes disturbed, causing strange.
SEIZURES        • Epilepsy, which is found in the Alaskan malamute, is the occurrence of repeated seizures.
                • Epilepsy is a disorder characterized by recurring seizures, which are caused by electrical
                  disturbances in the nerve cells in a section of the brain.
                • Temporal lobe epilepsy is a form of epilepsy, a chronic neurological condition characterized
                  by recurrent seizures.
ORGANIZATION    • The Epilepsy Foundation is a national, charitable organization, founded in 1968 as the
                  Epilepsy Foundation of America.
NERVOUS         • Epilepsy is an ongoing disorder of the nervous system that produces sudden, intense bursts
                  of electrical activity in the brain.

Table 5.2: Sample output for the query “What is epilepsy?" (adapted from [Figueroa et al., 2009]).

definiendum. For all the questions for which this system and the BASELINE extracted at least one nugget, the accuracy and the average number of sentences (NS) were computed. In this respect, the system doubled the amount of sentences and improved the performance with respect to accuracy. Interestingly enough, this observed growth in accuracy is in agreement with the finding of [H. Joho and M. Sanderson, 2000, 2001] (see section 4.2 on page 76), that is, redundancy aids in deriving better statistics and thus in determining the most dependable sentences that align a pack of definition patterns.

The assessment in table 5.1 also took into account the proportion of sentences within NS for which the relaxed matching shifted the definiendum to another concept that brought about interesting descriptive sentences (AS). For example, “neuropathy" was shifted to “peripheral neuropathy" and “diabetic neuropathy" (refer to section 3.3 on page 59 for more details on this issue). In juxtaposition, unrelated sentences eventuated from some shifts (e.g., “G7" to “Powershot G7").

As to the evaluation of this system against a gold standard, [Figueroa and Neumann, 2007, Figueroa et al., 2009] benefited from the list of the assessors produced for the TREC 2003 data. Following the approach by [Voorhees, 2003], this system finished with 0.61 ± 0.33 for recall and 0.18 ± 0.13 for precision, whereas for the BASELINE these figures were 0.35 ± 0.34 and 0.30 ± 0.26, respectively. A higher recall of 0.61 ± 0.33 suggests that the additional sentences selected by this method bore more nuggets that are seen as key ones on the list of the assessors. The high recall also stresses the essential role that web snippets can play for a definition QA system that discovers answers in the AQUAINT corpus.

It is well known that systems in TREC are capable of finding valid nuggets which may not be judged as pertinent in the list [Hildebrandt et al., 2004]. This is even more probable for Web-based systems, as these discover many extra text fragments that are regarded as relevant by a user, but excluded from the list of the assessor. In the definition QA system of this section, this is a crucial issue, as it significantly raises the number of selected descriptive sentences per question (see table 5.1) and so the length of the response. On the other hand, there are text fragments neither included in the list of the assessor nor in the final output, but they can still be seen as relevant by a particular user. These nuggets should be detected, and these losses should thus be taken into consideration when calculating the F(β)-Score. Solving these problems, however, involves designing an entirely new ground truth (refer to section 1.7 on page 15 for a discussion in detail).
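For readers who want to reproduce the comparisons that follow, this is a hedged sketch of the usual TREC-style nugget F(β) computation (length allowance of 100 non-whitespace characters per returned nugget); it reflects the commonly cited scoring scheme rather than a formula quoted from this chapter, and section 1.7 remains the authoritative description.

```python
def nugget_f_score(vital_returned, vital_total, okay_returned,
                   answer_length, beta=5.0):
    """Nugget recall over vital nuggets; precision via a length allowance of
    100 non-whitespace characters per returned (vital or okay) nugget."""
    recall = vital_returned / vital_total if vital_total else 0.0
    allowance = 100 * (vital_returned + okay_returned)
    precision = 1.0 if answer_length <= allowance else \
        1.0 - (answer_length - allowance) / answer_length
    denom = beta ** 2 * precision + recall
    return (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
```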

β             1      2      3      4      5
M-DEFWEBQA    0.26   0.37   0.45   0.50   0.53
BASELINE      0.26   0.30   0.32   0.32   0.34

Table 5.3: TREC 2003 average F(β)-Scores (source [Figueroa and Neumann, 2007]).

Table 5.3 parallels the outcomes by computing the F(β)-Scores with different values of β, that is, by weighing recall and precision differently. The outcomes of running this system, contrasted with the best seven definitional QA systems in TREC 2003, can be seen in table 5.4.

Definition QA System                           F(5)    Average Length
BBN (see section 4.16 on page 107)             0.555   2059.20
M-DEFWEBQA                                     0.53    1878
National University of Singapore               0.473   1478.74
University of Southern California, ISI         0.461   1404.78
Language Computer Corp                         0.442   1407.82
BASELINE                                       0.34    583
University of Colorado/Columbia University     0.338   1685.60
ITC-irst                                       0.318   431.26

Table 5.4: Average F(β)-Scores for the TREC 2003 definition queries subtask for the best systems (source [Figueroa et al., 2009]).

In contrast to other definitional QA systems, this approach would have accomplished an F(β)-Score of 0.53 (2nd place), which is very competitive with the best systems, which achieved values between 0.33 and 0.55 (see a snapshot of responses to a TREC 2003 question in table 5.5). Although the approaches are not directly comparable, as they extracted answers from the AQUAINT corpus whereas this system did so from the Internet, the comparison is still reasonably fair. Remarkably, these figures show that web snippets offer coverage for a good amount of the nuggets subsumed in this ground truth.

Sense Markers   Outputted Answers
SMITH           • Smith Akbar, the Great Mogul (1542-1605), Clarendon Press, 1919.
KING            • Akbar the great was the next king of from Mughals (1556-1605).
EMPIRE          • A royal chronicle tells how Akbar the Great, who ruled India's Mogul Empire in the
                  A. D. 1500's, captured at least 9,000 cheetahs during his 49-year reign to aid him in
                  hunting deer.
                • Akbar the Great was a 16 th Century ruler of the Mogul Empire.
EMPEROR         • 1556 Akbar the Great becomes Mogul; emperor of India, conquers Afghanistan (1581),
                  continues wars of conquest (until 1605)

Table 5.5: Sample output for the query “Who is Akbar the Great?" (source [Figueroa et al., 2009]).

It is also notable that for some definienda, such as “Akbar the Great", “Albert Ghiorso" and “Niels Bohr", the BASELINE obtained a better F(β)-Score. This means that it extracted an answer closer to the list of the assessors.

Figure 5.2: Φ̂ij > 1 for “Jim Clark" (source [Figueroa and Neumann, 2007, Figueroa et al., 2009]).

Nevertheless, a distinctive novel component of this definitional model is the incorporation of a sense disambiguation procedure. This was capable of distinguishing (and resolving) different potential senses for some definienda (e.g., for “atom", the particle sense and the format sense). Some snapshots of the output of this sense disambiguation model can be seen in tables 5.2 and 5.5. On the other hand, some senses were observed to split into two separate senses, e.g., “Akbar the Great", where “emperor" and “empire" indicated different senses. This misinterpretation may be due to an independent co-occurrence of “emperor" and “empire" with the definiendum. As a means of improving this, external sources of knowledge may be necessary. Nonetheless, this may become a very hard task [Chen et al., 2006], as some definienda can be extremely ambiguous (e.g., “Jim Clark" refers to several real-world entities). While this sense disambiguator can differentiate between the photographer, the pilot and the Netscape creator (see figure 5.2), numerous executives named “Jim Clark" are grouped in the same cluster. In addition, entities and the correlation of highly close terms in the semantic space supplied by LSA can be two building blocks of a more sophisticated methodology for the disambiguation of the definiendum.

Another distinctive attribute of this approach is the fact that the DEFINITION MINER component avoids coping with specialised wrappers or downloading full documents.

Experiments - Spanish

Two baselines were designed for assessing the performance for definition queries in Spanish. Both BASELINE ES-I and BASELINE ES-II perform the same processing as the BASELINE implemented for English, but they download 420 surrogates3. These two baselines also differ from the baseline utilised for English in the number of terms that two sentences must share to be counted as redundant. Similarly to the criteria used to set the decay factors, they account for a threshold of 90% instead of 60%, because the coverage of the web space for Spanish is smaller than for English, and some pertinent nuggets would otherwise be discarded together with the redundant content (see also section 2.7 on page 50). The difference between the Spanish baselines is that

3The amount of retrieved snippets is balanced in order to make all systems fetch the same maximum number of hits (see also section 2.7 on page 50).

                  with white spaces   without white spaces
Baseline ES-I      98.11 ± 44.90       81.06 ± 37.69
Baseline ES-II    104.98 ± 36.43       85.88 ± 29.87
M-DefWebQA        135.78 ± 45.21      113.70 ± 37.97

Table 5.6: Length of output sentences (Spanish) (source [Figueroa and Neumann, 2007]).

BASELINE ES-I is directed at the regularities in table 3.2 (page 66), whereas BASELINE ES-II is aimed at the patterns in [Montes-y-Gómez et al., 2005].

Given the lengths of the outputs of BASELINE ES-I and this multilingual system (see table 5.6), it can be concluded that the increment indicates that this system outputs more complete sentences, lessening the effects of intentional breaks in web snippets. Due to the acceptable length of descriptive sentences and the fact that many nuggets seem odd without their context [Hildebrandt et al., 2004], this multilingual definition QA system outputs sentences in place of only nuggets.

As a very rough rule of thumb, the degree of redundancy of a sentence As was approximated at the word level by looking for a sentence As' in the same response that shares the maximum number of terms with As:

redundancy(As) = max_{As' ≠ As} ns(As' ∩ As) / ns(As)

where ns(As) is the number of words in As, excluding stop-words. As a result, BASELINE ES-II generates an output at least twice as redundant as that of this definition QA system, which supplies longer sentences (see table 5.7 below). By and large, this multilingual system outputs comparatively fuller and less redundant sentences.
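A direct transcription of this approximation, assuming whitespace tokenisation and a caller-supplied stop-word list:

```python
def redundancy(sentences, stop_words=frozenset()):
    """Word-level redundancy: for each sentence, the largest fraction of its
    non stop-word terms shared with any other sentence in the response."""
    token_sets = [set(s.lower().split()) - stop_words for s in sentences]
    scores = []
    for i, tokens in enumerate(token_sets):
        if not tokens:
            scores.append(0.0)
            continue
        overlap = max((len(tokens & other)
                       for j, other in enumerate(token_sets) if j != i),
                      default=0)
        scores.append(overlap / len(tokens))
    return scores
```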

                  (1)           (2)           (3)           (4)           (5)
BASELINE ES-I     -             -             -             0.32 ± 0.16   0.38 ± 0.25
BASELINE ES-II    -             -             -             0.54 ± 0.24   0.64 ± 0.39
M-DEFWEBQA        -             -             -             0.25 ± 0.17   0.25 ± 0.16
BASELINE EN-I     0.58 ± 0.26   0.61 ± 0.26   0.57 ± 0.25   0.62 ± 0.25   0.53 ± 0.23
M-DEFWEBQA        0.47 ± 0.18   0.50 ± 0.20   0.45 ± 0.18   0.45 ± 0.17   0.45 ± 0.19

Table 5.7: Redundancy overview (source [Figueroa and Neumann, 2007]).

The coverage of surface patterns for English has been studied widely [Hildebrandt et al., 2004, H. Joho and M. Sanderson, 2001, 2000] (see section 4.2 on page 76). By the same token, table 5.8 shows the amount of descriptive sentences in the final output that align with each regularity in table 3.2 (page 66). Each cell represents the number of matches for the CLEF 2005/2006 corpus, respectively. In this battery of rules, the first construct provides the widest coverage, while the third the most limited. Given the marked growth in the amount of recognised descriptive utterances in the final output, it can be concluded that the query rewriting of M-DEFWEBQA strongly biases the search engines, not only in favour of redundant descriptive sentences, but also in favour of diverse utterances. While redundant sentences are undesirable in the final output, they are useful for distinguishing the most critical and trustworthy descriptive utterances.

In a special manner, [Figueroa and Neumann, 2007] contemplated an entirely different evaluation for each language, for the following reasons:

                  (1)        (2)       (3)     (4)     (5)
BASELINE ES-I     78/37      17/10     00/00   13/10   05/03
M-DEFWEBQA        470/254    168/95    03/01   59/58   54/36

Table 5.8: Coverage of patterns (source [Figueroa and Neumann, 2007]). Note that the numbers coincide with the order of presentation in table 3.2.

(a) the way the performance of definition QA systems is measured differs between TREC and CLEF, and (b) CLEF gold standards for definition questions supply only one nugget regarding abbreviations or positions of persons, whereas TREC 2003 provides a group of relevant nuggets.

       Baseline ES-I   Baseline ES-II   M-DefWebQA
(4)    11              33               32
(5)     9              12               22

Table 5.9: Ground truths (source [Figueroa and Neumann, 2007]).

As for Spanish, this multilingual definition QA system responded to 32 and 22 of the CLEF 2005 and 2006 queries, respectively (see table 5.9). However, the runs submitted by the best two systems in CLEF 2005 answered 40 out of the 50 definition questions [Vallin et al., 2005, Montes-y-Gómez et al., 2005], while the third-best system only responded to 26 queries. Further, the best system in CLEF 2006 answered 35 out of the 42 definition questions, whereas this system found answers for 22 out of the 35 queries answered by this best system. Unfortunately, the CLEF 2006 gold standard supplies only one nugget for these 35 questions.

Since the coverage of the ground truths focuses solely on abbreviations and positions of people, together with the fact that responses to seven of the CLEF 2006 queries were missed, [Figueroa and Neumann, 2007] assigned three out of five different assessors to each data-set. Each assessor judged whether or not each output sentence rendered descriptive information. A sentence was counted as descriptive if, and only if, at least two out of the three assessors agreed (results in table 5.10). In both data-sets, this multilingual definition QA system outperformed both baselines. To be more specific, it discovered descriptive phrases for 47 out of the 50 CLEF 2005 questions. Further, this system returned more descriptive utterances (NS) with a lower level of redundancy. At any rate, the accuracy of the output sentences worsened in comparison with the English results. As a matter of fact, [Figueroa and Neumann, 2007] conceived this as a consequence of the lower amount of web redundancy for Spanish, which affects the quality of identifying the most pertinent and dependable phrases. Finally, table 5.9 shows that the performance of this system can be ameliorated by aligning the patterns in [Montes-y-Gómez et al., 2005], without necessarily considering them in the query reformulation process.

5.3 Web Snippets and Mining Multilingual Wikipedia Resources

The system outlined in the preceding section takes advantage predominantly of redundancy for finding answers to definition questions within web snippets. This process proved to be competitive in terms of finding the most critical descriptions, while at the same time it supplies a framework for building multilingual systems. This method, however, can fail to find nuggets conveyed only a few times, deteriorating the diversity of the output, although these nuggets might be expressed in terms commonly found across articles in KBs.

Corpus         Baseline ES-I                    Baseline ES-II
        TQ     AQ   NS             Accuracy     AQ   NS              Accuracy
(4)     50     26   2.59 ± 2.45    0.85 ± 0.23  39   10.13 ± 10.66   0.67 ± 0.31
(5)     42     10   3.00 ± 3.13    0.61 ± 0.31  15    3.4 ± 3.31     0.65 ± 0.26

Corpus         This system
        TQ     AQ   NS             Accuracy
(4)     50     47   8.6 ± 4.85     0.63 ± 0.19
(5)     42     30   7.27 ± 6.76    0.67 ± 0.25

Table 5.10: Results overview. (TQ = Total number of questions in the question-set) (source [Figueroa and Neumann, 2007])

An alternative approach is profiting from multilingual KBs, like Wikipedia, in such a way that it is transparent, or almost transparent, to the system. In so doing, [Figueroa, 2008b] extended M-DEFWEBQA by enriching two modules with evidence supplied by Wikipedia: the DEFINITION MINER and the DEFINITION RANKER. This evidence is used in a way that is independent of the definiendum when rating, and ergo the system remains less sensitive to the sharp variations in KB coverage. This is a substantial difference from the trend of definition QA systems described in chapter 4.

Figure 5.3 illustrates the new architecture. In the first place, the difference between both DEFINITION MINERs is that this new one accounts for the focused search presented in section 2.6.2 (page 47) instead of the methods in sections 2.5 on page 36 (English) and 2.7 on page 50 (Spanish). It is worth recalling that only Google n-gram counts for English are available to this system; it therefore proceeds, in the case of Spanish, as when there is no n-gram evidence for English. In the second place, the real difference between both systems stems from the DEFINITION RANKER. While the old strategy discriminates answers on the grounds of redundancy, this new scoring function additionally rates answer candidates in congruence with templates and tuples (pairs and triplets) learnt from Wikipedia.

It is worth duly pointing out here that some systems in CLEF have already been systematically using Wikipedia for answering queries in Spanish: [de Pablo-Sánchez et al., 2006, 2007, Martínez-González et al., 2008]. The CLEF challenge, however, does not share the same view as the TREC standard. In the context of CLEF, a short response suffices, whereas in TREC a more complex response is required (see section 1.7 on page 15). For this reason, those systems capitalise on other types of strategies. Nonetheless, the succeeding assessments stick to the TREC viewpoint of this task.

5.3.1 Learning Templates and Tuples from Wikipedia

To begin with, [Figueroa, 2008b] extracted sentences that match the regularities in tables 3.1 on page 65 (English) and 3.2 on page 66 (Spanish) from the abstracts of Wikipedia. Secondly, entities are replaced with a placeholder (#). These entities are discriminated on the grounds of word sequences that begin with a capital letter, and a named entity recogniser4. In brief, this process resulted in 1,900,642 different abstractions for the English language, and in 527,185 for Spanish. Thirdly, bigrams to decagrams are obtained from the definition part of these modified first sentences. These resulting n-grams are called templates, and only templates that start at any of the first four words are considered. Lastly, a histogram of templates is built (see table 5.11), and templates with a frequency lower than six are eliminated. Accordingly, the

4For this purpose, the Stanford Named Entity Recogniser (NER) was used, which is available at nlp.stanford.edu/software/CRF-NER.shtml. In the case of Spanish, only sequences of capital letters were used as a representation of entities.

Figure 5.3: Framework for answering definition questions in several languages using Wikipedia resources.

initial diminution in variation, which stems from the replacement of named entities by a placeholder, aids in obtaining more reliable template counts.

English                                                Spanish
Template                             len(t)  freq(t)   Template                                      len(t)  freq(t)
is a species of                      4       34878     y comuna francesa en la region de #           8       3330
member of the #                      4       23351     es una comuna y poblacion de # en la region   10      3310
a # politician                       3       13422     es un municipio de la                         5       2976
a municipality in the district of #  7       8922      es un politico                                3       1471
is a # politician                    4       4776      un club de futbol                             4       1452
is a # politician and the            6       162
is a # politician who is currently   7       18

Table 5.11: Sample interesting templates (source [Figueroa, 2008b]).

The basic idea behind this off-line repository is that these templates are not only highly likely to indicate definitions, but also to start these descriptions. Take, for instance, the following

two definitions gathered from web snippets:

In these examples, “is a British politician" and “is a German politician" match the relatively frequent template “is a # politician" (see table 5.11), and this consequently helps in distinguishing these descriptive phrases without needing to check whether or not an entry exists in a specific resource.
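A rough sketch of the template-repository construction follows. The naive capitalised-sequence abstraction stands in for the Stanford NER step, and the input is assumed to be the definition part of each matched abstract, with the definiendum already stripped.

```python
import re
from collections import Counter

def build_template_histogram(definition_parts, min_freq=6):
    """Entities (naively: capitalised word sequences) become '#', then every
    2- to 10-gram starting within the first four tokens is counted and rare
    templates (freq < min_freq) are dropped."""
    histogram = Counter()
    for text in definition_parts:
        abstracted = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)*\b", "#", text)
        tokens = abstracted.split()
        for start in range(min(4, len(tokens))):
            for n in range(2, 11):
                if start + n <= len(tokens):
                    histogram[" ".join(tokens[start:start + n])] += 1
    return Counter({t: f for t, f in histogram.items() if f >= min_freq})

# Hypothetical usage over the definition part of matched abstracts:
templates = build_template_histogram(
    ["is a British politician who served as Prime Minister"] * 6)
print(templates.most_common(3))   # e.g. ('is a # politician', 6) among them
```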

5.3.2 Definition Tuples Repository

Fundamentally, [Church and Hanks, 1990] inferred word association norms directly from unstructured natural language text. They proposed a measurement, named the association ratio, predicated on the idea of mutual information. The association ratio (I2) between two words w1 and w2 is defined as:

I_2(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}    (5.1)

This ratio juxtaposes the likelihood of observing w_1 followed by w_2 within a fixed window of k words with the probabilities of observing w_1 and w_2 independently. This ratio differs from mutual information in the encoded linear precedence, and captures some lexico-syntactic regularities in the target corpus [Church and Hanks, 1990]. For the remainder of this section, this ratio is worked out on the descriptions by making allowances for a window size of ten, and the probabilities are estimated as explicated in [Church and Hanks, 1990]. Since this ratio becomes unstable when counts are very small, word pairs with a frequency lower than six were expunged, like in [Church and Hanks, 1990]. In addition, pairs solely comprising stop-words5 were also filtered out. Under the underpinning assumption that relevant pairs will exhibit a joint probability larger than the product of the probability of finding them by chance, this word association ratio is extended to triples as follows:

I_3(w_1, w_2, w_3) = \log_2 \frac{P(w_1, w_2, w_3)}{P(w_1)\,P(w_2)\,P(w_3)}    (5.2)

Like [Church and Hanks, 1990], [Figueroa, 2008b] noticed that the larger the ratio, the more credible the results it computes. Inversely, the values become less interesting as the ratio approaches zero. Negative ratios are rare, but possible, and [Church and Hanks, 1990] suggest that they indicate a complementary relationship. Simply put, this ratio supplies an efficient way of identifying some semantic and lexico-syntactic relations. To exemplify, table 5.12 emphasises some interesting tuples pertaining to the word w∗= “politician" (freq(w∗) = 32,306). Some of these tuples can assist in identifying working descriptive phrases. In short, these tuples and their respective norms are distilled from the previous 1,900,642 generalisations.
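As an informal illustration, the pair ratio can be estimated from co-occurrence counts within the fixed window, roughly along the lines of [Church and Hanks, 1990]; the sketch below uses simple relative frequencies as probability estimates, and every name is illustrative.

import math
from collections import Counter

WINDOW = 10      # window size used in this section
MIN_FREQ = 6     # unstable low counts are expunged

def association_ratios(sentences):
    unigrams, pairs = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        for i, w1 in enumerate(tokens):
            # w1 followed by w2 within the fixed window
            for w2 in tokens[i + 1:i + WINDOW]:
                pairs[(w1, w2)] += 1
    n = float(sum(unigrams.values()))          # corpus size as normaliser
    ratios = {}
    for (w1, w2), c in pairs.items():
        if c >= MIN_FREQ:
            p12, p1, p2 = c / n, unigrams[w1] / n, unigrams[w2] / n
            ratios[(w1, w2)] = math.log2(p12 / (p1 * p2))
    return ratios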

~w = <w1, w2>        I2(~w)     ~w = <w1, w2, w3>       I3(~w)
<w∗, served>          7.07      <w∗, served, #>         33.09
<w∗, diplomat>        7.06      <a, w∗, currently>       7.41
<w∗, currently>       4.33      <w∗, who, currently>     7.14
<w∗, opposition>      4.15      <a, w∗, conservative>    2.93
<w∗, conservative>    3.44      <a, w∗, opposition>      2.71
<w∗, coach>          -0.30      <w∗, the, junior>       -5.08

Table 5.12: Some associations with “politician" (source [Figueroa, 2008b]).

5We use the 319 highly frequent closed-class forms encompassed in: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words.

More to the point, table 5.13 stresses some associations with w∗=“politico" discovered in Spanish. This table highlights a beneficial aspect of these association ratios: they can be deduced for several languages.

~w = <w1, w2>            I2(~w)    ~w = <w1, w2>               I2(~w)    ~w = <w1, w2, w3>              I3(~w)
                         10.54     <w∗, analista>               8.70     <w∗, democristiano, de>        28.16
<w∗, democristiano>       9.75     <w∗, derechista>             8.66     <w∗, catalanista, en>          27.72
<w∗, socialdemocrata>     9.52     <w∗, federalista>            8.64     <w∗, centrista, de>            27.16
<w∗, afiliado>            9.43                                  8.57     <w∗, centro-izquierda, de>     26.26
<w∗, trotskista>          9.34     <w∗, salvadoreño>            8.50     <w∗, afiliado, al>             26.05
<w∗, catalanista>         9.30     <w∗, marxista-leninista>     8.49     <w∗, centrista, fundado>       25.99
<w∗, centro-derecha>      9.17     <w∗, intendente>             8.47     <w∗, socialdemocrata, de>      25.92
<w∗, centro-izquierda>    9.09     <w∗, sindicalista>           8.43     <w∗, centro-derecha, de>       25.75
<w∗, galleguista>         9.00                                  8.43     <w∗, sindicalista, español>    23.43
<w∗, peronista>           8.98     <w∗, militante>              8.32                                    23.29
<w∗, nacionalista>        8.85     <w∗, diplomatico>            8.30                                    22.36
<w∗, conservador>         8.74     <w∗, populista>              8.20     <w∗, guerrillero, integro>     22.02
<w∗, socialista>          8.71     <w∗, comunista>              8.19     <w∗, guerrillero, boliviano>   22.02

Table 5.13: Some strong word association norms with w∗=“politico" (source [Figueroa, 2009]).

5.3.3 Ranking Definitions

First, this component computes a template representation of each answer candidate by replacing sequences of words that start with a capital letter with a placeholder. This representation helps to tackle data sparseness, that is to say, to boost the chances of matching the content learnt from Wikipedia. From now on in this section, in order to avoid confusion, these template representations will be referred to as answer candidates. Second, this ranker sifts from the repository all templates that align with these putative answers, and clusters the matched templates into groups in accordance with their lengths. Let Θ_l be the group containing the matched templates of length l, and f_{max_{Θ_l}} the frequency of its most frequent element. This module subsequently rates an answer candidate A_s in concert with:

R_\Theta(A_s) = \sum_{l=2}^{10} \xi_l \sum_{\forall t \in \Theta_l^{A_s}} \frac{freq(t)}{f_{max_{\Theta_l}}}

In plain words, each answer candidate A_s is ranked in agreement with its matching templates (Θ_l^{A_s} ⊆ Θ_l). This ranking value consists solely of the sum of the respective normalised frequencies (divided by f_{max_{Θ_l}}) weighted by ξ_l. This weight factor favours definitions that match longer templates. Third, this system ranks definitions in congruence with their entities. Taking entities into consideration is vital because entities are defined by their relations with other entities. Here, this DEFINITION RANKER builds a frequency histogram of numbers and tokens that begin with a capital letter. Each definition is then rated by adding the frequencies of the entities it carries. These ranking values are thereafter normalised by dividing by the highest value. Let R_E(A_s) be the normalised value in relation to the definition A_s. The reason for avoiding Named Entity Recognisers (NER) is two-fold: (a) they perform poorly on web snippets, due to truncations, and (b) the aim is to exploit as few linguistic tools and as little knowledge as possible at the time of extracting answers, while systematically increasing the off-line linguistic processing when building the models. In this way the system could, in the future, deal with additional languages by only changing the content of the repositories. It is worth reiterating that other strategies also reflect the importance of entities: the best system in TREC 2003 (section 4.16 on page 107), [Roussinov et al., 2004, 2005] (section 4.6 on page 88), and [Gaizauskas et al., 2004] (section 4.13 on page 104). Fourth, this DEFINITION RANKER constructs a histogram H of pairs and triples ~w from the array of putative answers. Then, it sifts the respective word association ratios from the repository (I_2 and I_3), and normalises these ratios by dividing by the ratio of the highest pair and triple, respectively (yielding \bar{I}_2 and \bar{I}_3). Later, pairs and triples ~w with a frequency equal to one are removed from the histogram H, and this histogram is normalised similarly to the association ratios (\bar{H}). Each definition (answer candidate) A_s is subsequently rated in accordance with the tuples in the repository as follows:

R_I(A_s) = \sum_{\forall \vec{w} \in \vec{W}^{A_s} - \tilde{W}} \bar{I}'_2(\vec{w}) + \bar{I}_3(\vec{w})

Where \tilde{W} includes all tuples belonging to previously selected phrases, and \vec{W}^{A_s} comprises all the tuples extracted from the definition A_s ∈ A. This \tilde{W} assists in rating definitions in conformity with their novelty with respect to the already selected phrases. \bar{I}'_2(\vec{w}) is stipulated as follows:

\bar{I}'_2(\vec{w}) = \begin{cases} \bar{I}_2(\vec{w}) & \text{if } \bar{I}_2(\vec{w}) \neq 0 \\ \bar{H}(\vec{w}) & \text{otherwise} \end{cases}

This factor \bar{I}'_2(\vec{w}) is aimed at harmonising the evidence supplied by tuples seen in Wikipedia with some prominent regularities found in the answer candidate set; that is, it is a mixture of a corpus-based and a redundancy-based measure. Eventually, a sentence is ranked as follows:

R(A_s) = (1 + R_\Theta(A_s) + R_E(A_s)) * R_I(A_s)

In almost the same way as the original system elucidated in section 5.2, the highest rated sentence is chosen and its corresponding tuples are added to a cache. In the next cycles, tuples contained in this cache are not considered when working out the ranking values of the remaining putative answers; this way, sentences carrying novel and promising tuples are preferred over more redundant sentences, whose scores tend to drop as more phrases are singled out. Sentences that obtain a ranking value lower than an experimental threshold (0.1) are discarded. Several values (0 to 0.3) were tried in order to optimise this definition ranker on a subset of development queries. As a rule of thumb, values higher than 0.3 can miss many novel nuggets. On purpose, the ranking function R(A_s) relies more strongly on the tuples mined from Wikipedia than on tuple counts within the downloaded web snippets. This way, it can be checked whether or not they are efficient in distinguishing descriptive expressions.
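A minimal sketch of this greedy selection loop is given below; it assumes that the component scores R_Θ and R_E, as well as a combined tuple score (the normalised \bar{I}'_2 plus \bar{I}_3), have already been computed, and every name is illustrative rather than taken from the original implementation.

def rank_definitions(candidates, tuple_score, threshold=0.1):
    # Each candidate is assumed to expose precomputed r_theta and r_entity
    # values plus the set of tuples (pairs and triples) it contains.
    selected, cache = [], set()
    remaining = list(candidates)
    while remaining:
        def score(c):
            # R_I only rewards tuples not yet covered by previous selections
            r_i = sum(tuple_score(t) for t in c.tuples if t not in cache)
            return (1 + c.r_theta + c.r_entity) * r_i
        best = max(remaining, key=score)
        if score(best) < threshold:
            break
        selected.append(best)
        cache.update(best.tuples)
        remaining.remove(best)
    return selected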

5.3.4 Experiments

In the experiments in section 5.2.4, systems were compared on the basis of the gold standard produced by TREC 2003. This ground truth was configured in order to assess systems targeted at the AQUAINT corpus. Many nuggets in the gold standard, for this reason, are not necessarily subsumed in the retrieved web snippets. Systems, hence, would never be able to get the reward with respect to these matches, materialising a distortion in the evaluation. Systems can, additionally, recognise several nuggets excluded from the TREC ground truth, enlarging the response, and consequently causing a diminishment in terms of precision, and by the same token, F(β)-Score. On the other hand, several nuggets can still exist across fetched web snippets that were undetected by the system; since these are ignored in the gold standard, the system cannot be accordingly punished, substantially inflating its performance. All things considered, a ground truth was created by manually inspecting the web snippets with respect to 189 test queries6 supplied by the TREC 2003/2004/2005 tracks. Table 5.14 outlines the gold standard framed for the TREC 2005 definiendum: “NATO".

ID  Nugget
1   in 1949, 12 members
2   in 1994, 16 members
3   in 2005, 26 members
4   military alliance
5   political organisation
6   founded 1949
7   founded by Belgium, Britain, Canada, Denmark, France, Iceland, Italy, Luxembourg, Netherlands, Norway
8   North Atlantic Treaty Organisation
9   never conducted a military operation
10  no partner relationships

Table 5.14: Ground truth for the definiendum: “NATO".

As a means of making a fair assessment, the evaluation stuck to the most recent standard by using uniform weights for the nuggets [Lin and Demner-Fushman, 2006] (see more details in section 1.7 on page 15). It is important to note here that there was no descriptive information for eleven questions in relation to the TREC 2005 data set. In order to test the efficiency of this new method, the original system (M-DEFWEBQA) was utilised as the BASELINE. For the sake of clarity, this new approach will be referred to as MKB-DWQA.

ξ2       ξ3       ξ4       ξ5        ξ6
0.0528   0.0708   0.0861   0.09407   0.09759
ξ7       ξ8       ξ9       ξ10
0.09916  0.09955  0.09983  0.09989

Table 5.15: Definition ranker parameters (source [Figueroa, 2008b]).

Table 5.15 clarifies the values of the parameters utilised in the experiments. The ξ_l were fixed in accordance with the number of matching templates across a subset of development definition queries. Longer templates are certainly more trustworthy and harder to match, and they are therefore weighted more heavily. Table 5.16 parallels the outcomes achieved by each system and data set. The first interesting observation regards the deterioration observed for the BASELINE when the gold standard was changed. This system previously achieved an average F(3)-Score and F(5)-Score of 0.45 and 0.53 (see table 5.3), respectively. This system, conversely, now reached values of 0.45 and 0.46 for the same data set. This new F(5)-Score value of 0.46 corroborates the good performance of the system (see also table 5.4). Furthermore, the difference in performance arising from the outcomes for the TREC 2005 data set: 2.13% (β = 1), 8.51% (β = 2), 14.89% (β = 3), 14.58% (β = 4), 16.67% (β = 5), signals the relevance of the models inferred from the training material.

6The repository of tuples was built under the exclusion of articles concerning these 189 definienda.

     TREC 2003                   TREC 2004                   TREC 2005
β    Baseline     MKB-DWQA       Baseline     MKB-DWQA       Baseline     MKB-DWQA
1    0.44 ± 0.16  0.42 ± 0.14    0.48 ± 0.13  0.46 ± 0.18    0.47 ± 0.13  0.48 ± 0.2
2    0.44 ± 0.16  0.48 ± 0.13    0.49 ± 0.12  0.51 ± 0.13    0.47 ± 0.17  0.51 ± 0.16
3    0.45 ± 0.17  0.51 ± 0.14    0.5 ± 0.13   0.53 ± 0.13    0.47 ± 0.18  0.54 ± 0.15
4    0.45 ± 0.17  0.53 ± 0.15    0.5 ± 0.13   0.54 ± 0.13    0.48 ± 0.18  0.55 ± 0.16
5    0.46 ± 0.18  0.54 ± 0.16    0.5 ± 0.14   0.55 ± 0.13    0.48 ± 0.18  0.56 ± 0.16

Table 5.16: TREC 2003-2005 results (F(β)-Score) (source [Figueroa, 2008b]).

This increasing difference is reflected in the detection of nuggets low in frequency, as recall is weighted more heavily in tandem with the value of β. In the case of the TREC 2003 data set, MKB-DWQA outperformed the BASELINE in 34 questions (68%), whereas the BASELINE accomplished a higher score for 16 queries (32%). First of all, there was no profound difference in the results per question between β = 3 and β = 5, as shown in figure 5.4. In other words, no sharp variation emerges from this comparison. In 13 questions, MKB-DWQA achieved more than 50% improvement, while in 17 more than 30% and in 27 more than 20%. On the other hand, the performance considerably decreased in ten cases (20%). In the TREC 2004 question set, MKB-DWQA ameliorated the performance for 41 (64%) out of 64 queries, whereas in TREC 2005, MKB-DWQA reaped better results in 37 (49%) out of 75 questions. Given these figures, it can be concluded that the presented method cooperates on distinguishing more nuggets low in frequency.

Figure 5.4: F_{MKB-DWQA} / F_{BASELINE} vs. definiendum (source [Figueroa, 2008b]).

There are two decisive factors which worsen the performance. First, paraphrases that do not share a substantial number of words but basically put into words (almost) the same descriptive information:


The second determining factor derives from the first: two sentences that share many words, but whose few differing terms bring about several meaningful tuples, so that the system interprets one of them as bearing significant novel definition information. To reinforce this point, consider the following two selected definitions:

The change, here, of “that means" to “meaning" brings about the matching of different tuples. Both variants carry the same meaning, but they are seen differently by MKB-DWQA. In light of the achieved results, it can be concluded that the repository of templates and tuples7 can assist in bettering the efficiency and robustness of definition QA systems in English. However, these outcomes cannot be equally extended to Spanish (see the figures in tables 5.18 and 5.19). The reason for this is that Wikipedia supplies about 2,000,000 definition pages in English, while only about 200,000 in Spanish. Therefore, the association ratios derived for Spanish were not as reliable as those for English. Additionally, the number of tuples distilled from Wikipedia for English is (at least) three times larger than for Spanish. Therefore, it is harder to find matches within web snippets. In the next section, a further study on Spanish is conducted. For the sake of clarity, greater details on this evaluation are provided together with the results of the next study.

5.3.5 An extra try for Spanish: Extracting tuples from Dependency Trees

The previous technique, on the one hand, combines evidence yielded by candidate sentences with knowledge supplied by descriptive sentences across Wikipedia articles. There is still, on the other hand, a big question mark about the word association norms sketched in the prior section: extracting pairs and triplets from windows of ten consecutive words starts from the tacit linguistic assumption that lexical dependencies cannot occur between larger spans of words. Intuitively, this problem could be solved by accounting for larger windows, but unfortunately, this would bring about a sharper growth in the number of tuples; to be more exact, in the number of pairs and triplets pertaining to loosely related words. In reality, [Figueroa, 2009] conjectured that this increment would be more prominent than the increment in the number of tuples of closely related words. Another assumption made by these norms is that a relation exists between all words within a given window. This seems to be utterly reasonable when weakly related tuples are discarded by means of an empirical threshold. At any rate, there are also manifold meaningful relationships low in frequency that would be filtered out along with these spurious tuples. This is a burning issue when dealing with a training corpus limited in size, because many essential tuples will obtain a low frequency, and ergo look irrelevant. For the purpose of surmounting these difficulties, a dependency parser is exploited as an oracle8 that supplies the lexical dependencies in a given descriptive sentence. This dependency parser assists in removing the window size and lowering the experimental threshold from six to two. The word association norms are hence computed over pairs and triplets of consecutive words in the dependency paths. Some illustrative examples taken from the dependency trees depicted in figure 5.5 are:

7Available under http://www.dfki.de/∼figueroa/
8FreeLing 2.1 was utilised as a dependency parser for Spanish.

Figure 5.5: Samples of dependency trees obtained for Spanish (source [Figueroa, 2009]).

es→departamento→un
fundado→en→Entity
por→naciones

It is worth noting that dependency paths encapsulate grammatical information about word orderings. However, contrary to the tuples in the previous method, these orderings are not necessarily linear. Since only specific links are taken into account now, the number of tuples declines with respect to the previous model. Table 5.17 lays out this drop.

           Tuples from Fixed Window   Tuples from Dependency Paths      %
Pairs               719,510                     243,286                33.81
Triplets          1,161,743                     215,119                18.52

Table 5.17: Difference in the number of tuples (source [Figueroa, 2009]).

This method utilises the same scoring function, but the word association norms need to be redefined in order to account for tuples originating from the dependency paths:

I_2^*(w_1, w_2) = \log_2 \frac{P_{link}(w_1, w_2)}{P_g(w_1)\,P_d(w_2)}

Where P_g(w_1) and P_d(w_2) are the probabilities that the words w_1 and w_2 are independently the head and the dependent, respectively. Further, P_{link} is the likelihood of finding the word w_1 as the head of w_2. Homologously to the previous word norms, the number of links in the corpus is conceived as the corpus size when calculating the probabilities. Analogously, I_3^*(w_1, w_2, w_3) is stipulated as:

I_3^*(w_1, w_2, w_3) = \log_2 \frac{2 \cdot P_{link}(w_1, w_2, w_3)}{P_g(w_1)\,(P_g(w_2) + P_d(w_2))\,P_g(w_3)}

In a triple, the middle node serves as the dependent of one link and the head of another, and therein lies the average of both likelihoods in the formula.
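A rough sketch of how the dependency-based pair ratio could be estimated from a collection of (head, dependent) links is shown below; it treats the number of links as the corpus size, as stated above, and every name is illustrative.

import math
from collections import Counter

def dependency_association_ratios(links, min_freq=2):
    # links: (head, dependent) pairs harvested from the dependency paths of
    # descriptive sentences; min_freq reflects the lowered threshold of two.
    links = list(links)
    link_counts = Counter(links)
    head_counts = Counter(h for h, _ in links)
    dep_counts = Counter(d for _, d in links)
    n = float(len(links))                 # number of links as corpus size
    ratios = {}
    for (head, dep), c in link_counts.items():
        if c >= min_freq:
            p_link = c / n
            p_g, p_d = head_counts[head] / n, dep_counts[dep] / n
            ratios[(head, dep)] = math.log2(p_link / (p_g * p_d))
    return ratios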

Experiments

In the assessment, [Figueroa, 2009] utilised the 53 definition queries corresponding to the CLEF 2007-2008 Spanish QA tracks. Since the official records could not be accounted for, the 19 and 34 questions9 corresponding to the queries recognised as definitions by the best (INAOE) team in CLEF 2008 and 2007, respectively, were taken into consideration. Even though CLEF data sets were considered, a TREC-style evaluation was performed. Consequently, for each question, the retrieved surrogates were manually inspected as a means of creating a gold standard, like [Voorhees, 2003], and nuggets in the ground truth were also equally weighted. As an example, this gold standard embodies the next three nuggets for the definiendum “odometro" included in CLEF 2007: “indica el valor y la condicion mecanica de un auto", “indica la distancia recorrida", and “se coloca en la rueda".

β           Baseline       MKB-DWQA       MKB-DWQA*
1           0.40 ± 0.24    0.46 ± 0.22    0.39 ± 0.19
2           0.44 ± 0.25    0.48 ± 0.22    0.42 ± 0.20
3           0.46 ± 0.26    0.49 ± 0.23    0.44 ± 0.22
4           0.47 ± 0.27    0.50 ± 0.24    0.45 ± 0.22
5           0.47 ± 0.27    0.50 ± 0.24    0.46 ± 0.23
Precision   0.43 ± 0.31    0.52 ± 0.34    0.41 ± 0.27
Recall      0.48 ± 0.28    0.51 ± 0.25    0.47 ± 0.24

Table 5.18: CLEF 2007 results (F(β)-Score) (source [Figueroa, 2009]).

Tables10 5.18 and 5.19 stress the outcomes accomplished for each question set. In both cases, MKB-DWQA reached the highest recall. This means that tuples emanating from Wikipedia abstracts contributed to identifying additional descriptive information low in frequency, ratifying the finding for English. It is crystal clear that this enhancement was modest, but it is nevertheless mildly encouraging. The achievements are motivating for the next two reasons: (a) the number of descriptive sentences utilised for learning tuples is small, and (b) the frequent use of both genders (masculine and feminine) adversely affects the learning models. In English, most nouns have only one neuter form: “singer", “president" and “writer", while few nouns still bear the gender (e.g., “congressman/congresswoman"). In Spanish, however, most nouns usually indicate the gender: “presidente/presidenta" and “escritor/escritora", whereas few are neuter (e.g., “cantante"). This difference in noun forms is vital when having few training examples, because adjectives must agree with the number and the gender of the noun:

Incidentally, learning these word association norms raises the issue of orthographical variations in Spanish. The meaning of words can substantially change depending on whether or not they are written with their respective orthographic accents. Some simple cases are “corte" and “rio" as well as “ejercito". Spanish speakers, however, are likely to omit the orthographical accent when they write on blogs, web documents, or Wikipedia articles. The reason why they leave out this accent is that it is normally unnecessary, because the context usually yields enough information to readily disambiguate the correct meaning. For this reason, together with the fact that the tuples represent contextual relations of words, tuples were inferred omitting the accent. Another final aspect regarding orthographical variations is misspellings; interchanging “c" with “s", or “v" with “b" is very common in Spanish. Unfortunately, this sort of variation is harder to correct, and it has an effect on the norms, because the “new" term can already exist in the Spanish lexicon.

9We thank the INAOE team for supplying these question sets.
10A star is used to distinguish the system that benefits from the repository of tuples distilled from the dependency trees.

β           Baseline       MKB-DWQA       MKB-DWQA*
1           0.37 ± 0.18    0.32 ± 0.20    0.38 ± 0.29
2           0.43 ± 0.20    0.33 ± 0.18    0.40 ± 0.22
3           0.47 ± 0.22    0.34 ± 0.18    0.41 ± 0.22
4           0.48 ± 0.23    0.35 ± 0.18    0.42 ± 0.22
5           0.50 ± 0.24    0.35 ± 0.18    0.43 ± 0.22
Precision   0.55 ± 0.23    0.35 ± 0.18    0.44 ± 0.23
Recall      0.37 ± 0.21    0.40 ± 0.33    0.38 ± 0.29

Table 5.19: CLEF 2008 Results (F(β)-Score) (source [Figueroa, 2009]).

To a limited extent, this problem can be lessened by means of a morphological analyser, such as FreeLing. However, it is worth remarking that FreeLing does not offer a mapping to a “standard" form for all words (the reader can verify this by trying the given examples). In light of this observation, it can be reasonably deemed that boosting the performance will demand considerable effort. This effort will go into deeper linguistic processing and, at the same time, into collecting a larger set of descriptive sentences, contrary to English, where Wikipedia supplies a considerably larger collection and only one gender is predominantly used. Results obtained by MKB-DWQA* do not reflect a definite improvement in terms of recall. The reason for this is two-fold: (a) important tuples were eliminated when reducing the models, and (b) the dependency paths computed from the candidate set of sentences did not align with the paths in the models. As a means of verifying this conclusion, the average number of matching tuples for the second and third methods was observed: 124 pairs and 57 triplets for MKB-DWQA, but only 20 pairs and 2 triplets for MKB-DWQA*. One can consider two reasons for this mismatch. Firstly, errors in the output of the parser: sentences gathered from Wikipedia are much more well-formed than the occasionally truncated phrases within web snippets. Secondly, longer dependency paths might be indispensable to model the lexical relationships necessary to characterise definitions. However, dealing with these two issues would bring about a significant growth in retrieval and processing times. With regard to precision, results markedly varied from CLEF 2008 to CLEF 2007 and they are thus not conclusive. As a means of drawing interesting conclusions concerning precision, the Mean Average Precision (MAP) (see section 1.7 on page 15) of the top ranked and the top three ranked sentences was computed (accounting for “Precision at one and at three", respectively). Table 5.20 highlights these achievements. The obtained MAP scores show that using the tuples effectively contributes to ameliorating the ranking of the sentences. Essentially, they help to bias the ranking in favour of descriptive sentences that have some lexico-syntactic similarities to sentences in Wikipedia abstracts. A positive aspect of this enhancement in ranking is that the methods are aimed at selecting sentences that yield the most novel and representative content. That is, these three selected sentences are very likely to convey different information, or in the worst case, different paraphrases of the same underlying ideas.

         Baseline   MKB-DWQA   MKB-DWQA*
MAP-1      0.62       0.69       0.65
MAP-3      0.58       0.66       0.62

Table 5.20: MEAN AVERAGE PRECISION (source [Figueroa, 2009]).

This difference can also include several senses. The achieved results hold promise because of the small amount of training sentences. Interestingly enough, the MAP values obtained by MKB-DWQA* lie in the middle between the BASELINE and MKB-DWQA. This pattern in the outcomes is produced by the gradual reduction in the number of matching tuples between the models and the testing sentences. Put differently, the comparatively fewer tuples entailed from the dependency trees generate a relatively modest enhancement in comparison to the larger amount of tuples derived by means of traditional word association norms. Nonetheless, these few matches materialised a tangible improvement, which also suggests that traditional norms are better at tackling data sparseness. One critical facet of definition QA systems, in particular search engines, is the MAP of the top ranked sentence. In this aspect, MKB-DWQA outperformed the other two strategies, ranking a valid definition at the top in 69% of the cases.

To conclude, a list of top-ranked definitions found by MKB-DWQA is presented:


5.4 Conclusions

In general terms, this chapter lays out two distinct approaches to multilingual definition QA systems operating on web snippets. In the first place, it deals with an answering technique based largely on redundancy. This method infers statistics from the set of answer candidates extracted from web snippets. These statistics are grounded on term frequencies, potential named entities, and content words. These content terms are determined from the semantic space provided by LSA, and they are particularly interesting as they are likely to be potential signals of essential characteristics of the definiendum. Above all, the multilinguality of this system lies in these statistics and in its minimal use of linguistic knowledge. This means that language portability comes at the expense of avoiding the exploitation of linguistic tools. Some findings are the following:

1. Contrary to [Cui et al., 2004c], empirical results support the fact that web snippets can be a productive source of descriptive information. This claim is supported by two evaluations: (a) web snippets yielded at least one nugget for a large fraction of the queries embraced by five testing question sets; and (b) the performance of this system is also competitive, even when its output is juxtaposed with the gold standard emanating from the TREC 2003 corpus. This was possible because web snippets embodied numerous nuggets within this ground truth.

2. In substance, results reveal that statistics collected from the candidate set cooperate in recognising the most statistically relevant descriptions. However, extra resources are indispensable for languages other than English, like Spanish. The figures for Spanish are, nonetheless, positively encouraging with respect to coverage and the CLEF ground truth. More importantly, these statistically relevant descriptions were identified without grabbing “annotated" descriptive knowledge produced by KBs.

3. On the whole, the outcomes show the significant impact of the redundancy of information across both the English and Spanish Web; in particular, the correlation of terms with the definiendum across answer candidates that match definition patterns.

On a different note, M-DEFWEBQA pioneers attempts by definitional QA systems to disambiguate descriptive utterances. One finding is that web snippets do not provide the necessary information for a complete disambiguation. To overcome this difficulty, external resources such as full documents, WordNet and/or extra queries could be explored as a source for fetching extra information from the Internet. Nevertheless, this remains a very difficult problem as many definienda can bear slightly different senses (e.g., George Bush). In the second place, the previous definition QA system was enriched with definition knowledge originating from Wikipedia abstracts, namely sentences matching definition patterns. More specifically, this enrichment encompassed templates and word association norms deduced from this array of training sentences. Contrary to the trend in TREC, this system benefits from these resources in an anonymous way, that is, by learning common regularities across all descriptive sentences coming from all articles instead of accessing solely articles connected with each particular definiendum. The bottom line is that the advantage of these models is two-fold: (a) they can operate independently of the variations among definienda in terms of KB coverage; and (b) they were almost transparently employed to tackle definition questions in English and Spanish. Some conclusions in this regard are:

1. Overall, results demonstrate that the training corpus was instrumental for discerning nuggets that are low in frequency, and ergo undetected by the first system. Two ramifications are: (a) it is possible to automatically acquire a corpus of positive samples that can aid in building models capable of distinguishing these nuggets low in frequency; and (b) this procedure can port to Spanish; however, the figures suggest that the problem of data sparseness becomes graver.

2. Results also shed light on the pivotal role of syntactic information in the process of rating answer candidates. Additional syntactic knowledge can cause the proliferation of rewritings of the same descriptions in the output, put differently, an increment in the level of redundancy in the final response. Here, an extra challenge is recognising essential morpho-syntactical variations of descriptive sentences, which would help to decrease the redundancy of the output. At any rate, this redundancy can still be useful for discovering answers to definition queries in the context of the TREC/CLEF QA tracks, by projecting these redundant phrases onto the corresponding corpus.

3. By the same token, word association norms proved to be a cost-efficient solution for inferring shallow but pertinent lexico-syntactic relations, with which putative answers can be rated afterwards.

A final remark on multi-linguality regards the fact that only English and Spanish were the target of the experiments. Multi-linguality in definition QA is also an interesting research field, and the reader can dig deeper into this by referring to, for instance, [Sacaleanu et al., 2007, 2008] to study systems in other languages, including German.

Chapter 6

Using Language Models for Ranking Answers to Definitions Questions

“But it must be recognized that the notion ’probability of a sentence’ is an entirely useless one, under any known interpretation of this term." (Noam Chomsky)

“Dependence is the norm rather than the contrary." (B. De Finetti, Theory of Prob- ability)

6.1 Introduction

Chiefly, the discussion in chapter 4 was centred on Question Answering (QA) systems that profit from Knowledge Base (KB) articles about the definiendum for finding answers to definition questions across the AQUAINT corpus. The prominent features that these systems learn consist of word frequency (correlation) counts. These counts are projected onto the set of answer candidates by means of purpose-built heuristics. Habitually, both resources and attributes play the vital role of modelling the positive evidence that heuristics exploit for scoring candidates. In recent years, a lot of effort has gone into devising more efficient projection strategies. In this respect, incipient methodologies have been contemplating the use of Language Models (LM) for ranking answer candidates. Intrinsically, the reason to prefer LMs to other techniques lies in the fact that it is relatively easy to acquire positive samples (i.e., KB articles about the definiendum). Conversely, there is no dependable source of negative samples. Conventionally, their acquisition demands considerable effort, normally in terms of manual annotations. Moreover, an additional reason that makes LMs a propitious tool is their efficiency in speech recognition and other related QA tasks [Zhai and Lafferty, 2004, Cui et al., 2007]. In other words, systems are moving from utilising heuristics to well-known statistical methods that can learn solely from one (positive) class. This shift is driven not only by the hope of enhancing the performance of QA systems, but also by the purpose of benefiting from widely known techniques, in such a way that they can deepen the understanding of the underlying linguistic phenomena behind definition QA. Another focus of attention is broadening the coverage of KBs, since most approaches bank heavily on them when constructing their models. In so doing, systems try to discover localised contexts carrying the definiendum across the Internet, and/or they also develop wrappers for a systematically augmenting number of KBs. However, the latest strategies attempt to


capitalise on specific models of definitions, which are used in agreement with the type of definiendum in question. Typically, these models comprise persons, organisations and things, and they are gathered from KBs by setting apart articles targeting the respective classes. For the most part, this seems to be the most promising trend as it attenuates the sharp variations in, or dependency on, the coverage supplied by KBs for each definiendum. From another angle, the trend of definition QA systems takes advantage of unigrams as attributes. In this regard, [Chen et al., 2006] studied the effect of unigrams and bigrams as well as biterms on LMs. In a statement, their results suggest that flexible word orderings are more salient features for rating answer candidates. The exploitation of word association norms shown in section 5.3 (page 128) also ratifies the importance of more informative properties. This chapter addresses LMs for ranking putative answers to definition questions, and it is organised as follows. The next section goes over the best Text REtrieval Conference (TREC) 2007 system. Singularly, this definition QA system merges the outcome of four different LMs for scoring candidate answers. Subsequently, section 6.3 contrasts the performance achieved by LMs combined with three distinct features (i.e., unigrams, bigrams, biterms), and posteriorly, section 6.4 touches on a strategy that integrates assorted LMs. More interestingly, this approach encompasses a definition model that synthesises: (1) one model dependent on the type of definiendum (i.e., person, organisation, and thing); and (2) another relying on general definitions. Additionally, this framework is put together with a topic model homologous to the traditional models inferred from articles about the definiendum across KBs and/or the target collection of documents. Later, section 6.5 extends the idea of specific models to context LMs, which are automatically constructed on top of dependency tree representations of the training material. Ergo, they make allowances for richer syntactic information than shallow biterm and/or unigram attributes. A crucial aspect of context models is their over 40,000 distinct specific models, with which they reduce their reliance upon KBs. Lastly, section 6.6 draws some conclusions.

6.2 Language Models and Dependency Relations

In TREC 2007, the best run in the definition QA track belonged to the system implemented by [Qiu et al., 2007]. Their strategy ranked answer candidates by intermixing the influence of three distinct kinds of ingredients:

(a) Features learnt from LMs. This type branches into four specific properties corresponding to the probability of a sentence in conformity with LMs learnt from four training corpora: (1) AQUAINT+AQUAINT2, (2) the previous two collections, but in this configuration accounting for selective substitutions of numbers and entities such as locations, people, and organisations (these replacements share the same spirit as the ones illustrated in table 4.12 on page 93), (3) articles related to the definiendum emanating from Wikipedia, and (4) one hundred snippets fetched from a commercial search engine. It is worth noting here that in configurations (3) and (4), [Qiu et al., 2007] made use of Dirichlet smoothing for the purpose of tackling the data sparseness of the LMs obtained from Wikipedia articles and web snippets.

(b) Four types of syntactic dependency relations. This second sort of feature is premised on lexical relations outputted by Minipar1. Specifically, [Qiu et al., 2007] benefited from the chi-square test to select four classes of dependency relations as binary attributes: “punc", “appo", “pcomp-n" and subject “s". Every time any of these four types of relations appears in a candidate answer, the respective feature is set to one if and only if one term in the relation is not a stop-word and is contained in the definiendum, while at the same time the other term does not belong to the definiendum. In any other case, the respective feature value is set by default to zero.

1Minipar is available under: www.cs.ualberta.ca/∼lindek/minipar.htm

(c) The ranking value returned by the Information Retrieval (IR) engine.

In the TREC 2007 challenge, the best response of this system reaped an average F(3)-Score of 0.329. The mean of all responses in this track was 0.118. Another aspect worth stressing is that their redundancy removal methodology checks the word overlap of the next answer candidate with all previously selected answers. The best run, at any rate, did not contemplate a redundancy removal step.

6.3 Unigrams, Bigrams or Bi-terms

In their work, [Chen et al., 2006] studied the influence of three types of ingredients on LMs designed particularly for ranking definitions. These three classes of features comprise unigrams, bigrams and biterms. Generally speaking, unigram LMs assume that each word occurs within the answer candidate independently. This assumption is, however, barely truthful, and therein lies the attempt of [Chen et al., 2006] to test two extra models that also account for the local context when rating putative answers. In effect, the difference between bigrams and biterms is that the former perceives pairs of consecutive tokens as attributes, while the latter loosens the order constraint. In practical terms, they took advantage of the min-Adhoc method, proposed by [Srikanth and Srihari, 2002], for estimating biterm probabilities:

P_{BT}(w_i|w_{i-1}) = \frac{Counts(w_{i-1}, w_i) + Counts(w_i, w_{i-1})}{\min\{Counts(w_{i-1}), Counts(w_i)\}}

More exactly, [Chen et al., 2006] induced the values of P_BT from the ordered centroid representation of training sentences coming from web snippets. With regard to unigrams and bigrams, [Chen et al., 2006] utilised the maximum likelihood estimates (see section 4.9 on page 93). The web snippets used for training their models were fetched by expanding the query with task-specific cues (see details in section 2.4 on page 35). Subsequently, they singled out the top-350 stemmed words that co-occur with the definiendum across the fetched surrogates in accordance with the centroid vector (see equation 4.8 on page 92). Training (web) sentences are posteriorly rewritten by making use of these 350 stemmed terms and preserving the original order. To illustrate this, [Chen et al., 2006] brought out the following example:

The corresponding ordered centroid vectors become the words:

Note that in this representation stop-words are ignored and terms are stemmed. LMs are accordingly built on top of this abstraction of the training sentences. These models are used thereafter for ranking an answer candidate A with n tokens t_i as follows:

            F(5)-Score with      F(5)-Score without
            Query Expansion      Query Expansion
unigrams        0.472                0.459
bigrams         0.518                0.505
biterms         0.531                0.511

Table 6.1: Comparison of LMs coupled with unigrams, bigrams and biterms (source [Chen et al., 2006]).

P(A|\mu) = P(t_1|\mu) \prod_{i=2}^{n} \left( \lambda \, P(t_i|\mu) + (1-\lambda) \, P(t_i|t_{i-1},\mu) \right)

In this equation, µ denotes the LM inferred from the ordered centroid vector, and λ is the mixture weight, which is approximated by the Expectation Maximisation (EM) iterative procedure (see equation 4.9 on page 98). The final score of an answer candidate A is determined by weighting the logarithm of this factor with a brevity penalty:

Score(A) = \log(P(A|\mu)) \cdot \exp\left( \min\left( 1 - \frac{L_{ref}}{L_A},\; 1 \right) \right)

L_{ref} and L_A stand for the length of a reference and of the candidate answer, respectively. Table 6.1 parallels the figures achieved by the three different features when assessed by means of the TREC 2003 definition question set. This table also contrasts the performance of this system with and without the enrichment supplied by the query expansion method. For the most part, bigram LMs were observed to significantly improve the quality of the extracted answers. Furthermore, biterm language models yield even better results, showing that the flexibility and relative position of lexical terms capture shallow information about their syntactic relation [Belkin and Goldsmith, 2002].
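As a rough illustration of this scoring scheme, the following sketch scores a candidate with an interpolated unigram/pair model and the brevity penalty; the probabilities are assumed to be precomputed lookup tables (the pair model may hold bigram or min-Adhoc biterm estimates), λ is fixed rather than EM-estimated, and all names and fallback constants are illustrative.

import math

def score_candidate(tokens, p_uni, p_pair, lam=0.7, ref_len=20, floor=1e-9):
    # tokens: ordered centroid representation of the answer candidate.
    # p_uni[t] and p_pair[(prev, t)] are model probabilities; unseen events
    # fall back to a small floor value instead of proper smoothing.
    log_p = math.log(p_uni.get(tokens[0], floor))
    for prev, t in zip(tokens, tokens[1:]):
        mixed = lam * p_uni.get(t, floor) + (1 - lam) * p_pair.get((prev, t), floor)
        log_p += math.log(mixed)
    brevity = math.exp(min(1 - ref_len / len(tokens), 1))
    return log_p * brevity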

6.4 Topic and Definition Models

A different approach is due to [Han et al., 2006]. They modelled the definition QA task from two viewpoints: topic and definition. The difference between both models is illustrated by the following examples:

(a)

(b)

(c)

(d)

(e)


(f)

The definition model is aimed specifically at rating answer candidates in congruence with their evidence of being definitions, regardless of their relatedness to the definiendum. Inversely, the topic model is geared towards scoring candidate sentences in agreement with their evidence of relatedness to the definiendum, leaving aside their evidence measured in concert with the definition model. The following table clusters the working sentences in accordance with these two criteria:

               Definition (D)    Non-Definition (D¯)
Topic (T)           b, c               d, e
Non-Topic (T¯)       a                  f

Table 6.2: Groups of sentences according to [Han et al., 2006].

In this working example, candidate sentences (b) and (c) put into words descriptions of the definiendum (i.e., “Danielle Steel"). Sentence (a) is a definition, but it is delineating another definiendum, whereas sentences (d) and (e) are related to the definiendum, but the information they verbalise is not descriptive. Lastly, sentence (f) is both non-definitional and out-of-topic. Desirable answers are therefore candidate sentences (b) and (c). Consequently, [Han et al., 2006] rendered the definition QA task as the maximisation of the joint probability P(T, D|A), that is, the likelihood that a sentence A is a definition and belongs to the topic of the definiendum. An interesting conjecture of this approach is that it assumes the likelihood of being a definition to be independent of belonging to the topic. For instance, “〈definiendum〉 is a British politician" is a definition regardless of the definiendum. This assumption helps to rewrite and simplify P(T, D|A) as follows:

P(T, D|A) \approx \frac{P(T)\,P(A|T)}{P(A)} \cdot \frac{P(D)\,P(A|D)}{P(A)}

Since P(T) and P(D) do not affect the ranking of a set of candidate sentences, this equation can be reduced to the following expression:

DS(A) = \frac{P(A|T)}{P(A)} \cdot \frac{P(A|D)}{P(A)}

In light of this formula, [Han et al., 2006] specified P(A|T) as the topic model, P(A|D) as the definition model and P(A) as the general LM. To put it more precisely, the general LM P(A) was approximated by working out the product of the unigram probability of each word in A. These unigram probabilities are estimated by the Maximum Likelihood Estimate (MLE) based on the entire target collection (see section 4.9 on page 93). The topic LM is also predicated on unigrams, and it linearly combines the evidence originating from three distinct kinds of sources:

P(A|T) = \prod_{i=1}^{n} \left( \alpha \, P(w_i|R) + \beta \, P(w_i|E) + \gamma \, P(w_i|W) \right)

Where α, β and γ are the interpolation parameters, which are empirically set so that they add up to one. The probabilities P(w_i|R), P(w_i|E) and P(w_i|W) are determined from the following three different sources of descriptive evidence:

• P(w_i|R) is the likelihood that the word w_i is generated from the five top-ranked documents R fetched from the target collection of documents by an Information Retrieval (IR) engine. This probability is estimated by means of the MLE and using Dirichlet smoothing [MacKay and Peto, 1994]:

P(w_i|R) = \frac{Counts(w_i) + \mu \cdot P(w_i)}{\sum_{\forall w_j} Counts(w_j) + \mu}

Here, P(w_i) is the probability of w_i in consonance with the general model.
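A tiny sketch of this Dirichlet-smoothed estimate, assuming the counts are taken from the top-ranked documents R and the general-model probabilities are available as a lookup function (all names illustrative):

def dirichlet_estimate(counts, p_general, word, mu=2000):
    # counts: word frequencies observed in the top-ranked documents R;
    # p_general(word): unigram probability under the general model.
    total = sum(counts.values())
    return (counts.get(word, 0) + mu * p_general(word)) / (total + mu)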

• In like manner, P(w_i|E) and P(w_i|W) are computed from a group of articles mined from a battery of authoritative KBs, and from a group of ten web pages downloaded from the Internet using a search engine. The aim of making allowances for web pages is to alleviate the data sparseness that typifies the previous two models. The battery of KBs embraced:

– Acronym Finder
– biography.com
– Columbia Encyclopedia
– Wikipedia
– FOLDOC
– The American Heritage Dictionary of the English Language
– Google glossary (see section 2.3 on page 30)
– Online Medical Dictionary

The reader is encouraged to consult table 2.4 (page 31) in order to compare this list of KBs with the resources utilised by other TREC definition QA systems.

The smoothing parameter µ is optimally set between 500 and 10,000, and the performance is robust when it is equal to 2,000 [Zhai and Lafferty, 2004]. For the purpose of building the definition LM, [Han et al., 2006] gathered definition articles for arbitrary definienda from KBs. With this corpus they created one general model and three domain-specific models: person, organisation and term. Candidate sentences are next ranked in agreement with the interpolation of the domain model with respect to the definiendum and the general model:

P(A|D) = \prod_{i=1}^{n} \left( \lambda \, P(w_i|D) + (1-\lambda) \, P(w_i|D_{general}) \right)

Where P(w_i|D) and P(w_i|D_{general}) are the likelihoods of finding the word w_i in the domain-specific and the general corpus, respectively. These probabilities are estimated similarly to P(w_i|R), and λ is an experimental interpolation parameter. For building these models, they constructed a definition corpus from their KBs (utilised for topic modelling) encompassing 14,904 people, 994 organisations and 3,639 terms. All things considered, [Han et al., 2006] ranked answer candidates gathered by means of the patterns utilised by [Han et al., 2005] in congruence with the next formula:

Score(A) = \sum_{i=1}^{n} \Big[ \log\left( \alpha \, P(w_i|R) + \beta \, P(w_i|E) + \gamma \, P(w_i|W) \right) + \log\left( \lambda \, P(w_i|D) + (1-\lambda) \, P(w_i|D_{general}) \right) - 2 \log P(w_i) \Big]

The logarithm is used for coping with long sentences. Answers are chosen until a maximum length is satisfied, and answers are discarded if they share many words with an already selected answer. In some cases, they also used WordNet for discarding additional redundant answers. In their evaluation, [Han et al., 2006] accounted for the 50 and 64 questions and gold standards supplied by the TREC 2003 and 2004 challenges, respectively. They extended the TREC 2004 ground truth by appending the answers to the factoid questions related to the context of each definition question. This makes sense because these factual pieces of information can often be seen as parts of a response to a definition query. They carried out several experiments2 so that they could obtain the best configuration of parameters that deals most efficiently with definition questions. First, they set λ and µ to 0.6 and 2,000, respectively. Second, they adjusted the values of α, β and γ, discovering that the best combination of values for the TREC 2003 set is 0, 1 and 0, respectively. This configuration reached an average F(3)-Score of 0.3718 (recall = 0.4540; precision = 0.1990). Different settings obtained the best performance for the TREC 2004 data set: α=0.25, β=0.60 and γ=0.15. This setting finished with an average F(3)-Score of 0.4211 (recall = 0.4511; precision = 0.2790). Homologously to [Zhang et al., 2005], the parameters, on the whole, reveal that their models depend heavily on the information supplied by KBs about each single definiendum. Indeed, [Han et al., 2006] noticed that their pack of KBs yielded descriptions for 46 out of the 50 TREC 2003 definition questions, while only for 55 out of the 64 TREC 2004 definition questions. The considerable change in the values of the parameter set goes hand in hand with the slight decrease (from 92% to 86%) in the coverage offered by the KBs, that is, the less coverage they provide, the greater the need to increase the influence of other classes of resources. At any rate, [Han et al., 2006] conjectured that the help given by top ranked documents and web pages is insufficient. Incidentally, [Han et al., 2006] also tuned the value of λ. The best configuration set this value to one, signalling the relevance of the definiendum type when rating answer candidates. In the case of the TREC 2003 question set, this configuration achieved an average F(3)-Score of 0.3691 (recall = 0.4506; precision = 0.1983), while an average value of 0.4194 was obtained for the TREC 2004 data set (recall = 0.4491; precision = 0.2784). One interesting finding is that the word distributions amongst their different domain-specific models markedly vary, which is deemed to be the reason for the relevance of the type of definiendum indicated by the empirical values of λ. Lastly, it is worth remarking that [Han et al., 2006] took advantage of the BBN IdentiFinder [Bikel et al., 1999] for recognising the type of definiendum. The definiendum is labelled as a term whenever it is not classified as a person or organisation by this tool.
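A compact sketch of this combined scoring scheme is shown below, under the simplifying assumption that all component probabilities are available as lookup functions; the default parameters follow the TREC 2004 configuration reported above, and every name is illustrative rather than taken from [Han et al., 2006].

import math

def han_score(words, p_r, p_e, p_w, p_dom, p_gen_def, p_general,
              alpha=0.25, beta=0.60, gamma=0.15, lam=0.6, floor=1e-12):
    # Combines the topic model (interpolated over R, E and W), the definition
    # model (domain-specific interpolated with the general definition model),
    # and the general LM used as the normaliser.
    score = 0.0
    for w in words:
        topic = alpha * p_r(w) + beta * p_e(w) + gamma * p_w(w)
        definition = lam * p_dom(w) + (1 - lam) * p_gen_def(w)
        score += (math.log(max(topic, floor))
                  + math.log(max(definition, floor))
                  - 2 * math.log(max(p_general(w), floor)))
    return score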

6.5 Contextual Models and Descriptive Dependency Paths

A chief attribute shared by most of the definition QA systems that compete in the TREC challenge is their exploitation of the evidence derived from KBs. Most of the time, this evidence is the cornerstone of their answering strategy (a detailed list of systems and the KBs they exploit can be seen in table 2.4 on page 31). Essentially, in their analysis, [Zhang et al., 2005] revealed that the performance of this class of definition QA system depends heavily on the coverage supplied by external KBs. As [Zhang et al., 2005] also pointed out, this reliance on one particular resource can be mitigated by accounting for a larger number of KBs, as they are very likely to express descriptions for different definienda. A natural and tangential consequence of augmenting the number of KBs is that they can convey different paraphrases of the same descriptions, and it thus boosts the chances of matching actual definitions in the

2It is necessary to recall the fact that [Han et al., 2006] specified the F(3)-Score in a different manner than the TREC 2003 and 2004 evaluations. 150 Chapter 6. Using Language Models for Ranking Answers to Definitions Questions

target (AQUAINT) corpus. A third advantage is that it also cooperates on discriminating the most pertinent facets on the grounds of redundancy across the respective articles in the KBs. Following this trend, [Han et al., 2006] built their topic model on top of word distributions acquired from eight different KBs. Like other notable definition QA systems, including [Qiu et al., 2007], they also tackled the data sparseness of KBs by balancing this evidence with information emanating from web pages and the target group of answer candidates, as well as top-ranked documents retrieved from the target collection. Certainly, based on the findings of [H. Joho and M. Sanderson, 2000, 2001], making allowances for localised contexts of the definiendum makes perfect sense because descriptions can accompany instances of the definiendum, in particular when it is introduced in a piece of news for the first time3 (see section 4.2 on page 76). At any rate, the optimal combination of parameter values found by [Han et al., 2006] corroborates the findings of [Zhang et al., 2005], that is, the performance of their topic model relies largely on the evidence coming from KBs. On a different note, [Han et al., 2006] also underlined the fact that web pages and top-ranked documents fetched from the target collection are insufficient to tackle the data sparseness that characterises KBs. Of course, using KBs facilitates the construction of systems grounded on heuristics that are capable of finding the top-relevant facts about the definiendum. But, in general, there are five factors that make this type of technique less attractive:

1. Studies indicate that the performance falls into decline when the supplied coverage is restricted or no evidence is found at all [Zhang et al., 2005, Han et al., 2006]. This is due basically to the fact that this class of system redefines the answering process as a “more like this" task. Therefore, when narrow coverage is found in KBs, this “this" is not well-defined, materialising a marked drop in the performance.

2. Even though this “this" can be well-defined by the KBs, the performance of this class of definition QA systems is limited because they fail to recognise descriptions dissimilar to the contexts supplied by the respective KBs articles.

3. The data sparseness that typifies KBs inhibits the exploration of more efficient features that can be utilised for ranking answer candidates. That is, definition QA systems are restricted in the abstractions they can make in order to employ more complex structures (e.g., syntactic regularities, entities, trees and semantic classes) to recognise genuine answers from a target set of sentences. Although some systems do infer them from a large corpus of definitions [H. Joho and M. Sanderson, 2000, 2001, Blair-Goldensohn et al., 2003, Fernandes, 2004, Cui et al., 2007] (see sections 4.2 and 4.9 on pages 76 and 93, respectively), these inferences are normally in the form of constructs that are too general for recognising a diverse group of descriptions, or too specific to be frequently found in natural language texts. At the same time, these regularities do not yield a perfect accuracy, and definition QA systems therefore need to combine them with extra evidence. All things considered, this sort of system is fostered to rely chiefly on exploiting word overlaps.

4. Ultimately, it is the target collection of documents that determines the contexts and the senses of the definiendum, not a particular pack of KBs. Assuming otherwise turns the problem the other way around.

5. The constrained use of attributes prevents shedding light on the linguistic phenomena behind descriptions.

3 The reader has to bear in mind that the target collection of these systems is the AQUAINT corpus, and this encompasses an array of news articles.

Contrary to [Han et al., 2006], and like several systems such as [Figueroa and Neumann, 2007] (see section 2.5 on page 36), [Chen et al., 2006] tried to allay this coverage problem by collecting descriptions from web snippets by means of a query expansion method that biases the search engine in favour of contexts that are likely to render descriptions of the definiendum (see details in section 2.4 on page 35). Their outcomes indicate that a powerful search technique, focussed on increasing the recall of localised contexts bearing descriptive information about the definiendum, can cooperate on identifying more reliable distributions of words, ergo ameliorating the performance of their system. In essence, the query expansion of [Chen et al., 2006] assisted in learning the top 350 stemmed terms ranked in conformity with the centroid vector (see section 4.8 on page 91); in this way they distinguished the array of words that are most likely to elucidate various facets of the definiendum. Later, they used these 350 terms to build an ordered centroid vector representation of each sentence (embodying the definiendum) found across the retrieved web snippets. LMs were thereafter deduced on top of this representation.
The attractive aspect of the models adopted by [Chen et al., 2006] is that they capture local shallow syntactic regularities. These clues are predicated on bigrams and biterms, which are possible to detect due to the rise in recall generated by their query expansion method. Results demonstrated that these regularities bring about an enhancement in performance. To reinforce how these regularities might help, consider the following sentence regarding

“American author Danielle Steel"4:

A method grounded on simple stemmed word overlaps would find this sentence very likely to be an answer candidate, because the terms "American" and "author" are very likely to be found within localised and descriptive contexts about "Danielle Steel". Furthermore, words such as "book" and "success" could also be very delineative of "Danielle Steel". They can also cause this sentence to reach a very high score. The underlying idea of [Chen et al., 2006] is privileging sentences that also show syntactic regularities, for instance:

As proven by [Chen et al., 2006], preferring the match of these shallow syntactic rules over simple term overlaps materialises a significant enhancement in performance. These regularities, however, are not the magic bullet that solves this problem. Consider the following case:

Another issue that makes this technique less attractive is the threshold of 350 stemmed terms. On the one hand, it assists in identifying the words that outline the most critical facets of the definiendum. On the other hand, in many cases more terms are necessary to express all pertinent facets; this is the case, for instance, for biographical definienda such as presidents, or writers with a long list of awards and works. Another aspect is that this threshold can cut off several potential senses of the definiendum that are not prominent in the target corpus and/or the web snippets. Certainly, this problem is lessened by the size of the AQUAINT corpus.

4 For the sake of simplicity, only the string "American author" is deemed as the sole ordered centroid vector.

However, when porting these strategies to a massive collection like the Web, the ambiguity of the definiendum can be considerably boosted. A final issue is that some prominent paraphrases can "take over" the ordered centroid vector, causing a redundant output (e.g., "American writer" and "author born in the U.S."). On the whole, this threshold can worsen the informativeness and diversity of the output presented to the user.
In juxtaposition to [Chen et al., 2006], [Han et al., 2006] profited solely from unigrams as features, but their approach mixes their topic model with their definition model. This definition model puts together a general model with a definiendum-type model. While the general model seeks to recognise expressions that are common across definitions of wide-ranging topics, the definiendum-type model comprises specific models for people, organisations and terms. There are two conclusions from this approach worth emphasising: (a) [Han et al., 2006] noticed that the distribution of words differentiates sharply from one type of definiendum to another, and (b) the definiendum-type model greatly contributed to bettering the performance of their system. Still, there are two issues that make this technique less attractive:

1. As a matter of fact, [Han et al., 2006] took advantage of an external tool for determining the type of definiendum prior to ranking. As aforementioned, a definiendum can bear various types corresponding to numerous senses, thus rating answer candidates would entail using several models accordingly. For instance, "Calvin Klein" can involve a "person", a "company" and/or a "trademark". Nonetheless, all potential senses can only be known after determining the set of answer candidates, and consequently the type identified by an external tool will not necessarily match the predominant sense in the target collection.

2. Grouping several types of definienda under the model of "term" somehow resembles the general model and does not take advantage of the finding that the distributions of words of distinct types of definienda vary widely. In effect, one can reasonably expect to find markedly divergent word distributions for types such as "disease", "song", "company" and "language".

One of the key motivations behind context models is the finding of [Chen et al., 2006]: the amalgamation of semantic evidence (centroid words) with syntactic information (the relative order of centroid words) causes an enhancement in performance. In light of this result, it can be concluded that syntactic information plays an essential role in ensuring that descriptive words actually render a description within the candidate sentence, and it consequently helps to ameliorate the performance. Given this conclusion, [Figueroa and Atkinson, 2009] claimed that dependency paths offer a trade-off between the lexical semantics and the syntactic information required to typify definitions. To illustrate this, consider the following phrase (see footnote 5):

Human readers would quickly notice that the sentence is a definition of a novelist, despite the missing concept and words. This is made possible by the existence of two dependency paths, → → and → → . The former acts as a context indicator showing the type of definiendum being described, whereas the latter yields content that is very likely to be found across descriptions of this particular context indicator (i.e., novelist). In substance, highly frequent directed dependency paths within a particular context are hypothesised to significantly characterise the meaning when describing an instance of the respective context indicator. The usage of this linguistic construct is a vital distinction between context models and the unigrams utilised by [Han et al., 2006] or the biterms and bigrams of [Chen et al., 2006]. In truth, this class of property has aided in accomplishing good performance in other kinds of QA systems [Cheng et al., 2009].
A key difference from the vast majority of TREC systems is that the inference is drawn by using contextual information conveyed by several descriptions of novelists, instead of using additional sources that provide information about a particular definiendum (e.g., "Danielle Steel"). These contextual models are inferred automatically from Wikipedia, and they stand in contrast to the three definiendum-type models of [Han et al., 2006]. First, [Figueroa and Atkinson, 2009] do not capitalise on an external tool for distinguishing the type of definiendum prior to ranking; instead, they identify a context indicator for each answer candidate, bringing about the opportunity of coping with several potential senses. Secondly, [Han et al., 2006] utilised three specific models (person, organisation and term), while [Figueroa and Atkinson, 2009] build a model per context indicator. It is also worth recalling here that [Han et al., 2006] also constructed a general definition model, and additionally took advantage of articles taken from eight distinct KBs for building their topic model, while [Chen et al., 2006] used task-specific clues that biased the search engine in favour of articles from online dictionaries and encyclopedias. However, the approach of [Figueroa and Atkinson, 2009] sympathises with and extends the idea of [Han et al., 2006] of utilising specific definition models, but their view of "specific" is at the sentence level, and broader in terms of the level of "specification".

5 The placeholder Entity denotes a sequence of entities or adjectives that starts with a capital letter. This is explained in detail in the next section.

6.5.1 Building a Treebank of Contextual Definitions

Fundamentally, [Figueroa and Atkinson, 2009] benefit from training definition sentences to build a treebank of lexicalised dependency trees. These trees are thereafter automatically clustered according to their context indicators, and contextual n-gram LMs are afterwards constructed on top of the resulting contextual dependency paths. The following is a breakdown of the steps taken for building the contextual treebank of definitions:

1. Abstracts are excerpted from the January 2008 snapshot of Wikipedia, and all wiki annotations, such as brackets, are removed. By "abstract" is meant the first section of the document, which is typically a succinct summary of the key aspects, most important achievements and events of the corresponding definiendum. Accordingly, heuristics are used for removing undesirable abstracts corresponding to pages such as lists.

2. Then, the Stanford Named Entity Recogniser (NER)6 is utilised for recognising named entities across all abstracts. The following classes are accordingly replaced with a placeholder (Entity): PERSON, DATE and LOCATION. This allows reducing the sparseness of the data and obtaining more reliable frequency counts later. In like manner, numbers and capitalised adjectives, as well as sequences of words that start with a capital letter, were mapped to the same placeholder. It is worth noting that none of these replacements is applied to an "entity" that has terms overlapping with the title of the article, and that sequences of this placeholder are fused into a sole instance.

3. Sentences are identified by means of JavaRap.

4. Unlike other definition QA systems [Hildebrandt et al., 2004], definition patterns are applied at the surface level, similar to [Soubbotin, 2001, H. Joho and M. Sanderson, 2000, 2001] (see section 4.2 on page 76). For this purpose, the lexico-syntactic constructs listed in table 3.1 (page 65) were used. Eventually, only matched sentences qualify for the following steps, and pronouns are acceptable occupiers of the slot corresponding to the definiendum in the array of definition patterns. At this step, soft patterns could also be exploited instead of, or in conjunction with, this group of rules (see section 4.9 on page 93).

6 http://nlp.stanford.edu/software/CRF-NER.shtml

5. Subsequently, sentences are "anonymised": occurrences of the topic/title of the article are replaced with a placeholder (CONCEPT). It is clear that some sentences do not exactly match the pre-defined pack of clauses; take, for instance, the following group of illustrative sentences:

An array of these underlined expressions, which precede the topic/title of the article, was collected from the sentences that aligned with the definition patterns. A set of templates was made out of these expressions by replacing numbers, possessive pronouns and capitalised words with a placeholder. Here, the first personal pronoun was interpreted as the end of the template. The 4,259 most frequent templates were kept, and every time one of these templates matched a training sentence, the corresponding piece of text was removed. Most of these templates involve discourse markers, phrases that temporally anchor the sentence, and cataphora. Notably, [Figueroa and Atkinson, 2009] assume that all pronouns refer to the concept dealt with by the abstract of the article. In the previous step (4), only the existence of the lexico-syntactic clue was verified; now, a full alignment is performed by means of the Jaccard Measure (see section 3.4 on page 61) in conjunction with the experimental thresholds presented in table 3.1 (page 65). Sentences satisfying the conditions imposed by this group of cues are kept and forced to start with the placeholder CONCEPT (some examples can be seen in table 6.4; a minimal sketch of these substitutions is given after this list).

6. Preprocessed sentences are subsequently parsed by using a lexicalised dependency parser7, and the extracted lexical trees are used for building a treebank of lexicalised definition sentences.
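To make the substitution steps above concrete, the following is a minimal Python sketch of the placeholder logic described in steps 2 and 5. It is an illustration under simplifying assumptions rather than the implementation of [Figueroa and Atkinson, 2009]: the regular expressions, helper names and example sentence are hypothetical, and a real pipeline would rely on the Stanford NER output instead of capitalisation heuristics alone.

import re

ENTITY = "Entity"
CONCEPT = "CONCEPT"

def mask_entities(sentence, title):
    """Replace capitalised sequences and numbers with the Entity placeholder
    (step 2), skipping spans that overlap with the article title."""
    title_terms = set(title.lower().split())

    def repl(match):
        span = match.group(0)
        # Keep spans sharing terms with the title so they can later be
        # anonymised as CONCEPT instead of Entity.
        if set(span.lower().split()) & title_terms:
            return span
        return ENTITY

    # Capitalised word sequences (names, places, capitalised adjectives).
    masked = re.sub(r"(?:[A-Z][\w-]*)(?:\s+[A-Z][\w-]*)*", repl, sentence)
    # Bare numbers.
    masked = re.sub(r"\b\d+(?:[.,]\d+)?\b", ENTITY, masked)
    # Fuse consecutive placeholders into a single instance.
    return re.sub(rf"(?:{ENTITY}\s+)+{ENTITY}", ENTITY, masked)

def anonymise(sentence, title):
    """Replace occurrences of the article title with CONCEPT (step 5)."""
    return re.sub(re.escape(title), CONCEPT, sentence, flags=re.IGNORECASE)

if __name__ == "__main__":
    title = "Danielle Steel"
    raw = "Danielle Steel is an American author of romance novels born in 1947 ."
    print(anonymise(mask_entities(raw, title), title))
    # -> "CONCEPT is an Entity author of romance novels born in Entity ."

The key design point illustrated here is that spans overlapping the article title are deliberately left untouched by the Entity substitution, so that the later anonymisation step can map them to CONCEPT.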

Overall, the source treebank comprises trees for 1,900,642 different sentences taken exclusively from anonymised Wikipedia abstracts. From the sentences in the treebank, the method of [Figueroa and Atkinson, 2009] identifies potential context indicators. This is in sharp contrast to [Han et al., 2006], who used three different models constructed at the definiendum level, and it is distinctly different from learning models on top of web snippets [Chen et al., 2006]. Context indicators are words that signal what is being defined or what type of descriptive information is being expressed. They are recognised by walking through the dependency tree starting from the root node.

7 http://nlp.stanford.edu/software/lex-parser.shtml

Indicator      P(cs)    Indicator    P(cs)    Indicator    P(cs)    Indicator    P(cs)
born           1.503    author       1.3160   novel        1.2398   title        1.1848
album          1.4604   term         1.3140   center       1.2390   used         1.1844
member         1.4505   series       1.3138   artist       1.234    officer      1.1837
player         1.3836   politician   1.3007   singer       1.2338   single       1.178
film           1.3738   group        1.2976   director     1.2297   coach        1.1741
town           1.3724   character    1.2946   community    1.2193   poet         1.1721
school         1.3521   actor        1.2880   program      1.2154   journalist   1.1708
village        1.3502   city         1.2856   known        1.2140   musician     1.1703
station        1.3446   writer       1.2738   site         1.2109   composer     1.168
son            1.3346   species      1.2492   professor    1.210    place        1.168
company        1.3281   footballer   1.2450   district     1.2058   painter      1.1666
game           1.3193   area         1.2443   leader       1.2056   daughter     1.1643
organization   1.3183   book         1.2435   team         1.199    producer     1.1596
band           1.3179   genus        1.2405   club         1.1907   language     1.159
song           1.316    actress      1.2398   episode      1.1901   home         1.1579

Table 6.3: Some interesting context indicators based on log10 of the frequencies (note: the values shown are P(cs) × 10^4).

Only sentences matching definition patterns and that start with the placeholder CONCEPT are taken into account, so there are some clauses that are useful for the purpose of finding the corresponding context indicators. Since the root node itself is typically a word contained in the surface patterns (e.g., is, was and are), and hence not a context indicator, the method walks down the hierarchy. In the case that the root has several children, the first child (different from the concept) is conceived as the context indicator. Note that the method must sometimes go down one more level in the tree, depending on the expression holding the relationship between nodes (i.e., "part/kind/sort/type/class/first of "). Furthermore, the lexical parser used outputs trees that meet the projection constraint, and by the same token, the order of the sentence is preserved. Overall, 45,698 different context indicators were obtained during parsing. Table 6.3 shows the most frequent indicators acquired by this method, where P(cs) is the probability of finding a sentence triggered by the context indicator cs within the treebank.
Candidate sentences are later grouped in congruence with the obtained context indicators (see table 6.4). One last remark on the context indicators is due to the definition pattern " [which|that|who] ". Occasionally, the indicator is a verb, but in practice it works in the same way as the nouns shown in table 6.3. The following sentences

illustrate this similarity:

Highly frequent directed dependency paths within a particular context are hypothesised to significantly characterise the meaning when describing an instance of the corresponding context indicator. This is predicated strongly on the extended distributional hypothesis [Lin and Pantel, 2001]: if two paths tend to occur in similar contexts, their meanings tend to be similar. In addition, the relationship between two entities in a sentence is almost exclusively concentrated in the shortest path between the two entities in the undirected version of the dependency graph [Bunescu and Mooney, 2005]. Ergo, one entity can be interpreted as the definiendum, and the other can be any entity within the sentence. Paths linking a particular type of definiendum with a class of entity relevant to its type will therefore be highly frequent in the context (e.g., → → → ). Note that using paths cushions the effect of not knowing the exact category of the entity. For instance, the entity in the previous path will be a work, because the linked sequence of words undoubtedly signals this. Some paths, nonetheless, can still be ambiguous: → → .

On a final note, a small random sample (1,162 out of 1,900,642) of sentences in the treebank was manually checked for the purpose of estimating the amount of wrongly automatically annotated samples (false positives). In short, only 4.73% of these selected sentences were judged as spurious descriptions.

Context Indicator   Terms
Author              CONCEPT was a Entity author of children's books .
                    CONCEPT is the author of two novels : Entity and Entity .
                    CONCEPT is an accomplished author .
                    CONCEPT is an Entity science fiction author and fantasy author .
                    CONCEPT is a contemporary Entity author of detective fiction .
                    CONCEPT became an author after a career as an entrepreneur .
                    CONCEPT , a Entity children's author .
                    CONCEPT has been the author of several religious publications .
Player              CONCEPT is a Entity football player , who plays as a midfielder for Entity .
                    CONCEPT is a Entity former ice hockey player .
                    CONCEPT is a Entity trumpet player .
                    CONCEPT , a former Entity player for the Entity .
Disease             CONCEPT is a fungal disease that affects a wide range of plants .
                    CONCEPT is a disease of plants , mostly trees , caused by fungi .
                    CONCEPT is a chronic progressive disease for which there is no cure .
                    CONCEPT , a disease in chickens and other birds , affects only hens .
Song                CONCEPT is a Entity song by Entity taken from the Entity album Entity .
                    CONCEPT is a Entity song performed by the Entity band Entity .
                    CONCEPT is a pop song written by Entity and Entity , produced by Entity for Entity's first album Entity .
                    CONCEPT , the title of a song by Entity .
                    CONCEPT , the theme song for the Entity film .
                    CONCEPT , the theme song to the film performed by Entity .
                    CONCEPT has been the official state song of Entity since Entity .

Table 6.4: Some examples of sentences grouped in concert with their context indicators.
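As an illustration of the root-walk used to obtain context indicators, described earlier in this section, the following is a minimal Python sketch over a toy dependency-tree structure. The Node class, the list of pattern words and the relational nouns are assumptions made for exposition; this is not the author's code, and a real implementation would operate on the output of the lexicalised parser.

from dataclasses import dataclass, field
from typing import List

# Words that occur in the surface definition patterns and therefore cannot
# serve as context indicators themselves.
PATTERN_WORDS = {"is", "was", "are", "were", "be", "became"}
# Relational heads that force the walk to descend one extra level.
RELATIONAL = {"part", "kind", "sort", "type", "class", "first"}

@dataclass
class Node:
    word: str
    children: List["Node"] = field(default_factory=list)

def context_indicator(root: Node) -> str:
    """Walk down from the root until a plausible context indicator is found."""
    node = root
    while True:
        if node.word.lower() not in PATTERN_WORDS and node.word != "CONCEPT":
            if node.word.lower() in RELATIONAL and node.children:
                # e.g. "kind of <noun>": descend one more level.
                node = node.children[0]
                continue
            return node.word
        # Otherwise pick the first child different from the concept placeholder.
        candidates = [c for c in node.children if c.word != "CONCEPT"]
        if not candidates:
            return node.word
        node = candidates[0]

if __name__ == "__main__":
    # Toy parse of "CONCEPT is a contemporary Entity author of detective fiction"
    tree = Node("is", [Node("CONCEPT"),
                       Node("author", [Node("contemporary"), Node("Entity")])])
    print(context_indicator(tree))   # -> author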

6.5.2 Learning Contextual Language Models

For each context, all directed paths encompassing two to five nodes are collected. Longer paths are not taken into consideration, as they are likely to indicate weaker syntactic/semantic relations. Directed paths are preferred because pertinent syntactic information regarding word order would be missed when going up the dependency tree; moreover, undirected graphs would lead to a significant increment in the number of paths, as a path could go from any node to any other node. Some illustrative directed paths obtained from the treebank for the context indicator author are shown below (a minimal sketch of this collection step follows the illustrative paths):

→ → →

→ → → →

6.5. Contextual Models and Descriptive Dependency Paths 157

→ → →

→ →

→ → → →

→ → → →

→ → → →

→ → → →

→ → →

→ →

→ → → →

→→

→ →
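The following is a minimal Python sketch of the path collection just described, using a toy head-to-dependents dictionary as the tree representation. The data structure, names and example parse are illustrative assumptions, not the original implementation.

from typing import Dict, List, Tuple

# A toy dependency tree: head word -> list of dependent words (directed edges).
Tree = Dict[str, List[str]]

def directed_paths(tree: Tree, min_len: int = 2, max_len: int = 5) -> List[Tuple[str, ...]]:
    """Collect all downward (directed) paths of min_len..max_len nodes,
    starting from any node in the tree; duplicates are removed."""
    paths = set()

    def walk(node: str, prefix: Tuple[str, ...]):
        prefix = prefix + (node,)
        if min_len <= len(prefix):
            paths.add(prefix)
        if len(prefix) < max_len:
            for child in tree.get(node, []):
                walk(child, prefix)

    for start in tree:                 # start a walk at every head word
        walk(start, ())
    return sorted(paths)

if __name__ == "__main__":
    # Toy parse of "CONCEPT is an accomplished author"
    tree = {"is": ["CONCEPT", "author"], "author": ["an", "accomplished"]}
    for p in directed_paths(tree):
        print(" -> ".join(p))
    # prints six paths, e.g. "is -> author -> accomplished" and "author -> an"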

From the obtained dependency paths, an n-gram statistical LM (n = 5) was built as a means of estimating the most relevant dependency paths for each context. The likelihood of a dependency path \vec{dp} in a context c_s is defined by the likely dependency links that compose the path in the context c_s, with each link probability conditional on the last n − 1 connected terms:

P(\vec{dp} \mid c_s) \approx \prod_{i=1}^{l} P(w_i \mid c_s, w_{i-n+1}^{i-1})        (6.1)

where P(w_i | c_s, w_{i-n+1}^{i-1}) is the probability of term w_i being linked with the previous word w_{i-1} after seeing the dependency path w_{i-n+1} ... w_{i-1}. Simply put, it is the likelihood that w_i is a dependent node of w_{i-1}, w_{i-2} is the head of w_{i-1}, and so forth. The probabilities P(w_i | c_s, w_{i-n+1}^{i-1}) are usually given by computing the MLE:

P(w_i \mid c_s, w_{i-n+1}^{i-1}) = \frac{count(c_s, w_{i-n+1}^{i})}{count(c_s, w_{i-n+1}^{i-1})}

However, when utilising dependency paths, the count count(c_s, w_{i-n+1}^{i}) can frequently be greater than count(c_s, w_{i-n+1}^{i-1}). For example, in the definition sentence "CONCEPT is a band formed in Entity in Entity .", the term "formed" is the head of two instances of "in"; hence the denominator of P(w_i | c_s, w_{i-n+1}^{i-1}) is the number of times w_{i-1} is the head of a word (after seeing w_{i-n+1}^{i-1}). The obtained 5-gram LM is smoothed by interpolating with shorter dependency paths [Chen and Goodman, 1996, Goodman, 2001, Zhai and Lafferty, 2004] as follows:

P_{interp}(w_i \mid c_s, w_{i-n+1}^{i-1}) = \lambda_{c_s, w_{i-n+1}^{i-1}} \, P(w_i \mid c_s, w_{i-n+1}^{i-1}) + \bigl(1 - \lambda_{c_s, w_{i-n+1}^{i-1}}\bigr) \, P_{interp}(w_i \mid c_s, w_{i-n+2}^{i-1})

The probability of a path is accordingly determined as shown in equation 6.1, by accounting for the recursive interpolated probabilities in place of the raw P values. Note also that \lambda_{c_s, w_{i-n+1}^{i-1}} is calculated for each context c_s as proposed by [Chen and Goodman, 1996]. A candidate sentence A is ranked in consonance with its likelihood of being a definition as follows:

rank(A) = P(c_s) \sum_{\vec{dp} \in A} P(\vec{dp} \mid c_s)        (6.2)

In order to avoid counting redundant dependency paths, only paths ending with a leaf node are taken into account, whereas duplicate paths are discarded.
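To make equations 6.1 and 6.2 and the interpolation step concrete, the following is a minimal, self-contained Python sketch. It is an illustration under simplifying assumptions: a single fixed interpolation weight stands in for the per-context λ values estimated as in [Chen and Goodman, 1996], the toy training paths are invented, and the class and function names are hypothetical.

from collections import defaultdict
from typing import List, Tuple

Path = Tuple[str, ...]

class ContextPathLM:
    """Toy interpolated n-gram model over dependency paths for one context
    indicator; `lam` replaces the per-history weights of Chen and Goodman."""

    def __init__(self, n: int = 5, lam: float = 0.7):
        self.n, self.lam = n, lam
        self.num = defaultdict(int)   # counts of (history + word)
        # `den` counts how often the history is extended by some dependent,
        # i.e. how often w_{i-1} acts as a head after this history.
        self.den = defaultdict(int)

    def train(self, paths: List[Path]):
        for path in paths:
            for i in range(1, len(path)):
                for k in range(1, min(self.n, i + 1)):
                    hist = path[i - k:i]
                    self.num[hist + (path[i],)] += 1
                    self.den[hist] += 1

    def _prob(self, word: str, hist: Path) -> float:
        if not hist:
            return 0.0
        mle = self.num.get(hist + (word,), 0) / self.den.get(hist, 1)
        # Interpolate with the next-shorter history (equation above).
        return self.lam * mle + (1 - self.lam) * self._prob(word, hist[1:])

    def path_prob(self, path: Path) -> float:
        p = 1.0
        for i in range(1, len(path)):
            hist = path[max(0, i - self.n + 1):i]
            p *= self._prob(path[i], hist)
        return p

def rank(sentence_paths: List[Path], lm: ContextPathLM, p_cs: float, seen: set) -> float:
    """Equation 6.2 restricted to novel, deduplicated paths: paths already in
    `seen` (previously selected answers) contribute nothing."""
    novel = {p for p in sentence_paths if p not in seen}
    return p_cs * sum(lm.path_prob(p) for p in novel)

if __name__ == "__main__":
    training = [("author", "of", "novels"), ("author", "of", "fiction"),
                ("author", "born", "Entity")]
    lm = ContextPathLM(n=3, lam=0.7)
    lm.train(training)
    print(lm.path_prob(("author", "of", "novels")))

The rank function mirrors equation 6.2 restricted to novel paths, which is how the answer extractor described in section 6.5.3 uses the model.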

Why n=5?

Conventionally, the value of n oscillates between one and three. In this respect, [Figueroa and Atkinson, 2009] took longer n-grams into consideration as a means of rewarding some prominent paths that can establish relations with other entities in candidate sentences. In order to verify this, the different contexts were scanned; as a result, more than 1,200 paths of length four and five were found that occur in more than fifty distinct contexts. For instance, some of these observed tetragrams and pentagrams include:

Tetragrams Pentagrams

→ → → → → → →

→ → → → → → →

→ → → → → → →

→ → → → → → →

→ → → → → → →

→ → → → → → →

→ → → → → → →

Table 6.5: Predominant tetragrams and pentagrams in context models that can link the definien- dum with an entity.

Furthermore, [Figueroa and Atkinson, 2009] also observed that longer paths tend to signal weaker relations. Nevertheless, some noticeable pentagrams can still imply a relationship between the definiendum and a pair of entities. Some noticeable paths of length five are as follows:

Pentagrams

→ → → → → → → →

→ → → → → → → →

→ → → → → → → →

→ → → → → → → →

→ → → → → → → →

→ → → → → → → →

→ → → → → → → →

Table 6.6: Prominent pentagrams in context models that can connect the definiendum with two entities.

The reason why taking these longer paths into consideration is beneficial is two-fold:

(a) Definition QA systems are geared towards finding a group of sentences that express succinct and diverse information about the definiendum. Biasing the ranking in favour of sentences that can link the definiendum with several (two) entities is therefore naturally preferable to choosing two quite similar sentences that independently link both entities with the definiendum. This phenomenon is chiefly due to the fact that human writers can use short form variants when the full meaning can be deduced from the context [Savary and Jacquemin, 2003]. These contexts (sentences) are thus more concise, and ergo more suitable to incorporate into the final output.

(b) Matching longer and more frequent paths ensures grammaticality, that is, matching fuller ideas or descriptions. This might not be so important when dealing with documents that provide some structure, like clearer sentence or paragraph delimitations. It is critical, however, when coping with a noisy and ungrammatical target corpus such as web snippets. It is worth recalling that search engines usually truncate web snippets by inserting intentional breaks (. . .), making them ungrammatical. In a nutshell, privileging longer paths assists in tackling truncations and biasing the ranking in favour of more complete definitions. This also subsumes some paths that do not carry entities (e.g., → → → → ).

6.5.3 Extracting Candidate Answers

The model of [Figueroa and Atkinson, 2009] sifts answers to definition questions from web snippets. Firstly, sentences are distinguished by means of truncations and JavaRap. Secondly, sentences matching the definition patterns at the surface level illustrated in table 3.1 (page 65) are singled out. Thirdly, matched sentences are "anonymised" and forced to start with the placeholder CONCEPT, similarly to the training sentences. Fourthly, these matched sentences are parsed in order to obtain the corresponding lexicalised dependency trees. Lastly, given an array of test sentences/dependency trees extracted from the surrogates, this approach discovers answers to definition questions by iteratively selecting sentences.

The general strategy for this iterative selection task can be seen in algorithm 1, whose input is the set of dependency paths (T). It first initialises a set φ, which keeps the dependency paths belonging to previously chosen sentences (line 1). Later, context indicators for each candidate sentence are extracted so as to build a histogram indHist (line 2). Since highly frequent context indicators show more reliable potential senses, the method favours candidate sentences based on their context indicator frequencies (line 3). Sentences matching the current context indicator are ranked in concert with equation 6.2 (lines 7 and 8). However, only the paths of ti that are not in φ are taken into consideration while computing equation 6.2. Sentences are thus ranked in conformity with their novel paths with respect to previously selected sentences, while at the same time sentences carrying redundant information systematically downgrade their rating value. The highest ranked sentence is singled out after each iteration (lines 9-11), and its corresponding dependency paths are added to φ (line 18). If the highest ranked sentence meets the halting conditions, the extraction task finishes. Halting conditions ensure that no more sentences are left to be chosen and that no more candidate sentences embodying novel and reliable descriptive information are picked.

Unlike other techniques, which control the overlap at the word level [Hildebrandt et al., 2004, Chen et al., 2006, Han et al., 2006], the basic unit here is a dependency path, that is, a group of related words. Therefore, if a word comes from two distinct contexts, this approach interprets them differently. To be more precise, [Chen et al., 2006] measured the cosine similarity of a new answer candidate to each of the previously selected answers. A threshold then acted as the referee for determining whether or not the new putative answer was similar to any of these previously selected answers, and by the same token, whether or not it must be incorporated into the final output. The approach of [Han et al., 2006] is very akin to the technique of [Chen et al., 2006], but they benefited from a relative word overlap measure instead of the cosine, and they additionally took advantage of WordNet for eliminating answer candidates that share a synset with any of the already chosen answers. A final aspect of both strategies is that they detach the redundancy factor from their ranking scores; this means a putative answer can still be rewarded for the same combination of words or features that can be found within an already selected answer. On the contrary, algorithm 1 integrates redundancy as an ingredient in the ranking score by nullifying the contribution of the redundant content.

 1  φ = ∅;
 2  indHis = getContextIndicatorsHistogram(T);
 3  for highest to lowest frequent ι ∈ indHis do
 4      while true do
 5          nextSS = null;
 6          forall ti ∈ T do
 7              if indHis(ti) == ι then
 8                  rank = rank(ti, φ);
 9                  if nextSS == null or rank > rank(nextSS, φ) then
10                      nextSS = ti;
11                  end
12              end
13          end
14          if nextSS == null or rank(nextSS, φ) ≤ 0.005 then
15              break;
16          end
17          print nextSS;
18          addPaths(nextSS, φ);
19      end
20  end

Algorithm 1: ANSWER EXTRACTOR

Accordingly, candidate sentences become less relevant as their overlap with all previously selected sentences becomes larger. Thus, the method favours novel content, while at the same time it makes a global verification of the redundant content. The idea behind this technique is in the same spirit as other scoring methodologies, including the strategy of the best system in TREC 2006 [Kaisser et al., 2006] and the top-ten TREC 2007 system proposed by [Schlaefer et al., 2007] (for details, see section 4.12 on page 102). The crucial difference stems from the decay factors and the dependency paths: [Kaisser et al., 2006] systematically lowered the contribution of a term to the score of an answer candidate. This diminishment cohered with the number of times the word had already been subsumed in the set of previously selected answers, whereas algorithm 1 suppresses the contribution of the dependency paths already embraced in any of the previously chosen answers. The reason why [Kaisser et al., 2006, Schlaefer et al., 2007] needed the decay factors is that their procedure privileged terms that have a high correlation with the definiendum. These highly correlated words can, for this reason, be embodied in several distinct descriptions, and are inclined to belong to the top-ranked answers. Therefore, suppressing the contribution of these highly correlated terms after their first selection can mean that the remaining words of some unselected descriptive contexts no longer offer enough evidence to qualify as final answers, causing the definition QA system to miss these descriptions. In algorithm 1, dependency paths allay this issue by accounting for term combinations that convey fuller ideas, that is, it assesses the novelty of local contexts in place of isolated words.
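For concreteness, the following is a minimal Python rendering of the selection loop in algorithm 1. The candidate representation and the score callback are assumptions made for illustration; the callback would implement equation 6.2 restricted to paths not yet in φ (for instance, the rank function sketched in section 6.5.2).

from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

Path = Tuple[str, ...]
Candidate = Dict[str, object]   # keys: 'sentence', 'indicator', 'paths'

def extract_answers(candidates: List[Candidate],
                    score: Callable[..., float],
                    threshold: float = 0.005) -> List[Candidate]:
    """Iterate over context indicators from most to least frequent and greedily
    pick the best-scoring sentence whose paths are not already covered."""
    phi: Set[Path] = set()            # paths of previously chosen answers
    chosen: Set[int] = set()
    output: List[Candidate] = []
    ind_hist = Counter(c["indicator"] for c in candidates)

    for indicator, _ in ind_hist.most_common():       # frequent senses first
        while True:
            best_idx, best_score = None, float("-inf")
            for idx, cand in enumerate(candidates):
                if idx in chosen or cand["indicator"] != indicator:
                    continue
                s = score(cand["paths"], indicator, phi)   # rank(ti, phi)
                if s > best_score:
                    best_idx, best_score = idx, s
            if best_idx is None or best_score <= threshold:
                break                                      # halting condition
            chosen.add(best_idx)
            output.append(candidates[best_idx])
            phi.update(candidates[best_idx]["paths"])      # addPaths(nextSS, phi)
    return output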

6.5.4 Experiments and Results

In order to assess the context models, the 189 definition questions obtained from the TREC 2003-2004-2005 tracks were utilised. Since context models are aimed specifically at extracting answers from the Web, these TREC datasets were used solely as reference question sets. For empirical purposes, Wikipedia articles found to be related to these definienda were banned from the training material, and three baselines were implemented. The four systems were accordingly provided with the same array of sentences.

With regard to the testing sentences, they were gathered from web snippets. First of all, the rewriting strategy presented in section 2.6.2 (page 47) was considered as a means to boost the retrieval of descriptive phrases within surrogates. The advantage of this technique is that it rewrites the original query in accordance with the syntactic structure of the surface rules shown in table 3.1 (page 65). Essentially, it generates ten search queries that bias the search engine in favour of web snippets that can be aligned with these clues. In the experiments, MSN Search was used as an interface to the Internet. Each search query was aimed at fetching 30 web snippets, and hence for each question a maximum of 300 web snippets was retrieved. Second, all fetched text fragments were manually inspected, including those that mismatch the pre-defined battery of rules, in order to create a gold standard. All nuggets were equally weighted when computing the F(3)-Score, and the evaluation adhered to the most recent standard [Lin and Demner-Fushman, 2006]. It is worth pointing out here that there was no descriptive information for eleven questions of the TREC 2005 dataset.

The first baseline (BASELINE I) ranks candidate sentences in accordance with their likeness to the centroid vector [Yang et al., 2003, Cui et al., 2004b,c] (see section 4.8 on page 91). More specifically, final answers are singled out by using algorithm 1, and their respective words are added to φ; this way their contribution is nullified in the posterior iterations. Since the intention is to study strategies that are geared towards being independent of specific entries (e.g., "Danielle Steel" and "John Updike") in KBs, this centroid vector was inferred exclusively from all retrieved sentences bearing the definiendum. It is worth recalling here that:

(a) The impact of KBs (topic models) on the performance is well-known: the performance markedly improves or falls into a steep decline in agreement with the coverage yielded by the KBs for each particular definiendum [Zhang et al., 2005, Han et al., 2006]. This finding makes it interesting to study more robust methods that ignore this sort of information.

(b) As pointed out by [Cui et al., 2004c], words highly correlated with the definiendum at the sentence level are likely to indicate some of its pertinent facets, and consequently they can be utilised for rating answer candidates to definition queries. For instance, the best system in the TREC 2006 took advantage of word frequency counts across web snippets related to the definiendum as a dominant feature for ranking putative answers taken from the AQUAINT corpus [Kaisser et al., 2006] (see section 4.12 on page 102).

In BASELINE I, all sentences are seen as candidates. It can, ergo, identify descriptions from sentences that misalign the pre-determined definition patterns, unlike the three other systems, which are targeted chiefly at increasing the accuracy of pattern matching. In short, the primary objective of this baseline is to measure how much recall (how many nuggets) can be covered or inferred by means of redundancy. Here, redundancy is understood in terms of word correlation frequency counts.
In sharp contrast to this first baseline, the second one (BASELINE II) capitalises on the 1,900,642 preprocessed sentences harvested from Wikipedia abstracts. This baseline is predicated on the bi-term LMs adopted by [Chen et al., 2006] (see section 6.3). The difference between this baseline and the system presented by [Chen et al., 2006] lies in the fact that the LMs of the latter are inferred from the ordered centroid vector representation of sentences extracted from web snippets. This baseline, conversely, deduces the LMs from the array of 1,900,642 training sentences, where, like [Chen et al., 2006], training sentences are stemmed and their stop-words removed. Still, there are two key aspects to bear in mind about this difference:

(a) The ordered centroid representation maps a sentence to a sequence of its most describing (centroid) terms. This is very important when learning models from phrases emanating from non-authoritative sources, because this group of restricted words is very likely to elucidate some facets of the definiendum. Nevertheless, the centroid vector is computed for each particular definiendum, and having a threshold for the number of these centroid words is more suitable for technical or precise definienda (e.g., "SchadenFreude") than for ambiguous or biographical definienda (e.g., "Danielle Steel"), which need more words to describe their many works and facets. The training sentences, however, are independent from any particular definiendum. They match surface definition patterns, while at the same time they are distilled from an authoritative source, thereby ensuring to a great extent that they are actual descriptions.

(b) BASELINE I already ranks sentences in consonance with word correlation statistics deduced from web snippets. Hence, the goal is to compare three different models learnt from the same corpus and aimed at the same array of test sentences.

Like BASELINE I, this baseline then chooses sentences by means of algorithm 1, but candidate sentences are ranked in agreement with equation 6.3. That is, the unique difference between both baselines is the ranking function, which is grounded on sharply distinct models. Correspondingly, bi-terms embraced in selected answers are added to φ, and analogously, their contribution to the next iterations is cancelled. Accordingly, the mixture weight λ was empirically set to 0.72 by using the EM algorithm [Dempster et al., 1977] (see equation 4.9 on page 98). By the same token, Lref was experimentally set to fifteen words. In short, the intention behind this baseline is to test the performance of the LMs adopted by [Chen et al., 2006] against our test sentences when built on the training sentences.

The third baseline (BASELINE III) is also incorporated into the framework provided by algorithm 1. This baseline is constructed on top of word association norms [Church and Hanks, 1990]. These norms were computed from the same set of 1,900,642 preprocessed sentences distilled from Wikipedia abstracts, and they comprise pairs I2 and triplets I3 (equations 5.1 and 5.2, both on page 131) of ordered words, as seen in table 5.12 (page 131). Sentences are subsequently ranked in agreement with the sum of the matching norms, normalised by dividing them by the highest matching value. Word association norms compare the probability of observing w2 followed by w1 within a fixed window of ten words with the probabilities of observing w1 and w2 independently. They supply a methodology that is the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the professor/student type to lexico-syntactic co-occurrence constraints between verbs and prepositions (e.g., written/by) [Church and Hanks, 1990]. For this reason, BASELINE III offers a good starting point for measuring the contribution of dependency-based context LMs. Since these three baselines do not account for context indicators, every sentence is assumed to have the same context indicator. All in all, these baselines supply distinct ways of deriving, at different levels, lexico-syntactic and semantic relations that typify descriptions of the definiendum, and of exploiting them for rating answers to definition questions afterwards.

Table 6.7 shows the figures achieved by the three baselines and the context models for the three test query sets. Broadly speaking, BASELINE III outperformed the other two baselines in all sets, and BASELINE II finished with better results than the first baseline. In terms of F(3)-Score, context models surpassed BASELINE III by 5.22% and 11.90% for the TREC 2003 and 2004 datasets, respectively. To state it more clearly, the outcome was bettered for 81 (71.05%) out of 114 questions. These improvements are mainly due to definienda such as "Allen Iverson" and "Heaven's Gate" as well as "Bashar Assad".

                    TREC 2003     TREC 2004     TREC 2005
Size                50            64            (64)/75
Baseline I
  Recall            0.27±0.23     0.27±0.16     0.24±0.17
  Precision         0.20±0.19     0.20±0.19     0.18±0.23
  F(3)-Score        0.24±0.18     0.25±0.15     0.22±0.16
Baseline II
  Recall            0.45±0.18     0.40±0.17     0.38±0.19
  Precision         0.28±0.22     0.19±0.11     0.21±0.17
  F(3)-Score        0.40±0.15     0.34±0.14     0.33±0.15
Baseline III
  Recall            0.52±0.18     0.47±0.13     0.49±0.20
  Precision         0.27±0.14     0.26±0.11     0.29±0.24
  F(3)-Score        0.46±0.14     0.42±0.11     0.43±0.17
Context Models
  Recall            0.57±0.17     0.50±0.18     0.42±0.22
  Precision         0.39±0.21     0.40±0.19     0.29±0.21
  F(3)-Score        0.53±0.15     0.47±0.17     0.38±0.19

Table 6.7: Results for TREC question sets.

On the other hand, in 32 (28.07%) of these 114 questions, the performance deteriorated; this happened, for instance, for definienda such as "Rhodes Scholars" and "Albert Ghiorso".

In terms of recall, the average rose from 0.52 to 0.57 (9.6%) for the TREC 2003 dataset, and by 6.4% for the TREC 2004 dataset. In particular, definienda such as "Jennifer Capriatti" and "Heaven's Gate" resulted in significant recall improvements, whereas "Abercrombie and Fitch" and "Chester Nimitz" dropped suddenly. A crucial factor behind the improvement in performance eventuates from the privilege given to sentences belonging to prominent contexts. To exemplify, the clusters relating to the contexts "cult" and "religion" contain twelve and nine sentences, respectively, of which four and two were selected at the top of the ranking. All of these six selected sentences were actual definitions. It was observed that prioritising putative answers within prominent contexts cooperates on singling out some novel answers that were not selected by BASELINE III, because of its preference for some misleading answer candidates that obtained a higher score than these genuine definitions. These spurious sentences are usually, but not exclusively, connected with low-frequency contexts. Interestingly, this improvement is obtained by enriching the selection algorithm with inferences drawn from the global context (all answer candidates), instead of solely using the attributes of each sentence for rating. Accordingly, a good example of answers chosen in concert with predominant contexts can be seen in table 6.8.

As previously noted, the performance diminished for 32 questions. In 26 out of these 32 cases, it was observed that the recall lessened by more than 10%, effectuating a significant drop in the F(3)-Score. In order to qualify for the final output, an answer candidate must obtain a relatively high ranking value. The following are the three determining aspects integrated by the ranking strategy utilised by context models: (a) the probability P(cs) of its context indicator, (b) the frequency of its context indicator across answer candidates, and (c) the evidence brought forth by the paths that constitute its description, meaning the sum of the respective P(dp | cs) values.

Tale of Genji

Table 6.8: Sample output sentences regarding “Tale of Genji" (source [Figueroa and Atkinson, 2009]).

Certainly, a harmony exists between P(cs) and the coverage of the corresponding context: if P(cs) is high, the respective context supplies ampler coverage; if it is low, the coverage becomes narrower. Consequently, answer candidates matching a context that yields narrower coverage are more unlikely to be subsumed in the final output; in particular, whenever few putative answers are retrieved from the Web and the coverage supplied for their respective predominant contexts is limited. BASELINE III, oppositely, profits from statistics derived from the entire corpus, alleviating the data sparseness problem of some contexts with restricted coverage. This data sparseness can, nonetheless, be remedied by extending the treebank, and ergo the context models, with extra snapshots of Wikipedia and short definitions collected from glossaries across documents on the Internet. These glossaries can automatically be extracted by identifying regularities in their layouts: tables, alphabetically sorted entries, etc. Additionally, both techniques can be put together, since the system knows which contexts are more trustworthy and prominent (across the fetched web snippets) than others. It is worth noting that this difference in coverage also explains the slight growth in dispersion.

Another reason for the worsening of the recall stems from the fact that some nuggets were missed due to the ungrammaticality of some web snippets. While the search procedure biases the output in favour of longer sentences [Figueroa and Neumann, 2007] (see table 5.6 on page 127), short and truncated sentences can still be fetched. These truncations can eventually cause problems when obtaining the lexicalised dependency tree, missing some interesting nuggets. Here lies an advantage of preferring surface statistics, such as word norms, over methods, such as context LMs, that incorporate Natural Language Processing (NLP). In summary, context models strengthened the recall for 50% of the TREC 2003-2004 definienda.

Incidentally, context models also achieved higher precision for two datasets. In the case of TREC 2003, the increment was 44.44%, whereas it was 53.84% for the TREC 2004 question set. In other words, context models were capable of filtering out a larger amount of sentences that did not render descriptions, while at the same time boosting the recall. Given these outcomes, [Figueroa and Atkinson, 2009] concluded that these pieces of information were characterised by regularities in their contextual dependency paths, wherewith the accuracy of pattern matching was bettered. This finding is also ratified by the baselines: as more lexico-syntactic information was incorporated, the performance was enhanced in terms of both precision and recall.

As a means to understand some of the causes that still hurt the precision, the responses to the different test questions were inspected. In fact, it was discovered that there are some misleading descriptions that pose a tough, but interesting, challenge to definition QA systems. Consider the following four sentences incorporated into the output of "Jean Harlow":

The second and third cases elucidate the roles ("Mona Leslie" and "Crystal Wetherby") portrayed by "Jean Harlow" in two different movies. A definition QA system normally takes advantage of the parentheses pattern to align descriptions that include aliases of the definiendum (e.g., "Abbreviation (organisation) is a/an/the"). In these two cases, the roles are identified as if they were "Jean Harlow" herself. In light of this observation, it can be deemed that the usage of a particular rule can be more appropriate in one context than in others. Homologously, the clause (e.g., " became ") captures good nuggets when dealing with contexts such as artists and sports, but it was inclined to be noisy when tackling contexts such as organisations and events. In the first and fourth descriptions, the definiendum replaces the name of the character in the movie, which is the actual concept being explained in the phrase. Of course, drawing this class of distinction or disambiguation would require deeper reasoning and understanding of the context, and it is uncertain whether this can be resolved with a limited number of sentences and/or text snippets.

Another fertile source of spurious answers is superlatives. The studies carried out by [Kaisser et al., 2006, Razmara and Kosseim, 2007, Scheible, 2007] (section 4.4 on page 83) unveiled that superlatives are useful for acquiring interesting descriptive nuggets from a collection of news documents. In juxtaposition, the Web is abundant in opinions and advertisements, which are highly likely to match superlatives and consequently to convey the mistaken impression of actual descriptions: " was/is the best man/player/group/band in the world/NBA" and " is the best alternative to ....". It was also observed that the reason why these spurious sentences were singled out was two-fold: (a) these overmatched superlatives were normally included into the group of sentences belonging to the predominant sense (e.g., "tenor", "singer", "band" and "actor"), and (b) these misleading sentences obtain a relatively high score due partly to paths like → , → , → . It is worth remarking, nonetheless, that this class of path can frequently be found across definitions. To illustrate, take a band that won a prize for being "the best band" in a particular genre, country and year. Certainly, superlatives still play a pivotal role in definitions:


On all sides, the evidence points to the need for deeper context analysis in order to separate the wheat from the chaff. More precisely, one can envision that a set of (anti-)context models encompassing negative examples would provide invaluable aid in recognising these misleading sentences.

With regard to BASELINE I, it is worth emphasising that the achieved results are comparable to the outcome obtained by [Zhang et al., 2005], in which they assessed a system that did not account for online specific resources. Unlike [Zhang et al., 2005] and the trends in TREC, context models do not make use of direct entities (e.g., "Danielle Steel") in order to discover relevant contextual information to be projected into the AQUAINT corpus. Instead, they learn contextual models which are used for discriminating answers directly from their context (i.e., with no projection into a target corpus). Additionally, the comparison between this baseline and BASELINE II shows the positive effect of the acquired corpus on the performance.

As for TREC 2005, context models finished with a lower recall and F(3)-Score. A closer look at the achieved results shows that context models enhanced the performance in 37 (57.81%) out of the 64 questions, while in 24 (37.5%) cases the performance deteriorated. A key point is that in six of these 24 cases, context models obtained a recall of zero, so the F(3)-Score values became zero and eventually brought about a significant drop in the average score. Three of these six questions correspond to definienda such as "Rose Crumb" and "1980 Mount St. Helens eruption" as well as "Crash of EgyptAir Flight 990". Some issues common to these six scenarios were also observed. Firstly, few nuggets were found within the fetched surrogates. Secondly, these text fragments had a low frequency, hence whenever context models missed any or all of them, the performance suffered. This situation becomes serious when the nuggets are uttered in contexts that are very unlikely to be in the context models. To measure the impact of these six cases, the average F(3)-Score was compared by accounting solely for the other 58 questions: 0.43 for context models, and 0.41 for the third baseline.

Since the F(3)-Score does not assess the precision of the ranking order, the Mean Average Precision (MAP) of the top one and top five ranked sentences was computed (see table 6.9) [Manning and Schütze, 1999]. MAP scores reveal that context models effectively contribute to improving the ranking of the sentences. The figures presented in table 6.9 not only show that context models outperform the other three strategies in MAP terms, but also that they finished with a higher precision in ranking, placing a valid definition at the top in 80% of the cases. These achievements result from the bias of the ranking in favour of descriptive sentences that meet a combination of the following criteria:

1. The top ranked sentences share more lexico-syntactic similarities with descriptive sentences in Wikipedia abstracts, and therefore have access to more privileged positions in the ranking. As a logical consequence, this improvement in ranking signifies an increment in the accuracy of the matching surface patterns. For the purpose of investigating this, the ratio of selected sentences to all fetched sentences that align with the pre-determined battery of definition patterns was calculated. Table 6.10 highlights this improvement. It is worth recalling that BASELINE I also chooses answer candidates that mismatch definition patterns; therein lies its exclusion from this comparison. In short, the figures shown in this table, along with the respective outcomes in terms of F(3)-Score, precision and recall, indicate an increase in pattern matching accuracy.

                  Baseline I   Baseline II   Baseline III   Context Models
TREC 2003  MAP-1  0.16         0.56          0.64           0.82
           MAP-5  0.21         0.57          0.64           0.82
TREC 2004  MAP-1  0.27         0.67          0.66           0.88
           MAP-5  0.25         0.59          0.62           0.82
TREC 2005  MAP-1  0.18         0.58          0.77           0.79
           MAP-5  0.24         0.53          0.70           0.77

Table 6.9: MEAN AVERAGE PRECISION.

                  TREC 2003     TREC 2004     TREC 2005
Baseline II       0.23 ± 0.13   0.22 ± 0.06   0.27 ± 0.19
Baseline III      0.27 ± 0.12   0.26 ± 0.07   0.32 ± 0.18
Context Models    0.23 ± 0.07   0.21 ± 0.06   0.18 ± 0.1

Table 6.10: Average ratio of selected to fetched sentences (matching definition patterns only).

2. More often than not, the highest ranked answers correspond to predominant, and by the same token more reliable, potential senses, thus making it more likely that they verbalise a description of the definiendum. To illustrate this, table 6.11 shows the most pertinent context indicators with relation to four different queries, including the example shown in table 6.8.

Tale of Genji Teapot Dome Scandal George Foreman Chunnel Indicator Frq. Indicator Frq. Indicator Frq. Indicator Frq. book 2 example 2 boxer 3 film 5 novel 12 issue 6 champion 17 train 3 story 4 place 3 heavyweight 2 tunnel 9 work 4 scandal 11 hitter 2 study 2 victory 8 medalist 2 product 2 man 8 popular 2 minister 2 matter 2 symbol 5

Table 6.11: Sample of relevant context indicators for some TREC definition queries.

Unlike TREC systems, the three baselines and the context models were evaluated by using sentences collected from the Internet. While the approach took advantage of sophisticated search engines, these are not optimised for QA tasks; in fact, this is the reason the model is required to capitalise on the purpose-built search strategy presented in section 2.6.2 (page 47). In addition, many TREC systems benefit from off-line processing of the AQUAINT corpus for the purpose of boosting the performance [Hildebrandt et al., 2004, Fernandes, 2004] (see sections 2.1 and 4.2 on pages 25 and 76), so that when scoring, extra features (i.e., entities) are used to recognise definitions. Instead, context models rank by accounting almost exclusively for the lexico-syntactic and semantic similarities to previously known definitions that describe other instances of the same kind of definiendum. The additional knowledge used when scoring is the frequency of the context indicators, which aids the model in ranking frequent potential senses, and more trustworthy sentences, higher. Experiments thus showed that dependency paths supply key lexico-semantic and syntactic information that typifies definitions at the sentence level.

The use of relations between a group of words, in place of isolated terms, for ranking sentences also ensures a certain level of grammaticality in the candidate answers. Since web snippets are often truncated by search engines, relations allow singling out truncated sentences that are more plausible to convey a complete idea than others. This also leads to missing some relevant nuggets. On the other hand, two different dependency paths can yield the same descriptive information, materialising an increment in redundancy. A clear case of this is provided in table 6.12, which is an excerpt from the output concerning the definiendum "Teapot Dome Scandal". Basically, this output verbalises the next four ideas repeatedly:

• under Harding presidency
• in 1920s
• involved government oil fields
• major issue in 1924 election

In this example, the following three paths put into words the same fact about this definiendum:

→ → → →

→ → → →

→→ →

Indeed, only three of the eight sentences would be enough to cover all these aspects. Other techniques to detect redundancy can be developed by recognising analogous dependency paths [Chiu et al., 2007]. This points to a key advantage of using dependency paths for answering definition questions. A TREC system, nevertheless, can find this redundancy very profitable when projecting the output into the AQUAINT corpus.
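The following minimal sketch (in Python, not taken from the thesis) illustrates the general idea of filtering redundant candidates by comparing their sets of dependency paths; the overlap measure and the threshold are assumptions for illustration, not the technique of [Chiu et al., 2007].

```python
# Hypothetical sketch: flagging redundant candidates by the overlap of their
# dependency-path sets. This only illustrates the idea of recognising
# analogous paths; it is not the thesis's own redundancy-detection method.

def path_overlap(paths_a: set, paths_b: set) -> float:
    """Jaccard overlap between two sets of (stringified) dependency paths."""
    if not paths_a or not paths_b:
        return 0.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)

def drop_redundant(ranked_candidates, threshold=0.6):
    """Keep a sentence only if its paths do not largely repeat an earlier one.
    `ranked_candidates` is assumed to be a best-first list of (sentence, paths)."""
    selected = []
    for sentence, paths in ranked_candidates:
        if all(path_overlap(paths, p) < threshold for _, p in selected):
            selected.append((sentence, paths))
    return [s for s, _ in selected]
```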

6.5.5 Expanding the Treebank of Descriptive Sentences

In [Figueroa and Atkinson, 2010], the context models were extended by taking into consideration two extra snapshots of Wikipedia: one corresponding to January 2007, and the other to October 2008. The intention was to investigate whether or not the coverage can be widened by accounting for these extra snapshots.

Following the same previously outlined procedure, two additional treebanks of dependency trees were built, and hence two extra sets of contextual n-gram LMs were generated. The ranking of a candidate sentence S (equation 6.2) was computed by making allowances for the average values of p(cs) and p(dp | cs).

Accordingly, table 6.13 highlights the figures obtained for the two extensions, accounting for two and three treebanks, respectively. By and large, the performance was weakened in terms of recall and precision. The gradual decrease in recall may be due to the averaging over two or three treebanks, which diminishes the value of low-frequency paths, as they are not (significantly) present in all the treebanks. Ergo, whenever they match a sentence, the sentence is less likely to score high enough to surpass the experimental threshold (line 14 in algorithm 1).

Teapot Dome Scandal

Table 6.12: Sample containing issues regarding performance (source: [Figueroa and Atkinson, 2009]).

Here, some approaches could be used to exploit inter-treebank smoothing as a means of taking probability mass away from the high-frequency paths (across treebanks) and distributing it across paths that are low in frequency in one of the treebanks but absent from one of the others [Chen and Goodman, 1998]. The steady reduction in precision might stem from the following reasons:

• A drop in recall brings about a reduction in the length allowance [Voorhees, 2003].

• Algorithm 1 selected misleading or redundant definitions in place of the definitions matched by the original system but missed by the two extensions.

On the other hand, highly frequent paths produce more robust estimates, as they are very likely to be present in all treebanks, which has a positive effect on the ranking, as seen in table 6.14. In all question sets, these two extensions outperformed the systems in table 6.9. The growth in MAP values suggests that integrating approximations from various snapshots of Wikipedia helps determine more pertinent and genuine paths. These estimates, along with the preference given by algorithm 1 to these paths, bring about the improvement in the final ranking. As a consequence, more genuine pieces of descriptive information tend to be conveyed in the highest positions of the ranking.
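A minimal sketch of how this treebank-averaged ranking could look is given below; the per-snapshot model objects and their methods (prob_cs, prob_path) are hypothetical names introduced for illustration, assuming each snapshot exposes the estimates p(cs) and p(dp | cs) that enter equation 6.2.

```python
# Hypothetical sketch of the treebank-expansion scoring: the score of a
# candidate is built from the *average* of p(cs) and p(path | cs) over the
# per-snapshot contextual language models.

def averaged_rank(candidate_paths, cs, snapshot_models):
    """Score a candidate sentence for context indicator `cs` by averaging
    the estimates produced by each snapshot's contextual LM."""
    n = len(snapshot_models)
    score = sum(m.prob_cs(cs) for m in snapshot_models) / n
    for path in candidate_paths:
        # Low-frequency paths absent from some snapshots drag the average down,
        # which is one explanation offered above for the drop in recall.
        score *= sum(m.prob_path(path, cs) for m in snapshot_models) / n
    return score
```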

6.5.6 Adding Part-of-Speech Tag Knowledge

Tables 6.15 and 6.16 present the outcomes obtained by enriching the context models with Part-of-Speech (POS) tags. These context models were constructed on top of the original models, but they account for a treebank in which words labelled with the tags below are

mapped into a placeholder:

Additionally, the following verbs, which are normally used for discovering definitions, are mapped into a placeholder: is, are, was, were, become, becomes, became, had, has and have.

                               TREC 2003     TREC 2004     TREC 2005
Context Models      Recall     0.57 ± 0.17   0.50 ± 0.18   0.42 ± 0.22
                    Precision  0.39 ± 0.21   0.40 ± 0.19   0.29 ± 0.21
                    F(3)-Score 0.53 ± 0.15   0.47 ± 0.17   0.38 ± 0.19
Context Models II   Recall     0.46 ± 0.17   0.46 ± 0.17   0.42 ± 0.22
                    Precision  0.32 ± 0.19   0.38 ± 0.19   0.29 ± 0.20
                    F(3)-Score 0.43 ± 0.16   0.44 ± 0.15   0.38 ± 0.19
Context Models III  Recall     0.46 ± 0.17   0.44 ± 0.18   0.41 ± 0.21
                    Precision  0.31 ± 0.17   0.34 ± 0.17   0.28 ± 0.19
                    F(3)-Score 0.43 ± 0.15   0.42 ± 0.16   0.37 ± 0.18

Table 6.13: Figures for TREC question sets (treebank expansion).

                    Context Models   Context Models II   Context Models III
TREC 2003   MAP-1   0.82             0.88                0.88
            MAP-5   0.82             0.88                0.87
TREC 2004   MAP-1   0.88             0.92                0.94
            MAP-5   0.82             0.88                0.87
TREC 2005   MAP-1   0.79             0.81                0.82
            MAP-5   0.77             0.78                0.78

Table 6.14: MAP (treebank expansion).

The aim of these mappings is to consolidate the probability mass of similar paths when computing context LMs. For example, the following paths:

→ →

→ →

→ →

are merged into: → . A more specific example can be seen in figures 6.1 and 6.2, which parallel both models (with and without POS tagging) for a small context. The idea behind this amalgamation is in the same spirit as the selective substitution introduced by [Cui et al., 2004a, 2007] (section 4.9 on page 93), and it is supported by the fact that some descriptive phrases, including "Concept was an American author..." and "Concept is a British author...", share some common structure that is very likely to render definitions; consolidating their probability mass is therefore reasonable. This certainly helps to tackle the data-sparseness of the context models. In particular, figures 6.1 and 6.2 show the way the paths → and → are fused into → . Naturally, the new likelihood is the sum of both original paths.
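A minimal sketch of this selective substitution is shown below. The exact POS tag list mapped to placeholders is not reproduced in this excerpt, so the set used here (and the placeholder naming) is an assumption for illustration; the list of definition verbs is the one quoted above.

```python
# Hypothetical sketch of the "selective substitution" step: tokens carrying
# certain POS tags, as well as the copular/definition verbs listed above, are
# mapped to placeholders before the contextual n-gram LMs are estimated, so
# that paths differing only in those tokens share their probability mass.

PLACEHOLDER_TAGS = {"NNP", "NNPS", "CD", "JJ"}   # assumption, not the thesis's exact list
DEFINITION_VERBS = {"is", "are", "was", "were", "become", "becomes",
                    "became", "had", "has", "have"}

def substitute(tagged_path):
    """Map a POS-tagged dependency path (a list of (word, tag) pairs)
    to its placeholder form."""
    out = []
    for word, tag in tagged_path:
        if word.lower() in DEFINITION_VERBS:
            out.append("#VERB#")
        elif tag in PLACEHOLDER_TAGS:
            out.append(f"#{tag}#")
        else:
            out.append(word.lower())
    return tuple(out)

# Example: "was an American author" and "is a British author" collapse to the
# same placeholder form, so their counts (hence probability mass) are summed.
```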

                    Context Models   Context Models + POS
TREC 2003   MAP-1   0.82             0.88
            MAP-5   0.82             0.88
TREC 2004   MAP-1   0.88             0.91
            MAP-5   0.82             0.87
TREC 2005   MAP-1   0.79             0.73
            MAP-5   0.77             0.71

Table 6.15: MAP (POS Tagging).

Figure 6.1: Bigram raw probabilities for cs = “co-lyricist".

Table 6.15 highlights the figures achieved by this strategy when juxtaposed with the original model. In general, the three extensions outperformed the ranking obtained with the original context models (see table 6.14), although the experiments did not show a clear distinction as to which extension is the best in this respect. In the particular case of the POS-based method, results indicate an increase with respect to the original system for two datasets, but a decline for the TREC 2005 question set. Unlike the two previous question sets, abstracting some syntactic categories led some spurious sentences to rank higher.

More interestingly, table 6.16 emphasises the marked decline in terms of F(3)-Score for two datasets, while showing a substantial improvement for the TREC 2005 question set when compared with the figures achieved by the original model. This improvement is due particularly to the growth in recall, which means the amalgamation of dependency paths helped to identify a larger number of genuine descriptive sentences. On the other hand, the addition of POS tag knowledge also led to matching more misleading and spurious sentences, and as a repercussion, it worsened the performance in terms of precision. This might also explain the decrease in the MAP value for this question set. From these observations, it follows that the treebanks without POS tags cover fewer of the descriptive sentences embraced in this question set. In the TREC 2003-2004 question sets, the decline might result from the fact that some original paths are indispensable for recognising several sentences.

                                TREC 2003     TREC 2004     TREC 2005
Context Models       Recall     0.57 ± 0.17   0.50 ± 0.18   0.42 ± 0.22
                     Precision  0.39 ± 0.21   0.40 ± 0.19   0.29 ± 0.21
                     F(3)-Score 0.53 ± 0.15   0.47 ± 0.17   0.38 ± 0.19
Context Models + POS Recall     0.56 ± 0.16   0.47 ± 0.15   0.48 ± 0.21
                     Precision  0.24 ± 0.14   0.22 ± 0.09   0.24 ± 0.19
                     F(3)-Score 0.48 ± 0.15   0.41 ± 0.12   0.42 ± 0.18

Table 6.16: Figures for TREC question sets (POS Tagging)

Figure 6.2: Bigram raw probabilities for cs = "co-lyricist" accounting for POS tagging.

6.5.7 Improving Definition Extraction using Contextual Entities

The context models introduced so far have been built on the assumption that entities themselves do not provide meaningful information for recognising extra definitions. There are two legitimate reasons to assume this:

1. Replacing entities by a placeholder allows getting reliable dependency path counts, and consequently, more trustworthy probability estimates.

2. Lexicalised dependency paths consolidate almost all the information about the type of relationship between the definiendum and some common type of relation in the respective context. For instance, the next paths with respect to the context "novel"

underline that the implicit type of entity is a writer:

→ → → →

→→ → →

→ → → →

Hence, independently of what this entity really is, it is very likely to be a writer. This sort of method prevents context models from being reliant on any previously annotated or automatically extracted list of writers. However, a major drawback is that some prominent entities within contexts could be useful in identifying extra descriptive content. To reinforce

this point, consider the following descriptive sentences harvested from Wikipedia:

Hence, one would think that whenever a sentence has the context indicator "novel" along with the entity "Danielle Steel", this sentence is very likely to verbalise descriptive information (i.e., a description expressed by paths not previously learnt). In order to infer these relevant contextual entities, all pairs ⟨context indicator, entity⟩ were extracted and their respective mutual information H was subsequently computed. It is worth pointing out that only pairs having a frequency higher than two were taken into account, and pairs with H ≤ 0 were discarded. Table 6.17 stresses some relevant pairs in which capitalised adjectives were perceived as entities when constructing the treebanks.
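The sketch below illustrates this filtering step, assuming a standard pointwise mutual information estimate over ⟨context indicator, entity⟩ co-occurrence counts; the thesis does not spell out the exact MI variant, so the formula and the function names are assumptions.

```python
# Hypothetical sketch of the mutual-information filter over (context indicator,
# entity) pairs: pairs seen more than twice are kept, and those with a
# non-positive score are discarded.
import math
from collections import Counter

def relevant_entity_pairs(observations):
    """`observations` is an iterable of (context_indicator, entity) tuples,
    one per anonymised descriptive sentence."""
    pair_freq = Counter(observations)
    cs_freq = Counter(cs for cs, _ in observations)
    ent_freq = Counter(e for _, e in observations)
    total = sum(pair_freq.values())

    scored = {}
    for (cs, ent), f in pair_freq.items():
        if f <= 2:                       # frequency threshold from the text
            continue
        h = math.log2((f / total) /
                      ((cs_freq[cs] / total) * (ent_freq[ent] / total)))
        if h > 0:                        # pairs with H <= 0 are discarded
            scored[(cs, ent)] = h
    return scored
```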

Entity                        H         H + Alias Resolution
Agatha Christie               3.16423   4.62592
BBC Books                     7.63855   8.99078
Baen Books                    6.09107   8.10806
Ballantine Books              5.79151   8.6317
Bantam Books                  5.41299   7.35776
Black Spring Press            7.05454   8.48454
British Nobel                 7.79151   8.99537
DAW Books                     6.62158   8.40758
Danielle Steel                7.69197   8.97289
Dell Publishing               7.05454   7.7426
Fyodor Dostoevsky             5.84397   7.35776
G. K. Chesterton              5.20654   7.85957
Hermann Hesse                 5.79151   7.59002
Oxford University Press       3.1768    5.28791
Prize-winning                 4.77358   6.89871
Pulitzer                      3.60168   6.01041
Pulitzer Prize for Fiction    6.20654   7.52903
Science Fiction               4.80544   7.35314

Table 6.17: Some prominent entities within the context cs=“novel".

In order to assess the impact of disregarding entities when building context models and thus rating candidate sentences, a benchmark was carried out in which all sentences that were not selected by algorithm 1 were checked as to whether they matched any pair ⟨cs, entity⟩. If any matched, the corresponding sentence was added to the end of the output; ergo, the previous order of the top ranked sentences remains unmodified. This is premised on the observation that if entities were important for recognising novel information, the method would improve the overall recall. This single strategy enlarged the output for 19 and 35 out of the 50 and 64 TREC 2003 and 2005 questions, respectively. No novel nuggets, however, were discovered. A slightly different picture is seen for the TREC 2004 question set, in which the response was extended for 24 out of the 64 questions, but in four cases the recall was raised.

This means that this loose match helped to identify novel descriptive information with a better overall F(3)-Score (see table 6.18).
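A minimal sketch of this output-extension benchmark follows; relevant_entity_pairs refers to the hypothetical scorer sketched above, and the data layout is an assumption.

```python
# Hypothetical sketch of the output-extension benchmark: sentences rejected by
# algorithm 1 are appended to the end of the answer (preserving the original
# order) whenever they contain a relevant (context indicator, entity) pair.

def extend_output(selected, rejected, relevant_pairs):
    """`selected`/`rejected`: lists of (sentence, cs, entities) triples;
    `relevant_pairs`: dict keyed by (cs, entity), as returned above."""
    extension = [
        sentence
        for sentence, cs, entities in rejected
        if any((cs, e) in relevant_pairs for e in entities)
    ]
    # The top-ranked sentences stay untouched; matches only lengthen the answer.
    return [s for s, _, _ in selected] + extension
```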

Definiendum             Recall (R)    Precision (P)   F(3)-Score
Franz Kafka             0.47 → 0.50   0.51 → 0.54     0.47 → 0.50
Abercrombie and Fitch   0.14 → 0.32   0.24 → 0.53     0.14 → 0.33
Jack Welch              0.35 → 0.38   0.48 → 0.49     0.36 → 0.39
Chester Nimitz          0.18 → 0.23   0.18 → 0.22     0.18 → 0.23

Table 6.18: Performance improvement based on entities.

From these figures, entities were found to play no essential role in context models. Furthermore, entities are usually written in several ways. For instance, a person name such as "George Bush" can be found as "George W. Bush", "G. W. Bush", or "President Bush". These variations make it difficult to match entities in the models with entities in the target array of sentences, and they additionally make it difficult to learn accurate distributions from the training data. For these reasons, an alias resolution step is necessary when learning entities and rating new sentences. For this purpose, the alias repository presented in [Figueroa, 2008a] (section 2.6.1 on page 45) was utilised. Two entities in a context were deemed to be the same whenever there was an entry in this repository that maps one entity to the other. In every match, the entries in relation to both entities are removed, the frequencies of the matching entities are consolidated, and one new entry is created embodying both entities and the new amalgamated frequency value. This algorithm is applied iteratively until no entry exists in the repository for any pair of entities in the context. The outcome of this iterative process is a set of entries, each containing a set of entities and their consolidated frequency.

Accordingly, the values of H were recomputed by considering entries and contexts. Table 6.17 outlines some illustrative variations in the values of H for some entities in the "novel" context. Merging aliases is geared towards augmenting the number of reliable entities in each context by clustering low- and high-frequency aliases. For instance, the number of entities in the context "novel" grew from 1,364 to 2,814 (e.g., "Alternative Metal", "Elecktra Records" and "Ferret Records"), "singer" from 411 to 678 (e.g., "BBC", "Latin Grammy Awards" and "Yngwie J. Malmsteem/Yngwie Malmsteem"), "language" from 443 to 579 (e.g., "Haiti", "Iberia" and "Punjabi"), and "band" from 974 to 1,389 (e.g., "Soho Press", "Lancer Books" and "Marina Lewycki"). Despite this increment, a noticeable improvement was not observed.
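The iterative merging procedure can be sketched as follows; the repository is modelled simply as a set of alias pairs, and all names here are hypothetical.

```python
# Hypothetical sketch of the iterative alias-merging step: entity entries within
# one context are merged whenever the alias repository links them, their
# frequencies are consolidated, and the process repeats until a fixpoint.

def merge_aliases(entity_freq, alias_pairs):
    """entity_freq: dict entity -> frequency within one context.
    alias_pairs: set of frozensets {alias_a, alias_b} from the repository."""
    entries = [({e}, f) for e, f in entity_freq.items()]
    merged = True
    while merged:                                   # iterate until no merge applies
        merged = False
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                names_i, freq_i = entries[i]
                names_j, freq_j = entries[j]
                if any(frozenset((a, b)) in alias_pairs
                       for a in names_i for b in names_j):
                    # consolidate both entries into a single new entry
                    entries[i] = (names_i | names_j, freq_i + freq_j)
                    del entries[j]
                    merged = True
                    break
            if merged:
                break
    return entries
```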

6.5.8 Relatedness/Similarities Between Contexts

For the purpose of examining the relatedness and/or similarity between pairs of context indicators, a matrix M was built. Specifically, each cell Mij in this matrix denotes the frequency of the path i in the context j. The dimensions of this matrix comprise 45,698 different context indicators and 26,490,042 different n-gram paths. As a means of strengthening the relationships between contexts j and their delineative paths, the value of each cell Mij is rewritten as the mutual information value between the path i and the context j. This measure lowers the effect of paths that are highly frequent in many contexts, while it comparatively strengthens paths that are highly frequent in few contexts, and thus more representative of those few contexts. This measure can also assign negative values to some paths in certain contexts; these paths can be interpreted as a signal of weak relations. This matrix is subsequently

utilised for computing the cosine similarity cosine(cs1, cs2) between two contexts cs1 and cs2. It is worth noting that determining the similarity of all pairs of contexts demanded ten days of running time on a four-CPU server.
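A minimal sketch of this computation, assuming a sparse path-by-context frequency matrix and a standard pointwise mutual information re-weighting, is given below; the concrete libraries and function names are assumptions, not the thesis's implementation.

```python
# Hypothetical sketch: re-weight the path-by-context frequency matrix with
# mutual information, then compare context columns with the cosine.
import numpy as np
from scipy import sparse

def pmi_weight(freq: sparse.csr_matrix) -> sparse.csr_matrix:
    """freq[i, j] holds the frequency of path i in context j."""
    total = freq.sum()
    path_marg = np.asarray(freq.sum(axis=1)).ravel() / total
    ctx_marg = np.asarray(freq.sum(axis=0)).ravel() / total
    w = freq.tocoo(copy=True).astype(np.float64)
    w.data = np.log2((w.data / total) /
                     (path_marg[w.row] * ctx_marg[w.col]))
    return w.tocsr()

def context_cosine(m: sparse.csr_matrix, j1: int, j2: int) -> float:
    a, b = m[:, j1], m[:, j2]
    denom = np.sqrt(a.multiply(a).sum()) * np.sqrt(b.multiply(b).sum())
    return float(a.multiply(b).sum() / denom) if denom else 0.0
```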

Figure 6.4 illustrates the strongest relations between the highest-frequent context indicators listed in table 6.3. This graph presents an isolated cluster of ten contexts (e.g., area, city, community, district and town) that can imply the description of physical places. Interestingly enough, the contexts station and town share some paths: → → and → → , while at the same time, the context station encircles paths that are very unlikely to be found in descriptions of towns:

→ → →

→ → →

Certainly, the degree of similarity is naturally given by the overlap between both contexts, and in this scenario, transitivity must be carefully handled. For instance, the cosine drew a value of 0.28 for the pair district ↔ town, and 0.22 for the pair town ↔ center, but the pair district ↔ center obtained a lower value (0.159). This can be conceived as a result of likening two more specific contexts (center and district) with a more general or abstract context (town). Substantially, the semantic range of the more specific contexts is included within, or is part of, the more general context. Therefore, descriptions of instances of specific contexts are likely to include some aspects related to their respective more general context(s), causing a greater similarity between the specific and general contexts. However, each of the most specific contexts can emphasise radically different aspects of the general context. This, along with the fact that the individualising facets of the specific contexts are pertinent when describing them, brings about a greater dissimilarity between the semantic extensions. Another good example arises when comparing artist ↔ painter (0.2385) and artist ↔ singer (0.2045) as well as painter ↔ singer (0.1061).

On the other hand, the cosine gave a value of 0.208 for the tuple town ↔ station, while 0.17 and 0.47 for the tuples station ↔ city and town ↔ city, respectively. From a simple viewpoint, the contexts town and city are highly likely to be described by utilising the same types of nuggets (paths); the difference is due mainly to the value of some of their attributes (e.g., foundation date, location, size and number of inhabitants). In actuality, in many cases, due to this ambiguity, some individuals would raise an eyebrow when someone calls a city what they consider a town. However, the distinction is clearer between a station and a city, because the specific information that disambiguates a station from a city is indispensable in a definition, causing an increase in their dissimilarity. Another reason that boosts the divergence between these contexts is that station is ambiguous: it can denote an army base, a bus stop, or a radio station. This ambiguity also stresses the need for incorporating discriminative aspects into the definition.

Analogously to the contexts city and town, the strong similarity between synonyms (e.g., book ↔ novel and song ↔ single) and genders (e.g., actor ↔ actress) can be explained. In like manner, some contexts, such as son, daughter and born, are closely related. In this particular case, the underlying reason is that the information about the birth and lineage/ancestry of a person is occasionally conveyed in the same descriptive sentence, causing them to appear as

synonyms. Some illustrative examples:

Another conclusion regards some contexts corresponding to objects, which are closely related to a particular type of person; for instance, album ↔ singer and author ↔ book. Certainly, this is due to the fact that descriptions of albums usually include their singers, and descriptions of singers can include information about their albums. Further, this graph also displays some strong hyponymic relationships: artist ↔ [painter, writer, singer], musician ↔ [composer,

singer]. Furthermore, some part-of relations are also signalled as very close: character ↔ series and band ↔ singer. In addition, it can be observed that contexts like film, game, series, and novel are largely related. These relationships can occur as a result of the fact that many films are based on novels, games on films, and series on novels, causing these types of definitions to have a clear overlap.

Figure 6.3: Highest-frequent dissimilar contexts (cosine(cs1, cs2) × 100).

In the opposite way, figure 6.3 exhibits the most dissimilar pairs of contexts. Nodes with a high number of edges symbolise the most divergent contexts; excellent examples are the nodes episode, single, species and genus. Results demonstrate that descriptions of episodes are almost unrelated to definitions of types of people, such as professor and painter, and also unrelated to descriptions of locations such as city and club. One interesting dissimilarity is due to the contexts footballer and language. In light of these results, and concerning the three specific models utilised by [Han et al., 2006], it can be concluded that more specific types of definiendum must be considered when rating definitions. That is, the three types established by TREC are too general to model many divergent types of definiendum. The outcomes also reveal the need for a taxonomy that integrates the relevant types of nuggets for each class, and that specifies the association amongst different classes in the taxonomy.

In short, the similarities and dissimilarities amongst the highest-frequent context indicators support the initial hypothesis that discriminative context models can be inferred at the sentence level. More precisely, these models can be deduced from sentences matching wide-spread definition patterns, and used thereafter for scoring candidate sentences accordingly. On the whole, context indicators provide hints for discriminating some potential senses. In a special manner, they can help to disambiguate people and places named after individuals, while they might not be that helpful in distinguishing between several versions of the same movie (e.g., "Ben-Hur"). In these cases, one can see that a more complex strategy that learns discriminative features by examining examples is necessary. For instance, dates can be very important for recognising different versions of movies. Nevertheless, sense discrimination is a difficult problem [Chen et al., 2006]. All in all, the outcome emphasises the importance of the context models, and of the specific paths that characterise each particular context.

6.5.9 Some Semantic Relations Inside Contexts

As a matter of fact, there is always a semantic relationship between each pair of contexts; the intrinsic difference between these relations is their degree of strength. As mentioned earlier, there are contexts that bear a strong resemblance, whereas other pairs are clearly dissimilar. At any rate, it is not only inter-context patterns of similarity that matter, but also the interesting semantic relationships that can show up within each context. To be more specific, in some cases the context indicator establishes a well-known semantic connection with some terms, which can be recurrently manifested across descriptions of the manifold definienda that can be derived from the respective context. A simple way of detecting the most important or relevant groups of semantic relations consists in checking the synsets in WordNet that contain the context indicator, and their stipulated relations to other synsets. More exactly, WordNet provides about twelve semantic relations, including hypernym/hyponym and meronym/holonym; it also yields antonyms and pertainyms. Strictly speaking, when exploiting WordNet to distinguish relationships across the whole treebank, 17,219 distinct pairs were labelled.
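A minimal sketch of this kind of relation harvesting, assuming NLTK's WordNet interface (the thesis does not state which API was used), is shown below.

```python
# Illustrative sketch: collect the WordNet relations anchored at a context
# indicator. Requires the WordNet corpus: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def relations_for(indicator: str):
    rels = {"hyponym": set(), "hypernym": set(),
            "meronym": set(), "holonym": set(),
            "antonym": set(), "pertainym": set()}
    for synset in wn.synsets(indicator):
        rels["hyponym"].update(l.name() for s in synset.hyponyms() for l in s.lemmas())
        rels["hypernym"].update(l.name() for s in synset.hypernyms() for l in s.lemmas())
        rels["meronym"].update(l.name() for s in synset.part_meronyms() for l in s.lemmas())
        rels["holonym"].update(l.name() for s in synset.part_holonyms() for l in s.lemmas())
        for lemma in synset.lemmas():
            rels["antonym"].update(a.name() for a in lemma.antonyms())
            rels["pertainym"].update(p.name() for p in lemma.pertainyms())
    return rels

# e.g. relations_for("book")["meronym"] is expected to include items such as
# "binding" and "cover", in line with the examples discussed in the text.
```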

Context Indicator   Hyponyms/Hypernyms
place               abode, area, birthplace, center, coffin, contiguous, job, left, middle, park, point, rank, residence, right, sanctuary, square, status, stop, tomb, vicinity.
organization        activity, administration, alliance, body, brotherhood, coalition, company, design, establishment, federation, government, party, regime, union.
area                acreage, arena, construction, environment, heart, land, middle, open, place, playground, resort, sanctuary, space, terrace, territory.
player              accompanist, bowler, flautist, footballer, guitarist, lutenist, oboist, performer, pianist, plant, scorer, seed, shooter, soloist, star, vocalist.
book                album, authority, booklet, catalogue, hardback, journal, notebook, novel, paperback, production, publication, record, register, text, ticket, tome.
musician            arranger, artist, composer, conductor, director, harpist, singer, soloist, virtuoso.
author              biographer, coiner, compiler, essayist, ghostwriter, lyricist, poet, scriptwriter.
film                create, docudrama, episode, footage, medium, negative, scene, sequence.
leader              boss, captain, chief, commander, father, head, imam, inspirer, politician.
language            expression, module, string, text, word.
genus               form, kind, plant genus, taxon, variety.

Table 6.19: Some samples of hypernyms/hyponyms related to a subset of the context indicators listed in table 6.3.

The first sort of relation recognised by WordNet is hypernym-hyponym pairs. Precisely, the number of distinct pairs identified reached 13,217, thus indicating the importance of this sort of relation. Overall, hypernymy encompassed 76.76% of the automatically labelled relationship pairs. To exemplify, table 6.19 underscores some hypernyms regarding the most prominent context indicators. As a rule, the number of times that a particular pair is found is relatively low with respect to the frequency of its context. By the same token, the number of globally distinct pairs labelled by WordNet (17,219) is also comparatively low with regard to the total number of context models (45,698). This can be attributed to data-sparseness, and/or to the fact that the kinds of connections discerned by WordNet seldom occur across definitions that match the rules illustrated in table 3.1 (page 65). Nonetheless, an inspection of these matchings suggests that some relations might be fundamental and that they can be typified by some paths:

→ → →

→ →

→ →

→ →

→ →

→ →

→ →

→ → → →

→ →

Context Indicator   Sample definitions
place               ...a major place for cement production in Entity, despite the fact that it is a populous place in the city center.
                    ...the meeting place of the council for the municipality and is a service center for the surrounding farming community.
                    ...the current resting place of the heart of Entity, commonly known as Entity.
                    ...a popular place for weddings, as it has a historic sanctuary and is located in downtown Entity.
organization        ...a standards organization and is the Entity member body for Entity.
                    ...the official Entity non-profit organization charged with overseeing the international coalition of poetry slams.
area                ...an area of acreage blocks and small farms.
                    ...a small area of central Entity, named after the old market square at its heart.
                    ...a historical area that lies in the heart of Entity.
player              ...a major player in the bottled water business after Entity bought the small bottling plant in Entity.
                    ...a former soccer player, the all-time leading scorer for the Entity national team.
book                ...a book published by Entity, the fourth tome of the works of Entity.
                    ...a book which compiles a register of numerous commercium songs.
language            ...the Entity programming language plus a graphics module called Entity.
genus               ...the one genus of the taxon Entity from the Entity site whereby the lower and upper jaws have been found united.

Table 6.20: Some definitions about the context indicators in table 6.19. These definitions express their connection with some of the listed hypernyms/hyponyms.

Table 6.20 highlights some descriptions bearing some of the pairs in table 6.19. Further, the number of holonym/meronym pairs discovered by WordNet reached 1,165 distinct matches, which is substantially lower than for the previous type of relation. For example, the context book is linked with the words binding, cover, and text, while the context film with credit, episode, scene, sequence and shot. Other contexts, such as program and song, also yielded interesting matches such as and as well as . Some illustrative sentences that underline the essentiality of this class of semantic relation are:


These examples show that meronyms of the context indicator can be utilised to express specific characteristics of the definiendum, which are inherently connected with its meronym. To illustrate, take expressions such as "infectious singalong chorus", "command syntax similar to ed", and "text/illustrations by Entity". This means that the essential characteristic of the chorus (a part) of this song is the fact that it (the chorus) is an "infectious singalong". Furthermore, WordNet labelled 448 different relations as pertainyms. Some good examples are <film, cinematic>, and :

What is more, WordNet discriminated 620 different antonyms of the context indicator that can potentially co-occur in a description. Take, for instance, the following sentence together with the pair :

To sum up, an examination of descriptions belonging to the training material reveals that terms in a particular definition can signal specific semantic relations. This finding has the potential for aiding the ranking of answer candidates. However, several relations went undetected by WordNet; one can hence conjecture that statistical techniques, including Latent Semantic Analysis (LSA), could be helpful to: (a) broaden the coverage of the relations that WordNet can identify, and (b) tackle data sparseness head-on by inferring new semantic connections between context indicators and terms that do not directly appear in the respective context. Nevertheless, one can also envision that synsets in WordNet that match an existing relation can assist in combating data-sparseness. Put differently, one can assume that elements in each matching synset are interchangeable, analogously to the procedure used by [Han et al., 2006] for cushioning the effects of redundancy.
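One way such an LSA-style smoothing could be realised, purely as an illustration and not as an experiment reported in the thesis, is sketched below using a truncated SVD of the PMI-weighted matrix from section 6.5.8.

```python
# Hypothetical sketch: low-rank context vectors from the PMI-weighted
# path-by-context matrix, in which latent relations not observed directly
# can surface.
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def latent_context_vectors(pmi_matrix, n_components=300):
    """pmi_matrix: sparse path-by-context matrix (as sketched earlier);
    returns a dense contexts-by-components matrix."""
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(pmi_matrix.T)     # one row per context indicator

def smoothed_similarity(vectors, i, j):
    return float(cosine_similarity(vectors[i:i + 1], vectors[j:j + 1])[0, 0])
```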

Context Indicator   Terms
album               country-rock, full-length, hard-glam, post-metal, two-CD
band                deathcore, four-piece, metal-trash, power-pop, sleaze
born                apprenticed, attended, baptised, graduated, gubernia, immigrated
character           fictional, legendarium, mutant, now-cancelled, portrayed, soap
company             biopharmaceutical, headquarted, manufactures, privately-held, publicly-traded
episode             first-season, one-hundred, reimagined, sit-com
film                action-comedy, black-and-white, crime-drama, directed, rockumentary
genus               algae, artiodactyl, catfishes, dromaeosaurid, meat-eating, ornithopod
member              co-chaired, constituents, entomologique, homegrown, senate
school              all-boys, boys-only, elementary, enrolls, fee-paying, secondary, tuition-free
town                atlas, census, ghost, notified, subregion

Table 6.21: Some terms highly statistically related to some of the context indicators listed in table 6.3.

In the previous section, the cosine, calculated in consonance with the matrix Mij, threw some light on the semantic relatedness between distinct contexts. Although this matrix

might not seem as attractive as more powerful methods such as LSA, it can still help foreshadow what might come to light when benefiting from more effective, but at the same time more computationally demanding, statistical techniques. Table 6.21 depicts some of the outcomes, representing some of the terms that have a high Mij value and occur more than ten times. In this table, for the sake of clarity, paths were omitted on purpose. Recurrently, specific adjectives seem to take over the most statistically significant relations. Still, one can notice other kinds of relations including , , and <film, directed>. This suggests that both statistically and linguistically based methods can be integrated as a means of recognising a wider range of semantic relationships.

What about paths? The matrix Mij also signals the strong linkage between context indicators and some of their n-gram paths. Table 6.22 illustrates some representative relations that have a high Mij value and which, at the same time, embody paths occurring more

than ten times across the entire treebank of contextual descriptions. In this table, paths

of all plausible lengths can be seen: short paths (e.g., → ) and longer paths (e.g.,

→ → → → ). Furthermore, this subset highlights the significance of longer paths carrying entities that typify their respective contexts, and therein lies the pertinence of this factor when scoring answer candidates. Note that these paths differ from the more "standard" relations shown in tables 6.5 and 6.6. Above all, this experimental observation supports the fundamental postulate of context models (see section 6.5): descriptions directed at a particular type of definiendum are mainly characterised by dependency paths connected with this type.

Context Indicator N-gram paths

actor → → → → → → → → →

→ → → → → → →

album → → → → → → → →

→ → → → → → →

book → → → → → →

character → → → → → → →

→ → → → → →

footballer → → → → → →

→ → → → → → →

genus → → → → → → →

→ → → → → →

member → → → → → → →

→ → → → → → →

politician → → → → → → →

term → → → → → → →

Table 6.22: Some n-gram paths highly statistically related to some of the context indicators listed on table 6.3.

6.5.10 Projecting Answers into the AQUAINT Corpus

As a means of finding documents related to the definiendum across the AQUAINT corpus, the collection of documents was indexed with Lucene, and the top one hundred documents were fetched by querying the definiendum against this Information Retrieval (IR) engine. The number of fetched documents and the IR Application Programming Interface (API) vary among different definition QA systems. For instance, [Han et al., 2006] retrieved the top 200 documents by means of OKAPI, and produced an output of a maximum of 2,000 bytes in length. Conversely, [Chen et al., 2006] fetched the top 500 documents by means of Lemur, and the best

TREC 2003 system retrieved a maximum of 1,000 documents [Xu et al., 2003], outputting an answer of at most 4,000 bytes in length. The best TREC 2004 system fetched a maximum of 800 documents per definiendum [Cui et al., 2004b].

In the first place, the one hundred documents retrieved from the AQUAINT corpus were split into paragraphs in agreement with the structure provided by the documents in the collection. In the second place, paragraphs containing a query term, excluding stop-words, are selected. In the third place, co-references are resolved within the chosen paragraphs by means of JavaRap. In the fourth place, each selected paragraph is divided into sentences by OpenNLP. Last, sentences carrying a query term, excluding stop-words, are returned as answer candidates. It is worth underlining here that a smaller group of documents was utilised as the source of putative answers in order to manually inspect them, hence making it possible to measure the performance independently of the quality of the retrieval system. Overall, more than 18,000 different sentences were manually annotated in consonance with the TREC ground truth; a larger number of sentences would have made this manual task more demanding.

Even though many definitions match the pre-determined battery of rules across the ∼18,200 sentences, normally few matches occur per definiendum. Therefore, the direct application of rule-based context models would have few chances of achieving a significant recall, and ergo, a competitive F(3)-Score (it is worth noting here that this score is biased towards recall). All things considered, context models were adapted or extended as follows in order to deal with the AQUAINT corpus:

1. Top definition QA systems traditionally take advantage of KBs for learning a topic model or extracting nuggets that are projected into the set of candidate sentences afterwards. The dependence of these systems on the coverage of these KBs makes them less attractive [Zhang et al., 2005, Han et al., 2006]. For instance, [Han et al., 2006] capitalised on eight different KBs, while the best TREC 2003 and 2004 systems on six [Xu et al., 2003, Cui et al., 2004b] (see sections 2.4 and 4.8 on pages 35 and 91). On the contrary, [Chen et al., 2006] profited from task-specific clues in order to enhance the retrieval of web snippets from biography web-sites and on-line dictionaries (section 2.4 on page 35). The first adaptation consists in replacing these KBs with the output produced by context models applied to sentences that align with definition patterns originating from web snippets, that is, the output corresponding to the performance detailed in table 6.7. Certainly, this limited group of sentences supplies less redundancy and diversity of nuggets per definiendum than full pages acquired from six and/or eight KBs, and it is not as authoritative as these KB articles either, but this set of sentences cushions the dependence on the coverage of KBs.

2. A topic model is deduced from this output obtained from the Internet. This topic model is built on top of the n-gram LMs and dependency paths presented in section 6.5.2, but it disregards contexts, that is, all sentences are seen as belonging to the same context. This restriction is slackened because these sentences are likely to be related to the predominant senses on the Web, and this model is aimed specifically at rating any answer candidate from the AQUAINT corpus, including those that mismatch the pre-determined array of definition patterns.

3. Answer candidates are scored in accordance with these topic models multiplied by a brevity penalty. As for the brevity penalty, the factor in equation 6.3 was utilised with the parameter Lref empirically set to eight. Thus, algorithm 1 was utilised for singling out sentences up to a maximum length of 3,500 characters. It is worth pointing out

here that, if there was still length allowance, the output was augmented with the remaining highest-ranked, but usually few, sentences that match definition patterns (this scoring and selection step is sketched below).
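A minimal sketch of this scoring and selection step is given below; the exact brevity penalty of equation 6.3 is not reproduced in this excerpt, so a BLEU-style penalty with Lref = 8 is assumed, and the topic-model scorer is a hypothetical callable.

```python
# Hypothetical sketch of step 3: AQUAINT answer candidates are scored with the
# web-derived topic model times a brevity penalty, and the highest-scoring
# sentences are accumulated up to 3,500 characters.
import math

L_REF = 8
MAX_OUTPUT_CHARS = 3500

def brevity_penalty(length: int, l_ref: int = L_REF) -> float:
    # Assumed BLEU-style penalty; the thesis uses the factor of equation 6.3.
    return 1.0 if length >= l_ref else math.exp(1.0 - l_ref / max(length, 1))

def project(candidates, topic_model_score):
    """candidates: list of sentences; topic_model_score: callable giving the
    topic-model score of a sentence (assumed)."""
    scored = sorted(candidates,
                    key=lambda s: topic_model_score(s) * brevity_penalty(len(s.split())),
                    reverse=True)
    output, used = [], 0
    for sentence in scored:
        if used + len(sentence) > MAX_OUTPUT_CHARS:
            break
        output.append(sentence)
        used += len(sentence)
    return output
```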

TREC-2003 System                    F(5)-Score   TREC-2004 System               F(3)-Score
BBN                                 0.555        National Univ. of Singapore    0.460
[Chen et al., 2006] biterms+LMs     0.531        Fudan University               0.404
National Univ. of Singapore         0.473        National Security Agency       0.376
University of Southern California   0.461        University of Sheffield        0.321
Language Computer Corp.             0.442        University of North Texas      0.307
Context Models                      0.440        Context Models                 0.299
+ Projection                        0.594        + Projection                   0.355
Univ. of Colorado/Columbia Univ.    0.338        IBM Research                   0.285
ITC-irst                            0.318        Korea University               0.247
University of Amsterdam             0.315        Language Computer Corp.        0.240
MIT                                 0.309        CL research                    0.239
University of Sheffield             0.236        Saarland University            0.211
University of Iowa                  0.231

Table 6.23: Comparison with the top-10 TREC systems (sources: [Voorhees, 2003, 2004, Chen et al., 2006]).

Table 6.23 depicts the performance of context models plus the projection strategy with respect to the top ten TREC 2003 and 2004 systems. For each dataset, two values are presented: the first and lower value is computed against the gold standard, and the second and higher value with respect to the nuggets in the ground truth which are also within the whole set of answer candidates. This last value signifies the performance with respect to the set of sentences inputted to the answer extraction module; that is, it filters out the effects of the retrieval and of mistakes in the co-reference resolution step. This value was computed with respect to 46 and 62 questions corresponding to the TREC 2003 and 2004 datasets, respectively. This means there were no vital nuggets across the fetched answer candidates for four and two queries pertaining to the TREC 2003 and 2004 question sets, respectively.

From TREC 2003 to 2004, the F(3)-Score, with respect solely to retrieved vital nuggets, diminished from 0.48 to 0.355. A reason for this worsening stems from the context questions incorporated into the 2004 track. Context questions comprise a sequence of queries about a specific target concept (definiendum). These queries encompass factoid and list questions, whose answers, as [Han et al., 2006] pointed out, could perfectly well be part of the gold standard of the respective definition query. The crucial issue here is that systems are forced to remove these answers from the output of the definition question, and they are, for this reason, left unconsidered in the ground truth of the corresponding definition question. Since it would be unfair to account for the right answers in the gold standard, because TREC systems in practice do not know them, the answers to these previous queries were not taken into consideration when evaluating context models plus the projection strategy. At any rate, the impact of these answers is certainly a decrease in F(3)-Score, as many of these nuggets are captured, hence enlarging the response, but they contribute neither to recall nor to precision.

Another source of errors is anaphora resolution. In some cases, nuggets were missed due to wrong inferences drawn by JavaRap. To reinforce this empirical observation, consider the

next sentence:


In this phrase, the resolved "The deal" should be replaced by the definiendum "Rohm and Haas", making it possible to recognise, and thus also annotate, one of the two vital nuggets for this definiendum: "had a premier position in acrylic and electronic materials".

Like [Han et al., 2006], context models plus the projection strategy (CM+PS) also ranks among the vanguard definition QA systems when coping with these two question sets. One vital aspect that must be kept in mind when contrasting with the top TREC systems is that CM+PS makes allowance for a restricted group of sentences sifted automatically from the Web, whereas top TREC systems account for a battery of wrappers that take evidence from distinct KBs. This evidence is well-known to substantially boost the performance on these two sets [Zhang et al., 2005]. Additionally, CM+PS considers a small set of target documents, contrary to top TREC systems. This is also particularly relevant, because for some definienda, such as "The Clash", only unrelated documents were included in the top hundred hits returned by Lucene. Further, CM+PS does not learn local statistics or regularities from the respective target set of sentences. In light of these remarks and the performance achieved by CM+PS, it can be concluded that LMs and dependency paths also offer a competitive means of inferring topic models from sentences, regardless of whether or not they match a definition pattern, and of projecting them into the target corpus afterwards. Lastly, the figures in table 6.23 additionally corroborate the performance of the web system premised on context models and definition patterns, as it seems to cover many of the TREC nuggets.

6.6 Conclusions and Further Work

This chapter mainly dissects the use of LMs for rating candidate answers to definition questions in English.

In the first place, this chapter highlights a definition QA system that capitalises on four distinct LMs for answering definition queries in the context of TREC 2007. These four language models are induced from different corpora, and as an achievement, this system captured the first place in the definition QA subtask. Another attractive feature of this system is the exploitation of four distinct dependency relations when ranking putative answers.

In the second place, this chapter fleshes out a comparison amongst LMs operating with three distinct features: unigrams, bigrams and biterms. The best performance was achieved with the incorporation of biterms, shedding light on the pertinence of syntactic information for distinguishing genuine answers, in particular the relative order of pairs of co-occurring words.

In the third place, this chapter outlines a method for scoring candidate answers that intermixes assorted unigram LMs: (a) one inferred from an array of documents fetched from the target collection; (b) another derived from eight KBs; (c) one deduced from the top ten hits returned by Google; and (d) a definition model that has two branches: one regarding three types of definienda and one general definition model. More exactly, the first three models are mixed as a means of creating a topic model that inherently resembles traditional projection strategies. Their findings concern the decisive effect of KB-based models on the good performance, the insufficiency of documents retrieved from the collection and the Internet, and the positive impact of their three specific models.

In the fourth place, this chapter elaborates on context models, which have their roots in the findings of the last two methodologies:

1. They assemble a set of 45,698 specific models in relation to various types of definienda. This framework is constructed automatically, and it produces a sharper separation of semantic units. Further, context models do not capitalise on articles about the definiendum across KBs, nor do they profit from a general definition model. A keystone principle of these

models is that definitions are principally typified by dependency paths connected with their types.

In this respect, the examination of the context models undoubtedly indicates that a semantic relation exists between them. To be more specific, empirical observations unveil that some pairs of contexts are more dissimilar than others (e.g., painter ↔ singer), while other pairs share a greater similarity (e.g., town ↔ city). This leads to the conclusion that three specific models do not offer the optimal solution. For instance, amalgamating contexts (e.g., singer, footballer and politician) under the umbrella of one model for persons would not capture essential disparities. Furthermore, results suggest that a semantic hierarchy might be necessary for modelling different relationships across models. At any rate, the construction of this hierarchy poses an interesting challenge.

From another viewpoint, context models also indicate that they produce enough granularity to encapsulate some semantic relations within the context (e.g., hypernymy and meronymy). Other semantic connections, including antonyms and pertainyms, were also discovered when inspecting the variety of contexts.

2. They exploit sequences of terms (n-grams) given by the lexicalised dependency tree representation of sample and testing sentences, in contrast to other techniques in this chapter, which chiefly utilise shallow unigrams, bigrams and biterms as features. This materialised an enhancement in precision, as it helped to detect the most dependable nuggets, while diminishing the ranking score of those that gave the misleading impression of being answers. Overall, experiments using this technique showed that lexicalised dependency paths serve as salient indicators of the presence of definitions in natural language texts.

In brief, context models take advantage of descriptive knowledge mined from Wikipedia for deducing some regularities which characterise descriptions of instances of the same kind of definiendum. Specifically, these regularities are obtained from anonymised sentences expressing descriptive information about a large array of definienda. These sentences match a set of definition patterns, and they are automatically harvested from almost every article in Wikipedia. Crucially, context models attempt to eliminate the reliance of definition QA systems upon the coverage given by KBs to each particular definiendum.

Three baselines were implemented as a means of assessing and comparing the context models. All of these baselines ignore articles on the definiendum across KBs, and they were designed in such a way that they systematically increase their lexico-syntactic knowledge. Since the impact of KBs on the performance is well-known [Han et al., 2006, Zhang et al., 2005], robust methods were investigated that disregard this sort of information.

Generally speaking, the fact that context models outperform the three baselines indicates that they make a valuable contribution to enhancing the performance of definition QA systems. That is to say, they offer a solution to the problem of coverage exhibited by KBs, and in effect, there is nothing that prevents them from empowering models derived from KB statistics. More precisely, the figures signal a tangible betterment in terms of recall, revealing that context models aid in recognising more descriptive nuggets.

This improvement eventuates from the synergy between prioritising the selection of candidate sentences belonging to the most predominant context indicators, and basing the scoring on matching n-gram dependency paths. The former biases the ranking in favour of the more trustworthy sentences, while banning, or leaving to posterior positions, those misleading candidates to which the baselines assigned a high ranking value and which are related to less pertinent context indicators. The latter assists in detecting the most trustworthy candidates in each context, as it singles out those answer candidates bearing a closer similarity to the syntactic properties embodied in the LMs inferred from the automatically built treebank. Consequently, the final output encircles some correct answers seen as misleading by the baselines, but now preferred because of: (a) their membership in these prominent contexts; (b) the increment in their ranking; and (c) the reduction in the ranking score of those spurious answers picked by the baselines. These aspects really matter, as the order of extraction is vital in controlling the redundancy.

In addition, the worsening of the score of misleading answer candidates and the boost in the ranking value of genuine answers were experimentally observed through the considerable enhancement in terms of precision. This betterment results from the fact that lexicalised dependency path LMs do a better job when juxtaposing the syntactic properties of the putative answers with the regularities embraced by the models. By the same token, context models also help to ameliorate the ranking of answer candidates in terms of the top one and five sentences. In other words, the outcomes show that context models increase the precision of the definition patterns.
From another standpoint, and with regard to other features, results also showed that selective substitutions may enhance the recall of context models (i.e., the TREC 2005 question set). In general, however, these attributes and the extensions based on extra snapshots of Wikipedia bring about an improvement in terms of ranking order, but a diminution in terms of F(β)-Score. This might be due to the fact that averaging models causes low-frequency paths to lessen their weights, while high-frequency paths receive more importance as they appear in more models with high weight. This motivates research on smoothing techniques for context models, and for their potential semantic hierarchy.

Moreover, experiments attempting to incorporate information about contextual entities into context models revealed a decline in performance. There are several issues in the exploitation of entities that must be scrutinised. Firstly, named entities are commonly denoted by means of several aliases; thus, it is easy to see that the variation within the answer candidate does not match any instance embodied in the models. Secondly, recurrently, dependency paths that link the definiendum to named entities seem to intrinsically contain enough information to rank the respective candidate sentence. Experiments also demonstrated that several mismatches can occur, despite the usage of an alias resolution strategy. By all means, an efficient (hopefully optimal) resolution of name aliases is a crucial component/subtask in the process of answering definition and list questions.

A more thorough analysis of the outcomes implies that the accuracy of definition patterns depends on the context(s)/type(s) of the definiendum, since they are more likely to hurt the performance when uttered in certain contexts than in others. Further, this analysis additionally points to the fact that negative samples are indispensable to enhance the recognition of descriptions expressed by superlatives.

An adaptation of context models to the TREC definition QA subtask highlights their competitiveness. To state it more precisely, this is manifested in the following two factors. First, instead of articles on the definiendum across KBs, the outcome of context models operating on web snippets was projected into the candidate sentences. This set is reduced in size, and not perfectly determined. Second, the achievements of the projection procedure ensure that the outcome returned by the web system resembles, to some extent, the TREC ground truth. Moreover, these figures corroborate that LMs built on top of dependency paths are a competitive projection strategy. All in all, the main objective of context models is to discern answers across web surrogates, not the AQUAINT corpus. Nonetheless, they accomplish a competitive performance without "annotated" resources when coping with this collection of news articles.

As future work, WordNet synsets can be utilised for smoothing or grouping context models, ergo enhancing the recall, and expectedly maintaining or improving their precision. Other strategies to detect redundancy can be developed by recognising similar dependency paths [Chiu et al., 2007]; this constitutes a key advantage of using dependency paths for answering definition questions. On a final note, context indicators can aid in discriminating some of the potential senses of the definiendum.
In a special manner, they can aid in disambiguating people and places named after individuals, while they might be fruitless in distinguishing strongly semantically close senses, and probably also not enough to separate different senses covered by the same context indicator (e.g., station). Nevertheless, some clustering approaches to automatically branch these contexts might be helpful. Lastly, the outcomes emphasise the importance of the context models, and of the specific paths that typify each particular context.

Figure 6.4: Highest-frequent similar contexts (cosine(cs1, cs2) × 10).

Chapter 7

Discriminant Learning for Ranking Definitions

“To know that we know what we know, and that we do not know what we do not know, that is true knowledge." (Henry David Thoreau)

“The machine does not isolate man from the great problems of nature but plunges him more deeply into them." (Antoine de Saint-Exupery)

“Science is the systematic classification of experience." (George Henry Lewes)

7.1 Introduction

As a rule, it is standard practice for definition Question Answering (QA) systems to mine Knowledge Bases (KBs) for reliable descriptive information about the definiendum. To a meaningful extent, this sort of practice makes intuitive sense, because the information embodied in KBs can be reasonably deemed trustworthy. In other words, systems can perceive information units (e.g., sentences) harvested from KBs as positive examples, and ergo the task of answering definition questions can be rendered as identifying "more like these" positive examples across the set of answer candidates.

Strictly speaking, there are three ways of capitalising on positive training material: (a) models grounded solely on articles about the definiendum; (b) general definition models; and (c) models premised on the type(s) of the definiendum. In truth, there is nothing that prevents amalgamating these models. However, the heart of the matter is that it is easy to gather "annotated" positive training data, but, on the other hand, it is hard to automatically collect negative samples that can improve the performance. More accurately, these negative examples are contexts (e.g., sentences) bearing the definiendum, whereof no descriptive knowledge

is conveyed. Take, for instance:

CONCEPT

CONCEPT

In effect, as matters stand today, the critical unresolved issue is the absence of authorita- tive sources of negative samples that can offer enough coverage to wide-ranging domains, and which, at the same time, can act as counterparts of the positive set. In practical terms, the annotation process is a complicated problem, because it is, in many cases, notoriously


difficult to clearly state whether a particular sample is positive or negative, even for human annotators.

In actuality, corpus annotation is not the only major issue. After specifying a dependable set of positive and negative examples, the next step implies another complex problem: determining the discriminative features; in other words, the identification of the characteristics that typify the elements of each category. These characteristics can be observed at different levels, such as words, chunks, sentences, shingles and paragraphs, as well as documents. In this respect, the study of these characterising attributes consists of finding combinations of features that produce the best performance. Since this array of properties is typically dependent on the training and evaluation sets, this analysis can also entail the discovery of those attributes that port to target sets that are very different in nature. Moreover, another key aspect of feature engineering is how much performance is gained by enriching the feature set with Natural Language Processing (NLP)-based attributes. More often than not, extracting this class of property demands substantial computational resources. For this reason, it is always desirable to find combinations of competitive features at the surface level that can aid in processing massive collections of documents.

This chapter is organised as follows: the next section deals at greater length with a strategy that categorises text snippets centred at the definiendum as putative answers. At heart, this classification methodology is predicated on a Support Vector Machine (SVM) with predominantly surface features. Furthermore, this strategy proposes a technique for automatically labelling training text fragments. Subsequently, section 7.3 goes over the impact of a battery of surface attributes on the ranking of both sentences and paragraphs as answer candidates. In addition, this study examines the effect of considering a third group for those samples that are seen as ambivalent (positive and/or negative). Later, section 7.4 focuses on the influence of various discriminative learning models on the ranking of copula sentences in Dutch. This also explores the use of some NLP-oriented attributes derived from named entities and dependency structures. Next, section 7.5 details the features exploited by a definition QA system in the context of the Text REtrieval Conference (TREC). This method has its roots in factoid QA, and it maps a definiendum type to an array of potential kinds of nuggets. Subsequently, section 7.6 presents an alternative technique for automatically annotating examples and, more importantly, it expands on the impact of assorted features on discriminative models for scoring open-domain pattern-independent sentences; section 7.7 concludes this chapter.

7.2 Ranking Single-Snippet Answers

In their early work, [Miliaraki and Androutsopoulos, 2004] constructed a classifier that compartmentalises 250-character text snippets into definitions or general texts. These text surrogates were harvested from an array of fifty documents downloaded from the Internet; to be more accurate, this collection conforms to the top fifty hits returned by a commercial search engine. Fundamentally, the core of their classifier is built on an SVM with a simple inner product kernel [Schölkopf and Smola, 2002]. In practical terms this means that the kernel learns an optimal linear classifier without moving to a higher-dimensional space.

A special characteristic of these 250-character fragments is that they are centred at the definiendum, and accordingly, training instances were manually labelled as definitions or general texts. By and large, [Miliaraki and Androutsopoulos, 2004] observed that the positive samples (definitions) were much fewer than their counterparts, that is, the negative (non-definition) samples. In particular, they acquired 3,004 positive and 15,469 negative examples, respectively, as an outcome of this manual annotation process. These 18,472 examples were taken from TREC questions and documents. Due to the imbalance between both categories, [Miliaraki and Androutsopoulos, 2004] preferred to utilise the SVM as a ranker rather than as a classifier. In essence, they noticed that this imbalance causes their classifier to assign higher confidence scores to the negative class than to the positive group. Since their SVM implementation returned confidence scores in congruence with the probability that a testing instance belongs to each category, the top five answer candidates are assumed to be correct answers, that is to say, the putative answers coinciding with the five highest confidence scores of the positive class.

As for features, [Miliaraki and Androutsopoulos, 2004] examined the performance of four different configurations of their definition ranker. The first configuration makes allowances for two numeric attributes: SN and WC. The first denotes the ordinal number of the sentence within the source document it originally belongs to. The second embodies the percentage of the twenty most frequent words across all answer candidates which are carried by the candidate being ranked. These twenty terms are stemmed and do not include stop-words. Both features are, strictly speaking, aimed at capturing indicators of relevant global pieces of information that are likely to characterise the definiendum. In other words, these ingredients are predicated on the principle known as the Distributional Hypothesis [Harris, 1954, Firth, 1957], and they assist, for this reason, in ranking candidate answers in concert with the degree to which they typify the definiendum. By the same token, [Miliaraki and Androutsopoulos, 2004] inspected the utilisation of the ranking score returned by the Information Retrieval (IR) engine as another global feature. On the other hand, their local features encompassed thirteen binary values that cohere with the fulfilment of the nine rules of [H. Joho and M. Sanderson, 2000] (see 4.2 on page 76), and four extra definition patterns (a small sketch of some of these surface features follows the pattern list below):

1. like ⇒ “... diseases like swine flu"

2. or ⇒ “... or other countries"

3. (can | refer | have) ⇒ “Leopards can hear five times more sounds than humans..."

4. (called | known as | defined) ⇒ “... the giant wave known as tsunami"
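
To make the feature layout above more concrete, the following minimal Python sketch derives the SN and WC attributes and a handful of the lexical pattern flags for a candidate snippet. It is not the authors' implementation: the stop-word list, the regular expressions and the absence of stemming are illustrative simplifications.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "is", "to"}  # illustrative subset
    # A few of the lexical patterns listed above, expressed as regular expressions.
    PATTERNS = [r"\blike\b", r"\bor\b", r"\b(can|refer|have)\b", r"\b(called|known as|defined)\b"]

    def top_terms(candidates, k=20):
        """Return the k most frequent non-stop-word terms across all candidates."""
        counts = Counter()
        for text, _ in candidates:
            counts.update(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS)
        return {w for w, _ in counts.most_common(k)}

    def features(text, sentence_number, frequent_terms):
        words = set(re.findall(r"[a-z]+", text.lower()))
        wc = len(words & frequent_terms) / float(len(frequent_terms) or 1)   # WC: term-correlation feature
        pattern_flags = [1.0 if re.search(p, text.lower()) else 0.0 for p in PATTERNS]
        return [float(sentence_number), wc] + pattern_flags                  # SN, WC, pattern indicators

    # Usage: each candidate is (snippet text, ordinal sentence number in its source document).
    candidates = [("A tsunami is a giant wave known as a harbour wave .", 1),
                  ("The match was cancelled because of the giant wave of protests .", 7)]
    frequent = top_terms(candidates)
    print([features(t, n, frequent) for t, n in candidates])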

With regard to the second configuration, this extends the first one by adding an extra binary feature signalling whether or not the surrogate contains one of the best hypernyms of the definiendum. In addition, the third configuration enriches their ranker with features conforming to a group of automatically inferred patterns. These regularities, more specifically, encompass between 100 and 300 unigrams, bigrams and trigrams that occur immediately before or after the definiendum. In substance, [Miliaraki and Androutsopoulos, 2004] gathered these n-grams from the top documents returned by the search engine for the training queries. The highest ranked patterns in terms of precision that also have a frequency higher than ten were subsequently selected as acquired patterns. Here, the precision of a phrasal pattern is interpreted as the ratio of the number of genuine definitions that it matches to the number of all its matching text snippets. The underlying idea behind this collecting process is to discover a set of new phrasal attributes that can imply potential descriptions. Some illustrative phrasal patterns include:

1. is one ⇒ “Israel is one of the most advanced economies in the Middle East." ⇒ “Israel is one of 16 countries that make up the Middle East, a predominantly Arab region of the world at the crossroads of Africa, Europe, and Asia."

2. , (a | an) ⇒ “... A sudden health problem, a heart attack or"

Moreover, [Miliaraki and Androutsopoulos, 2004] also observed that some of their discovered phrasal patterns are directed principally at a specific domain (e.g., “people with" and “symptoms of "). They suggested, therefore, that this automatic methodology for inducing phrasal patterns can be exploited by domain-specific QA systems to distinguish answers to definition questions, for instance, (definition) QA systems tackling medical documents. However, [Miliaraki and Androutsopoulos, 2004] additionally noticed that these regularities do not ensure perfect accuracy. Otherwise stated, they found that these phrasal patterns do not rule out the presence of definitions reliably. Finally, [Miliaraki and Androutsopoulos, 2004] also attempted to discriminate dependable regularities on the grounds of metrics such as information gain [Yang and Pedersen, 1997] instead of precision, but this led only to inferior results. Lastly, the fourth configuration ran with all the previous features, but it ignores the attribute that models the best hypernym of the definiendum incorporated by the second configuration.

Ranker                                 Correctly Answered (%)
                                       TREC 2000 (160)   TREC 2001 (137)
Baseline [Prager et al., 2001, 2002]   50.00             58.39
Configuration I                        61.88             72.26
Configuration II                       63.13             73.72
Configuration III                      72.50             84.67
Configuration IV                       71.88             83.94

Table 7.1: Results achieved by the four configurations of attributes in comparison with a baseline system (adapted from [Miliaraki and Androutsopoulos, 2004]).

As to training, [Miliaraki and Androutsopoulos, 2004] conducted a 10-fold cross validation. In other words, they split the question set into ten parts, and at each iteration, the questions of a different part and their respective documents were utilised solely for testing, while the questions and documents of the remaining nine parts were used solely for training. In the case of configurations three and four, the deduction of phrasal patterns was accordingly repeated at each iteration. Consequently, table 7.1 emphasises their results in relation to the percentage of queries for which one valid definition was found within the top five ranked snippets. These figures underscore the performance reached by each of the four configurations. The marginal improvement and degradation obtained by the second and fourth configurations, respectively, largely appear to indicate that the hypernym is not as crucial as the automatically acquired phrasal attributes and patterns. This finding is particularly interesting because the best configurations do not account for deep linguistic information.

From a different angle, results in table 7.1 account solely for 200 n-gram features when employing configurations three and four, whereas table 7.2 depicts the outcomes accomplished by the same two configurations for both TREC question sets, but considering different numbers of n-grams. In particular, [Miliaraki and Androutsopoulos, 2004] observed that, when making allowances for 300 n-grams, both configurations finished with worse results because less reliable phrasal attributes begin to dominate the feature vector.

n-grams   Configuration III   Configuration IV
100       68.13/79.56         70.00/81.75
200       72.50/84.67         71.88/83.94
300       68.75/80.29         71.25/83.21

Table 7.2: Performance (%) on TREC 2000/TREC 2001 when varying the number of acquired phrasal attributes (adapted from [Miliaraki and Androutsopoulos, 2004]).

Incidentally, [Androutsopoulos and Galanis, 2005] extended this method by replacing the training data acquisition process. They exploited definitions across on-line encyclopaedias and dictionaries to obtain an arbitrarily large amount of training windows. To be more precise, these KB definitions were utilised for ranking training windows found across web pages. Accordingly, these training windows are seen as positive or negative in congruence with their similarity or dissimilarity to the corresponding definitions across on-line KBs. The training windows for which a clear distinction cannot be drawn were discarded. Definitions extracted from KBs are used solely for tagging training instances, which are taken exclusively from web pages. Allegedly, this is particularly important as it allows selecting lexical patterns that are indicative of definitions within web pages, as opposed to patterns that signal descriptions in KBs. Presumably, this also results from the reliance of their strategy on the n-gram phrasal attributes and on both the left and right contexts (their training windows are centred at the definiendum), whereas definitions emanating from on-line dictionaries do not, in general, observe this ordering.

As for the similarity measure, [Androutsopoulos and Galanis, 2005] transformed the training instances into a new snippet by: (1) removing stop-words, any non-alphanumeric character and the definiendum, and (2) stemming the remaining terms. The similarity between this new form of training example W and the definitions C collected from the on-line KBs is measured in consonance with:

\[ \mathrm{sim}(W, C) = \frac{1}{|W|} \sum_{i=1}^{|W|} \mathrm{sim}(w_i, C) \]

In this formula, |W| denotes the number of distinct words in W, and sim(w_i, C) is the similarity of the i-th distinct word of W to C in concert with:

\[ \mathrm{sim}(w_i, C) = f_{\mathrm{def}}(w_i, C) \cdot \mathrm{IDF}(w_i) \]

Where f_def(w_i, C) conforms to the percentage of definitions in C that contain w_i, and IDF(w_i) is the Inverse Document Frequency (IDF) of w_i in the British National Corpus (BNC):

\[ \mathrm{IDF}(w_i) = 1 + \log \frac{N}{df(w_i)} \]

In this equation, N coincides with the number of documents in the BNC, and df(w_i) corresponds to the number of BNC documents that contain w_i; whenever a term is not found in the BNC, the lowest score observed in the BNC is assigned. Stated more concisely, sim(w_i, C) privileges words with a higher occurrence across all KB definitions which are also rarely used in English. Accordingly, they took advantage of two different experimental thresholds for separating positive and negative, as well as unlabelled, examples.
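
A direct transcription of these three formulas into Python might look as follows. The BNC document-frequency table is replaced by a toy dictionary and the stop-wording/stemming step is assumed to have already happened, so this is a sketch of the scoring logic rather than a faithful reimplementation.

    import math

    N_BNC = 4054              # assumed number of BNC documents (illustrative value)
    DF = {"harbour": 120, "novel": 800, "published": 2500}   # toy document frequencies
    MIN_IDF = 1.0             # fallback for terms unseen in the BNC (lowest score)

    def idf(word):
        df = DF.get(word)
        return 1.0 + math.log(N_BNC / df) if df else MIN_IDF

    def f_def(word, kb_definitions):
        """Fraction of the KB definitions of the definiendum that contain the word."""
        return sum(1 for d in kb_definitions if word in d) / float(len(kb_definitions))

    def sim_word(word, kb_definitions):
        return f_def(word, kb_definitions) * idf(word)

    def sim_window(window_words, kb_definitions):
        """sim(W, C): average word-to-definitions similarity over the distinct words of W."""
        distinct = set(window_words)
        return sum(sim_word(w, kb_definitions) for w in distinct) / float(len(distinct))

    # Usage: the training window has already been stop-worded, stemmed and stripped of the definiendum.
    C = [{"novel", "published", "1934"}, {"novel", "waugh"}]
    W = ["novel", "published", "1934", "satire"]
    score = sim_window(W, C)
    # Thresholds reported by the authors: 0.5 for the positive class, 0.34 for the negative class.
    print(score, score >= 0.5, score <= 0.34)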

Fundamentally, [Androutsopoulos and Galanis, 2005] conducted several experiments as a means to set the values of these two thresholds. In so doing, they benefited from an array of 130 distinct training definienda, wherewith they fetched the top ten highest ranked hits in congruence with the positions of the results returned by Altavista. Since non-definition windows outnumber definition windows, they took into consideration only the first five from each web page. They manually tagged a group of 400 randomly chosen windows and tested various values of both thresholds in order to optimise the accuracy. The ratio of manually annotated examples was 0.37 to 1. Of course, for both positive and negative examples, this similarity method can achieve a significant precision at the expense of very low recall, that is, very few training examples. Also, [Androutsopoulos and Galanis, 2005] ensured that the ratio of positive to negative examples output by the method was the same, in order to avoid a bias in the training of their classifier. The empirical value obtained for the positive class was 0.5, which causes a precision above 0.72 and a recall of 0.49, while the value for the negative category was 0.34, bringing about a precision above 0.9. In their experiments, this last value oscillated from 0 to 0.34, being selected so that it led to the ratio closest to the ratio observed in the 400 tagged windows.

7.3 Undecided/Indifferent Labels

Like [Miliaraki and Androutsopoulos, 2004, Androutsopoulos and Galanis, 2005], [Xu et al., 2005] also made use of SVMs for ranking answer candidates to definition questions about technical terms. In their approach, however, they perceived paragraphs, instead of fixed-length definiendum-centred windows, as answer candidates. To be more precise, their technique gathers putative answers that embody the definiendum in the first base noun phrase of the first sentence of the paragraph. In addition, they interpreted two base noun phrases separated by “of” or “for” (e.g., “Food for Peace Program”) as definienda. The next step in their process of detecting answer candidates consists in utilising definition patterns for filtering some candidate sentences. The pre-defined set of rules used by [Xu et al., 2005] comprises:

1. is (a | an | the) ⇒ “The Temple Mount is the site of the first and second Jewish Temples, destroyed in 586 BCE and 70 CE, respectively-a historic fact accepted even by Muslim authorities."

2. , * , (a | an | the) ⇒ “Judaism, founded in Israel around 4000 years ago by the Hebrew leader Abraham, the religion of morality."

3. is one of ⇒ “The New Israeli Shekel is one of the strongest currencies in world, the New Israeli Shekel is now traded on the international currency market and is freely interchangeable."

In this group of three patterns, the star stands for one or more words. As to ranking putative answers, [Xu et al., 2005] capitalised on SVM and Ranking SVM [Joachims, 2002]. The difference between both learning machineries is that the former compartmentalises test examples, while the latter is premised on ordinal regression. This regression is seen as assigning three distinct labels to the answer candidates: “good” and “indifferent”, as well as “bad”, in accordance with the scoring value returned by Ranking SVM. The underlying idea is that the risk of misclassifying an example as “good” or “bad” is considerably higher than labelling it as “indifferent”. By contrast, the SVM clusters putative answers into only two groups, “good” or “bad”. Both learners were trained with the same set of features. These attributes encompass:

(a) a feature symbolising whether or not the definiendum appears at the beginning of the paragraph,

(b) if a determiner precedes the definiendum,

(c) all terms in the definiendum start with a capitalised letter,

(d) if the paragraph includes words such as “he”, “she” or “said”,

(e) if the definiendum contains pronouns,

(f) if the definiendum bears “for”, “and”, “,” or “or”,

(g) if the definiendum re-occurs in the paragraph,

(h) if the definiendum is followed by “is a”, “is the” or “is an”,

(i) number of sentences,

(j) number of terms,

(k) number of adjectives,

(l) frequent words after the definiendum within a window.

According to [Xu et al., 2005], some members of this array of twelve properties, such as (g), indicate definitions, whereas other ingredients like (d) signal non-definitions. Both SVM and Ranking SVM also account for “bag-of-words” features. These attributes are gathered from high-frequency words appearing immediately after the definiendum in the training data, somewhat following the intuition behind the phrasal patterns of [Miliaraki and Androutsopoulos, 2004]. In like manner, these terms are conceived as potential signals of descriptions. Simply put, whenever a paragraph carries any of these keywords, the respective feature value turns to one, otherwise it remains zero. Contrary to [Miliaraki and Androutsopoulos, 2004, Androutsopoulos and Galanis, 2005], [Xu et al., 2005] discarded redundant answers by means of the Edit Distance. More exactly, every time two candidates were too similar, the one having the lower score was removed. In their experimental settings, [Xu et al., 2005] randomly singled out two hundred definienda about technical terms. Those having only one answer (paragraph) candidate were removed, and the remaining candidates were labelled as “good”, “indifferent” or “bad” by human annotators. As a result, they obtained a set consisting of 95 definienda and their respective 1,366 answer candidates: 225 good, 470 indifferent and 671 bad. All these putative answers came from an intranet collection.
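
The redundancy-removal step just described can be sketched as follows; the plain Levenshtein distance and the 0.8 similarity cut-off are illustrative assumptions, since [Xu et al., 2005] do not report the exact distance variant or threshold they applied.

    def edit_distance(a, b):
        """Plain Levenshtein distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def remove_redundant(ranked_candidates, max_similarity=0.8):
        """ranked_candidates: list of (score, text) sorted by descending score.
        Whenever two candidates are too similar, the lower-scored one is dropped."""
        kept = []
        for score, text in ranked_candidates:
            redundant = False
            for _, kept_text in kept:
                longest = max(len(text), len(kept_text)) or 1
                if 1.0 - edit_distance(text, kept_text) / float(longest) >= max_similarity:
                    redundant = True
                    break
            if not redundant:
                kept.append((score, text))
        return kept

    # Usage:
    candidates = [(0.91, "The Food for Peace Program is a US food aid programme."),
                  (0.80, "The Food for Peace Program is a U.S. food aid programme."),
                  (0.40, "He said the programme would continue.")]
    print(remove_redundant(candidates))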

                 R-Precision   Precision at 1   Precision at 3
OKAPI            0.2886        0.2211           0.6421
SVM              0.4658        0.4324           0.8351
Ranking SVM      0.5180        0.5502           0.8868
Random Ranking   0.3224        0.3474           0.6316

Table 7.3: Outcomes obtained when considering paragraphs as answer candidates (adapted from [Xu et al., 2005]).

Table 7.3 contains the results accomplished by both learners and two baselines. These outcomes are in relation to the average performance achieved by conducting a 5-fold cross validation. As for baselines, they benefited from OKAPI and its ranking of paragraphs in agreement with their likeness to the query. As a second baseline, they randomly ranked answer candidates.

Their error analysis basically unveiled that the adjective feature caused some good answers to rank at the bottom. More precisely, they observed that 35% of the errors were due chiefly to this attribute. They additionally noticed that 30% of the errors stemmed from some indifferent or bad candidates ranked at the top. Of course, this has to do with the restricted potential of their features. Importantly enough, [Xu et al., 2005] detected that mislabelled samples brought about 5% of the errors. In addition, they noticed that, more often than not, the more adjectives a paragraph has, the less probable it is that the paragraph is a good definition. Nonetheless, they also took note of the fact that some good definitions can occasionally carry several adjectives. Furthermore, [Xu et al., 2005] observed that many paragraphs can verbalise a definition in the first sentence, while the remaining sentences do not convey descriptive content at all.

On a different note, [Xu et al., 2005] carried out an extra experiment using the same trained classifiers. This time, however, they ranked putative answers originating from the TREC.gov data. In their experiments, they took into account 25 definienda and their corresponding 191 answer candidates: 67 good, 76 indifferent and 48 bad. Table 7.4 highlights their experimental outcomes with respect to this dataset. According to [Xu et al., 2005], the good performance ported to this new test set because their features are domain-independent. The extent of this conclusion is unclear, however, because the distribution of definienda in this test set is strongly biased towards abbreviations or organisations (e.g., NIST, MAP and IRS). These sorts of definienda could also be found across technical terms. From another standpoint, to prove that their attributes are domain-independent, it would be much more appropriate to test their approach against a wider variety of types of definienda, especially individuals, since their training material did not account for such data. This kind of experimentation would support stronger conclusions about the portability of their features to other domains. Another critical factor is which attributes are actually portable. On the other hand, although they applied their rankers to testing instances coming from a corpus different in nature from their training data, this portability could be due to the fact that they utilise fixed syntactic structures (pre-determined definition patterns), and these regularities exist in both corpora.

                 R-Precision   Precision at 1   Precision at 3
OKAPI            0.4267        0.4000           0.8000
Ranking SVM      0.5747        0.6400           0.8400
SVM              0.5780        0.6400           0.9600
Random Ranking   0.3307        0.3200           0.7600

Table 7.4: Results obtained when dealing with TREC.gov data (adapted from [Xu et al., 2005]).

In addition, [Xu et al., 2005] conducted an extra experiment targeting sentences instead of paragraphs as putative answers. For this purpose, they made use of 78 definienda and their respective 670 candidate answers, which encompassed 157 good, 186 indifferent and 327 bad definitions. This new experiment required the inclusion of new features, such as the position of the sentence in the paragraph.

                 R-Precision   Precision at 1   Precision at 3
OKAPI            0.2783        0.2564           0.5128
SVM              0.6097        0.5972           0.8710
Ranking SVM      0.6769        0.7303           0.9365
Random Ranking   0.3693        0.3590           0.6795

Table 7.5: Results achieved when considering sentences as answer candidates (adapted from [Xu et al., 2005]).

Table 7.5 emphasises the portability of their method to the sentence level, in particular in relation to the top ranked answer. The performance of their strategy sharply increases when the second and third answers are taken into consideration.

7.4 Ranking Copular Definitions in Dutch

Another approach, outlined by [Fahmi and Bouma, 2006], focuses its attention on discovering responses to definition questions in the medical domain. In contrast to the techniques of [Miliaraki and Androutsopoulos, 2004, Androutsopoulos and Galanis, 2005, Xu et al., 2005], they targeted definition queries in Dutch. Their strategy, stated more precisely, extracted documents from the healthcare index of the Dutch version of Wikipedia. These articles were parsed in order to obtain their dependency graphs, and additionally, their named entities were labelled. These entities included person, organisation and geographical entities. A vital distinction between this approach and the three discussed previously lies in the properties of the potential answer candidates. This strategy, to put it more clearly, considered only sentences that (a) bear the verb “zijn” (to be) with a subject and a nominal predicative phrase as sisters, including (b) sentences in which the predicative phrase precedes the subject, as in “Onderdeel van de testis is de Leydig-cel” (the Leydig cel is part of the testis). Furthermore, [Fahmi and Bouma, 2006] benefited from lexical filters to expunge some potentially spurious and misleading candidates. Some of these lexical filters include the Dutch translations of: example, result and symptom (an illustrative sketch of this candidate selection is given after table 7.6).

With regard to the training and evaluation sets, [Fahmi and Bouma, 2006] manually labelled 2,500 sentences in accordance with three labels, homologous to [Xu et al., 2005]: definition, non-definition, and undecided. In this annotation strategy, undecided sentences consisted chiefly of incomplete definitions such as “Benzeen is carcinogeen” (Benzene is a carcinogen). Table 7.6 shows the distribution of instances within categories in the final labelled set.

        Definition   Non-definition   Undecided
first   831          18               31
other   535          915              170
total   1,366        933              201

Table 7.6: Amount of sentences annotated as definition, non-definition, and undecided versus the position within the document (source [Fahmi and Bouma, 2006]).
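
As an illustration only, this selection of copular candidates could be approximated with an off-the-shelf dependency parser. The sketch below relies on spaCy's Dutch model and UD-style relations, which is an assumption on my part; the original work used its own Dutch dependency parser and grammar-specific relations, so this is a rough stand-in rather than the authors' procedure.

    import spacy

    # Assumes the Dutch model is installed: `python -m spacy download nl_core_news_sm`.
    nlp = spacy.load("nl_core_news_sm")

    LEXICAL_FILTER = {"voorbeeld", "gevolg", "symptoom"}   # Dutch for example, result, symptom (illustrative)

    def is_copular_candidate(sentence):
        """Keep sentences whose nominal root has a 'zijn' copula and a nominal subject,
        and whose words do not trigger the lexical filter."""
        doc = nlp(sentence)
        if any(t.lemma_.lower() in LEXICAL_FILTER for t in doc):
            return False
        for token in doc:
            if token.head == token and token.pos_ in {"NOUN", "PROPN"}:   # sentence root
                has_cop = any(c.dep_ == "cop" and
                              (c.lemma_ == "zijn" or c.text.lower() in {"is", "was", "zijn", "waren"})
                              for c in token.children)
                has_subj = any(c.dep_ == "nsubj" for c in token.children)
                if has_cop and has_subj:
                    return True
        return False

    print(is_copular_candidate("Paracetamol is een pijnstillend en koortsverlagend middel."))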

After discarding undecided sentences, a total of 2,299 samples were left, of which 1,366 are genuine definitions. More interestingly, [Fahmi and Bouma, 2006] realised that when the

position of the sentence is taken as the sole feature for rating answer candidates, classifying all first sentences as definitions and all other sentences as non-definitions, a baseline accuracy of 75.9% is obtained. Incidentally, the list of ingredients that [Fahmi and Bouma, 2006] combined includes:

1. Bag-of-words, bigrams, and root forms. These attributes consider punctuation and stop-words. The reason for making allowances for stop-words was the fact that their experiments indicated a consistent decrease in accuracy when they are absent. In contrast to [Androutsopoulos and Galanis, 2005], [Fahmi and Bouma, 2006] accounted for all n-grams within a sentence as elements of the feature vector.

2. The position of each candidate sentence in the document.

3. Syntactic properties. These features encompass the position of each subject in the sentence (initial, e.g. “definiendum is ....”; or non-initial, e.g. “.... is definiendum”). Their experiments unveiled that this attribute seems to be critical, since sentence-initial subjects appeared in 92% of their definition sentences, but only in 76% of their non-definition sentences. In addition, these properties incorporate two extra ingredients that encode the type of determiner (definite, indefinite, other) of the subject and of the predicative complement. Notably, [Fahmi and Bouma, 2006] observed that 62% of subjects in (Dutch) definition sentences have no determiner, as in the descriptive phrase “Paracetamol is een pijnstillend en koortsverlagend middel” (Paracetamol is a pain alleviating and fever reducing medicine), whereas in 50% of non-definition (Dutch) sentences subject determiners tend to be definite, e.g. “De werkzame stof is acetylsalicylzuur” (The active ingredient is acetylsalicylacid). On the other hand, 64% of predicative complements tend to carry indefinite determiners in definition sentences like “een pijnstillend . . . medicijn” (a pain alleviating . . . medicine), whereas in 33% of non-definitions the determiner leans towards being definite, as in the next illustrative phrase: “Een fenomeen is de Landsgemeinde” (A phenomenon is the Landsgemeinde).

4. Named entities. This attribute encapsulates the entity class of subjects. [Fahmi and Bouma, 2006] noticed that, contrary to non-definitions, definitions are more likely to have a named entity in their subjects (40.63% compared with 11.58%).

For experimental purposes, [Fahmi and Bouma, 2006] selected the 200 highest ranked features of each class, which were scored in agreement with the information gain measure. Their training set comprised 1,366 definition and 933 non-definition sentences, respectively. Broadly speaking, all of their tried configurations outperformed their baseline (75.9%).

One of their interesting findings regards the best and relatively high accuracy (89.82%) achieved by Naïve Bayes when basic attributes, such as bigrams+bag-of-words, are the sole components of the feature vector. Further, their outcomes also unveiled that the addition of syntactic properties or the position of sentences in documents results in some improvements. Their figures, on the other hand, suggest that an enrichment with root forms does not significantly enhance the performance. In general, their experiments demonstrated that syntactic properties and the position of sentences within documents occupy a pivotal role in boosting the accuracy of their classifiers. The latter, in particular, cooperated in reaping the best performance of Naïve Bayes (90.26%), and gave better accuracy in all classifiers than syntactic properties.

Feature configuration                                                          Naïve Bayes (NB)   Maximum Entropy (ME)   SVM Linear     SVM Polynomial   SVM Gaussian
bag-of-words                                                                   85.75 ± 0.57       85.35 ± 0.77           77.65 ± 0.87   78.39 ± 0.67     81.95 ± 0.82
bigrams                                                                        87.77 ± 0.51       88.65 ± 0.54           84.02 ± 0.47   84.26 ± 0.52     85.38 ± 0.77
bigrams+bag-of-words                                                           89.82 ± 0.53       88.82 ± 0.66           83.93 ± 0.57   84.24 ± 0.54     87.04 ± 0.95
syntactic properties+bigrams+bag-of-words                                      85.22 ± 0.35       89.08 ± 0.50           84.93 ± 0.57   85.57 ± 0.53     87.77 ± 0.89
syntactic properties+entity classes+bigrams+bag-of-words                       85.44 ± 0.45       91.38 ± 0.42           86.90 ± 0.48   86.90 ± 0.53     87.60 ± 0.87
sentence position+bigrams+bag-of-words                                         90.26 ± 0.71       90.70 ± 0.48           85.26 ± 0.56   86.05 ± 0.64     88.52 ± 0.92
root forms+bigrams+bag-of-words                                                88.60 ± 0.81       88.99 ± 0.51           83.38 ± 0.38   84.69 ± 0.43     87.08 ± 0.87
syntactic properties+sentence position+bigrams+bag-of-words                    86.40 ± 0.51       92.21 ± 0.27           86.57 ± 0.42   87.29 ± 0.47     88.77 ± 0.77
syntactic properties+sentence position+entity classes+bigrams+bag-of-words     87.12 ± 0.52       90.83 ± 0.43           87.21 ± 0.42   87.99 ± 0.53     89.04 ± 0.67
root forms+sentence position+syntactic properties+bigrams+bag-of-words         87.60 ± 0.38       91.16 ± 0.43           86.68 ± 0.40   86.97 ± 0.41     88.91 ± 0.68
all attributes                                                                 86.72 ± 0.46       91.16 ± 0.35           87.47 ± 0.40   87.05 ± 0.63     89.47 ± 0.67

Table 7.7: Achieved accuracies/standard errors versus feature and classifier configurations (adapted from [Fahmi and Bouma, 2006]).

Needless to say, when both ingredients were intermixed, the performance still improved, ergo implying a complementary nature. Another key finding refers to the fact that the addition of named entity classes tended to increase accuracy. On the other hand, adding root forms did not enhance the performance. It is worth emphasising, however, that the best accuracies of Naïve Bayes (90.26%) and Maximum Entropy (92.21%) were accomplished without named entities and root forms. On the whole, in nine out of eleven configurations, Maximum Entropy (ME) achieved the best performance. Interestingly enough, SVM-LINEAR and SVM-POLYNOMIAL did not finish with better accuracies than Naïve Bayes when integrating basic features (bag-of-words and bigrams), while SVM-GAUSSIAN marginally outperforms Naïve Bayes in six out of the eleven configurations.
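
The feature selection step mentioned above (the 200 highest ranked features per class according to information gain) can be sketched with scikit-learn; mutual information is used here as a stand-in for information gain, and the tiny Dutch bag-of-words data is purely illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    sentences = [
        "Paracetamol is een pijnstillend en koortsverlagend middel",   # definition
        "De werkzame stof is acetylsalicylzuur",                       # non-definition
        "Benzeen is een koolwaterstof",                                # definition
        "De patient nam gisteren paracetamol",                         # non-definition
    ]
    labels = [1, 0, 1, 0]

    # Bag-of-words and bigram counts, keeping stop-words as the authors did.
    vectoriser = CountVectorizer(ngram_range=(1, 2), lowercase=True)
    X = vectoriser.fit_transform(sentences)

    # Keep the k highest-scoring features (200 per class in the original study; 10 here for the toy data).
    selector = SelectKBest(mutual_info_classif, k=10)
    selector.fit(X, labels)

    kept = [f for f, keep in zip(vectoriser.get_feature_names_out(), selector.get_support()) if keep]
    print(kept)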

7.5 Answering Definitions in TREC

In TREC 2003, [Burger, 2003] extracted responses to definition questions from the top 25 documents fetched by Lucene. Sentences in these documents were pre-rated by summing the IDF scores of each word that overlaps with the query, and low-scored sentences were thus eliminated. The remaining candidate sentences are then rated in congruence with conditional log-linear models, which were trained by means of the 24 TREC 1999/2000 definition questions plus the 25 definition evaluation questions given in 2001. The elements fused into their models combine the following attributes (a small sketch of the IDF-overlap scoring and the first two attributes follows the list):

1. The IDF overlap score.

2. Raw count of overlapping terms with the query. These counts disregard stop-words.

3. The count of word bigrams in common with the prompted question.

4. This definition QA system focuses predominantly on named entities as answers. Ac- cordingly, it first determines an open-domain phrase from the query that describes the entity being sought (e.g., “first man on the moon”). This attribute encodes the raw term count in this phrase.

5. Raw count of words interpreted as salient by the question analysis phase.

6. The number of words that could be synonyms of query terms. These features are computed with respect to WordNet.

7. Raw count of words that could be antonyms of question terms. These attributes are estimated in relation to WordNet.

8. The count of words in common between the candidate itself and the question.

9. Number of characters between the candidate and a term from the previously determined open-domain phrase.

10. Number of characters between the candidate and the closest question word in the context.

11. The score assigned by WordNet.

12. A merge count implying the number of answers with the same text realisation.

13. A boolean attribute symbolising whether or not the answer matches the expected answer type.

14. A boolean feature signalling whether or not the answer is similar to the expected answer type.

15. Boolean attributes encoding twenty arbitrarily selected pairs of mismatching expected answer types and answer types.
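
The IDF-overlap pre-rating and the first two attributes of this list can be sketched as follows; the IDF table, the default value and the stop-word list are illustrative stand-ins for the collection statistics the actual system would derive from its corpus.

    STOP_WORDS = {"the", "a", "an", "of", "who", "is", "was", "on"}
    IDF = {"man": 4.1, "moon": 5.1, "first": 2.9, "apollo": 6.0}   # illustrative values
    DEFAULT_IDF = 3.0

    def tokens(text):
        return [w for w in text.lower().split() if w.isalpha()]

    def idf_overlap(candidate, question):
        """Feature 1: sum of IDF scores of candidate words that also occur in the question."""
        q = set(tokens(question))
        return sum(IDF.get(w, DEFAULT_IDF) for w in set(tokens(candidate)) if w in q)

    def raw_overlap(candidate, question):
        """Feature 2: raw count of overlapping non-stop-word terms."""
        q = set(tokens(question)) - STOP_WORDS
        return len((set(tokens(candidate)) - STOP_WORDS) & q)

    question = "Who was the first man on the moon"
    candidate = "Neil Armstrong was the first man to walk on the moon during the Apollo 11 landing"
    print(idf_overlap(candidate, question), raw_overlap(candidate, question))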

In fact, these features are aimed at factoid questions, but their definition QA module also exploited the same ranker to cope with definition queries. However, as a means of adapting their system to deal efficiently with this kind of question, they allow it to return 90-character windows around the answer candidate. In TREC 2005, [Burger and Bayer, 2005] benefited from features (1-4), (6), (8-11), and (13-15). These attributes acted coupled with the following two new ingredients:

1. The logarithm of the consolidation of frequency counts pertaining to identical answer candidates.

2. Average character-level similarity between one putative answer and all the others. This attribute helps textually similar candidates to support each other, which might be particularly useful for dates and other kinds of answers that have multiple formats and representations.

Definiendum type   Candidate type
PERSON             DATE
PERSON             YEAR
PERSON             PERSON
PERSON             LOCATION
PERSON             COUNTRY
PERSON             fragment
ORGANISATION       LOCATION
ORGANISATION       COUNTRY
ORGANISATION       PERSON
ORGANISATION       fragment
unknown            fragment

Table 7.8: List of mismatching combinations of definiendum/answer types utilised for definition questions (source [Burger and Bayer, 2005]).

In this track, [Burger and Bayer, 2005] made smarter use of their boolean attributes (15) by modelling potential combinations that can serve as answers to definition questions. Table 7.8 lists these pairs.

Additionally, [Burger and Bayer, 2005] enriched their definition QA system with crude heuristics for identifying short fragments of descriptions occurring in appositional contexts, consequently aiding their system in recognising some non-entity candidates. In brief, the best run of this system reaped an average F(3)-Score of 0.217 in the TREC 2005 track (median=0.156). That year, they trained their ME models using the question sets supplied from TREC 1999 through 2003, including the 25 TREC 2001 definition evaluation questions.

In the context of the TREC 2006 challenge, [Burger, 2006] made allowances for an array of features similar to [Burger and Bayer, 2005]. One difference is, for instance, the exclusion of attribute (6). In this track, they ran a definition QA system akin to the one used in TREC 2005, which typically retrieved the top fifty documents. Here, they trained ME models using the question sets from TREC 1999 through 2004, including the 25 TREC 2001 definition evaluation questions. The best average F(3)-Score reached a value of 0.156, while the median of this track across all participants was 0.125 and the best run reached a value of 0.250.

7.6 Ranking Pattern-Independent Descriptive Sentences

Fundamentally, [Miliaraki and Androutsopoulos, 2004, Androutsopoulos and Galanis, 2005] adopted a technique to rate fixed 250-character text snippets centred at the definiendum. These text windows were scored in agreement with models learnt from an array of previously annotated examples. Simply put, these samples consist of descriptions of definienda and general texts, both carrying the definiendum at their centres (see details in section 7.2). Contrary to this method, [Xu et al., 2005] ranked paragraphs and sentences that matched a group of pre-defined definition patterns (section 7.3), and [Fahmi and Bouma, 2006] focused solely on the copular structure in Dutch (section 7.4).

The following approach considers slackening the constraint that forces answer candidates to observe a set of pre-defined rules. This way, the performance of discriminant learning operating on unconstrained or pattern-independent open-domain descriptive sentences can be studied. From another angle, it is worth recalling here that [Xu et al., 2005] aimed specifically at technical terms, whereas [Fahmi and Bouma, 2006] targeted the medical domain.

Most of all, these definition QA systems differ from each other in the attributes they exploit for rating putative answers. Table 7.9 juxtaposes some distinctive characteristics of these four approaches.

                     section 7.2   section 7.3            section 7.4        section 7.6
answer granularity   snippets      paragraphs/sentences   sentences          sentences
type of answer       all           match patterns         copula Dutch       all
ranker               SVM           SVMs                   SVMs, ME, NB       ME
manual annotations   yes/no        yes                    yes                no
domain               open          technical              medical            open
size                 ∼20,000       ∼2,000                 ∼3,000             ∼1.5*10^6
surface features     yes           yes                    yes                yes
named entities       no            no                     yes                yes
chunking             no            no                     no                 yes
dependency trees     no            no                     subject position   yes
categories           2             2/3                    2                  2

Table 7.9: Comparison amongst different methodologies.

To begin with, it is fundamental to touch on the aspect of categorising sentences versus larger spans of text like paragraphs or fixed-length windows. The reason for singling out sentences as answer candidates instead of pre-specified length windows is two-fold: (a) by and large, sentences with resolved co-references embody the necessary context to understand their meaning; and (b) truncated definiendum-centred windows can trim essential descriptive content. With regard to paragraphs, one practical problem arises when dealing with target collections of documents from wide-ranging topics and from markedly different sources. On the one hand, in some cases, full paragraphs can be entire definitions, especially when they are sifted from biographical web pages. Conversely, this is much less probable when taking paragraphs from other sorts of web documents such as news and forums. Therefore, the requirement of trimming answer paragraphs will still exist, hence probably entailing a strategy that can discriminate descriptions at the sentence level, so that answers can keep their integrity after trimming. Of course, it is always feasible to present the answer to the user in conjunction with the non-descriptive content, at the expense of quality.

Secondly, [Fahmi and Bouma, 2006] conducted an empirical study likening diverse classifiers integrated with some fixed groups of properties, and the reported figures suggest that Maximum Entropy (ME) models are a good choice as a classifier of definitions. Thirdly, strategies vary from grouping answer candidates into two (“good” or “bad”) or three (adding “undecided” or “indifferent”) classes. Three classes are definitively more suitable when the training corpus is manually annotated; this way annotators can assign this third label whenever they disagree on the annotation of an example. Discrepancies between annotators can arise often when labelling, and this third category hence offers a workable solution to this problem, as the uncertainty can be omitted or weighed in the models. On the contrary, two groups seem to be more suitable when automatically labelling training data, which is naturally preferable because it allows the creation of massive training corpora. Typically, automatic annotation procedures operate at the definiendum level and are predicated on lexical overlaps with its related articles across KBs. The underlying idea here is that remarkably similar examples can be rendered positive, while extremely dissimilar samples can be allocated to the negative group. This method certainly does not bring about perfect accuracy, but the obtained corpus can, nevertheless, aid in improving the performance [Androutsopoulos and Galanis, 2005]. Thus, those examples that do not fall into either automatically generated category can be expunged. Allegedly, the automatic elimination of these examples is a much better option than creating a third group, since they can still convey very good definitions that partly overlap with descriptions in KBs about their respective definienda.

Alternatively, large-scale training corpora can be constructed by assuming that descriptions supplied by KBs about a definiendum are positive examples, while those very dissimilar to them are negative.
Then, the quality of the set of positive training sentences banks primarily on the structure of the KBs, and on the performance of the heuristic that harvests the descriptions therein. Of course, both factors also play a vital role in sifting positive examples from web pages, but in this case, the structure varies markedly from one web page to another, and analogously, an efficient heuristic can become more complex. This difference is critical because of the natural imbalance noted by [Androutsopoulos and Galanis, 2005]: positive examples are harder to find across web pages than negative sentences, hence accounting for a fixed structure and authoritative sources for getting positive examples is a decisive advantage.

Another prime consideration that must be taken into account is the diversity of the positive training set. On the one hand, ensuring relatively high levels of similarity between positive training sentences taken from the Internet and articles about a definiendum in KBs helps to build a dependable positive set. This boost in reliability is, on the other hand, at the expense of diversity in training sentences, because many web sentences with more novel descriptive content, in relation to their related articles across KBs, might be discarded, and many facets elucidated in KB articles are not necessarily covered by web sentences. Another aspect that makes the extraction of positive samples from web pages less attractive is that it requires experimentally tuning two different thresholds, while using positive sentences from KBs requires only one, for the negative set. Incidentally, one should also account for the fact that both experimental thresholds can vary considerably from one training definiendum to another. Thus, it is uncertain how to automatically set their optimal values, compelling a compromise between simplicity and accuracy. All things considered, experimentally setting the standard threshold(s) is the most attractive and workable solution. Detractors of this approach can argue that these positive sentences would have a strong bias in favour of some specific types of wordings. However, there is always an inherent portability problem when exploiting data-driven methods, and admittedly, this heavy bias is present in the first sentences of the articles; the wordings of posterior sentences, however, become less predictable, and they are therefore more likely to offer an ampler diversity of nugget types.

7.6.1 Corpus Acquisition

Extracting Reliable Positive Examples From Wikipedia

As a matter of fact, Wikipedia pages are made up of several sections, such as infoboxes, categories and the main body of the article. The body is usually the richest part in terms of descriptive information verbalised in natural language, and it is normally split into several thematic sections, which differ amongst definienda. The first section is called the abstract, because it typically puts into words some sort of summary of the most pertinent aspects of the definiendum, whereas other sections tend to supply noisier or more irrelevant pieces of information. Equally important, this section is commonly found across all articles. Since Wikipedia articles are semi-structured, simple heuristics can extract this section from each page. In many cases, Wikipedia also produces official abstracts, which are shingles consisting of one to three sentences laying out the essence of the respective definiendum. However, these official abstracts are insufficient and not heterogeneous enough to build efficient

classifiers capable of identifying a broad diversity of distinct descriptions. In addition, abstracts embodied in old snapshots of Wikipedia can be exploited, wherewith the coverage and the diversity of the positive set for some definienda can be widened, when needed.

Certainly, taking into account several Wikipedia resources alleviates the problem of finding few sentences in any single one of them, and more importantly, it mitigates potential obstacles when identifying the introductory section. As well as that, it lessens the impact of official abstracts bearing only structural and noisy information. Seemingly, these official abstracts are automatically extracted. In order to use these sentences for training, they are tokenised using OpenNLP, and in the case of duplicates, only one instance is left. To be more accurate, duplicate sentences are detected by simple string matching, and all sentences that do not overlap with any non-stop word belonging to the definiendum (the title of the page) are expunged. Prior to this last task, co-references need to be resolved. For this purpose, the replacements adopted by [Keselj and Cox, 2004, Abou-Assaleh et al., 2005] are utilised (see details in table 3.5 on page 70). All instances of the definiendum are substituted with a placeholder (CONCEPT); this way the learnt models avoid overfitting any strong dependence between some lexical properties of the training definienda and their definitions across the training set. Some positive pairs generated by this process are listed in table 7.10 (a small sketch of this preprocessing follows the table).

Definiendum         Training Sentences
A Handful of Dust   • CONCEPT is a novel by Evelyn Waugh published in 1934 .
A. G. Lafley        • Afterwards , CONCEPT studied at Harvard Business School , receiving CONCEPT ’s M.B.A. in 1977 .
Zvi Elpeleg         • CONCEPT later entered academia , becoming an Arabist at the Dayan Institute .
                    • CONCEPT was the military governor of Gaza , and was Israel ’s first military governor of the West Bank .

Table 7.10: Samples of positive examples harvested from Wikipedia abstracts.
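
A minimal sketch of this preprocessing step is given below, assuming a simple regex tokenisation instead of OpenNLP and omitting the alias replacements of table 3.5; only the placeholder substitution, the duplicate removal and the overlap filter are shown.

    import re

    STOP_WORDS = {"a", "an", "the", "of", "and"}

    def preprocess_positive(definiendum, sentences):
        """Replace mentions of the definiendum with CONCEPT, drop duplicates and
        drop sentences that share no non-stop-word with the definiendum."""
        keywords = {w.lower() for w in definiendum.split() if w.lower() not in STOP_WORDS}
        pattern = re.compile(re.escape(definiendum), re.IGNORECASE)
        seen, kept = set(), []
        for sentence in sentences:
            if not keywords & {w.lower() for w in re.findall(r"\w+", sentence)}:
                continue                          # no lexical overlap with the page title
            replaced = pattern.sub("CONCEPT", sentence)
            if replaced in seen:
                continue                          # exact duplicate
            seen.add(replaced)
            kept.append(replaced)
        return kept

    abstract = ["A Handful of Dust is a novel by Evelyn Waugh published in 1934.",
                "A Handful of Dust is a novel by Evelyn Waugh published in 1934.",
                "It was adapted for the screen in 1988."]
    print(preprocess_positive("A Handful of Dust", abstract))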

Obtaining Unlabelled Sentences

At this point, a group of definiendum-sentence pairs has been distilled from Wikipedia abstracts. The next step is then finding their counterparts across the Web. More precisely, unlabelled sentences carrying the definiendum are acquired by processing full documents fetched from the Internet by means of Yahoo! Search. Since the number of definienda determined in the previous step is huge (about 2,000,000), only definienda satisfying the following characteristics are submitted:

1. definienda consisting solely of numbers, letters and hyphens as well as periods.

2. definienda with more than two positive examples extracted in the previous step.

3. definienda not corresponding to purpose-built pages like lists (e.g., “List of economists") and categories (e.g., “Category of magmas").

In practice, a maximum of 100 hits were fetched per definiendum. Since there is a trade-off between the number of documents and the total download time, only hits embodying the exact wording of the definiendum within their web snippets were taken into consideration. This provided 3,810,512 documents relating to 292,185 different definienda. Subsequently, two kinds of hits were removed: (a) documents from Wikipedia, and (b) documents with a Multipurpose Internet Mail Extensions (MIME) type different from “text/html”. The underlying idea here is to prevent the set of unlabelled sentences from being polluted with binary snippets and/or positive examples. In the case of Wikipedia, rules designed to detect Wikipedia articles were implemented. The remaining hits were retrieved, tokenised and split into sentences via OpenNLP. Only sentences embracing the exact match of the respective definiendum were chosen, and accordingly, this matching definiendum was replaced with the same placeholder utilised for the positive examples. All in all, this procedure supplied about 1,600,000 unlabelled sentences pertaining to about 150,000 different definienda.
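
The hit-filtering step can be sketched as follows; the dictionary representation of a hit, with 'url', 'mime' and 'snippet' fields, is a simplifying assumption about the records actually returned by the search API.

    from urllib.parse import urlparse

    def keep_hit(hit, definiendum):
        """hit: dict with 'url', 'mime' and 'snippet' keys (an assumed, simplified record).
        Keep only text/html pages outside Wikipedia whose snippet contains the exact definiendum."""
        host = urlparse(hit["url"]).netloc.lower()
        if "wikipedia.org" in host:
            return False                    # avoid polluting the unlabelled set with positives
        if hit["mime"] != "text/html":
            return False                    # drop PDFs, images and other binary content
        return definiendum.lower() in hit["snippet"].lower()

    hits = [{"url": "https://en.wikipedia.org/wiki/Lond_Daer", "mime": "text/html",
             "snippet": "Lond Daer Enedh was a harbour ..."},
            {"url": "http://example.org/tolkien/geography.html", "mime": "text/html",
             "snippet": "... the great harbour of Lond Daer Enedh on the Gwathló ..."},
            {"url": "http://example.org/map.pdf", "mime": "application/pdf",
             "snippet": "Lond Daer Enedh"}]
    print([h["url"] for h in hits if keep_hit(h, "Lond Daer Enedh")])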

Labelling Unlabelled Training Sentences

There are several techniques intended for learning with positive and unlabelled data [Liu, 2006]. Most of these approaches are geared towards determining a set of reliable negative examples from the unlabelled samples, which is subsequently interpreted as the negative training set. Commonly, as aforementioned, the reliability of each negative example is measured in conformity with its resemblance to the positive set. In the case of definition QA, there are two possible ways of quantifying this similarity: (a) at the definiendum level, and (b) with respect to the entire positive set.

Intuitively, automatically annotating unlabelled samples at the definiendum level seems to be more advantageous for accommodating a larger degree of lexical ambiguity in the training material. A chief obstacle to judging the likeness with respect to the whole array of positive examples stems from describing words like actor and singer. This kind of term is highly likely to be found across many descriptions in Wikipedia or any other encyclopedia, making many unlabelled examples that carry these words create the misleading impression of being positive while they are actually negative examples. Since this kind of feature is not inherently discriminative, as it relies heavily on the context, its complete removal induces a bias in the classifier which diminishes the performance. Simply stated, the classifier will be likely to learn that all sentences bearing these terms must be tagged as definitions, which is definitively false. On the other hand, operating at the definiendum level ensures that this sort of lexical bias is attenuated across definienda, that is, the word organisation can occur in the negative examples of singers, and vice versa. However, the efficiency of this technique resides largely in the coverage supplied by Wikipedia for each definiendum in the positive set, and for this reason, unlabelled definitions with little lexical overlap, or corresponding to divergent senses, are hard to detect. A way of allaying this data sparseness is to account for several KBs or several snapshots of Wikipedia. All things considered, there is no magic bullet for this labelling issue; it is expected, however, that the lexical features contained in the mislabelled negative examples are much more prominent in the positive category.

As a means to automatically annotate unlabelled examples, a centroid vector is constructed for each definiendum. This vector is formed of the terms in the positive examples excluding: (a) stop-words, (b) punctuation, (c) the concept placeholder, and (d) one-character tokens. Term frequencies in the positive examples were used as weights. Each unlabelled sentence was thereafter scored in accordance with its cosine similarity to this centroid vector. This similarity was computed for fragments of twenty consecutive words, and the highest similarity was kept as the similarity of the whole sentence. Fixed-length fragments are utilised for dealing with long sentences carrying only little descriptive information. Some samples of ranked unlabelled sentences are shown in table 7.11. Sentences rating lower than an empirical threshold (0.2) were labelled as negative, and the remaining were kept unlabelled. This experimental threshold also assures a slight lexical overlap with the positive set. Higher values bring about a larger number of mislabelled negative examples.
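
A compact sketch of this centroid-based scoring is given below; the tokenisation, the term-frequency weighting and the twenty-word sliding window follow the description above, while the stop-word list and the normalisation details are simplified assumptions.

    import math
    import re
    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "of", "in", "is", "was", "and", "to"}

    def terms(sentence):
        return [w.lower() for w in re.findall(r"\w+", sentence)
                if w.lower() not in STOP_WORDS and w != "CONCEPT" and len(w) > 1]

    def centroid(positive_sentences):
        """Term-frequency centroid built from the positive examples of one definiendum."""
        return Counter(t for s in positive_sentences for t in terms(s))

    def cosine(counter_a, counter_b):
        common = set(counter_a) & set(counter_b)
        dot = sum(counter_a[t] * counter_b[t] for t in common)
        norm = math.sqrt(sum(v * v for v in counter_a.values())) * \
               math.sqrt(sum(v * v for v in counter_b.values()))
        return dot / norm if norm else 0.0

    def score(sentence, centre, window=20):
        """Highest cosine similarity over fragments of `window` consecutive words."""
        words = terms(sentence)
        fragments = [words[i:i + window] for i in range(max(1, len(words) - window + 1))]
        return max(cosine(Counter(f), centre) for f in fragments)

    # Usage: label an unlabelled sentence with the empirical threshold of 0.2.
    positives = ["CONCEPT is a great harbour at the mouth of the river Gwathló in Eriador ."]
    centre = centroid(positives)
    unlabelled = "Inside front cover has a map of Vinyalonde above a map of CONCEPT ."
    label = "negative" if score(unlabelled, centre) < 0.2 else "unlabelled"
    print(score(unlabelled, centre), label)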

Ranking    Unlabelled Sentence
0.000048   Maybe you ’ve confused it with CONCEPT , but it is nowhere where you put it
0.000052   Inside front cover has a map of Vinyalonde above a map of CONCEPT .
0.060302   In my new diorama , I show a deep-draught cargo vessel being built in the great Numenorean ship-yard of CONCEPT .
0.120623   Númenóreans are limited to coastal areas : they held CONCEPT on the Gwathló as a trading post , and Pelargir and Umbar were already founded .
0.235969   In the fiction of J.R.R. Tolkien , CONCEPT ( also spelt Ened ) was a great harbour in Eriador founded by the Númenóreans .
0.337174   It was also called CONCEPT meaning " Great Middle Haven " because it was located between the Grey Havens and Pelargir ( though Pelargir was not established until 2350 S.A. ) .
0.404519   Lond Daer , or CONCEPT is a great harbour at the mouth of the river Gwathló in Eriador.

Table 7.11: Some ranked unlabelled sentences for “Lond Daer Enedh”.

In substance, it was empirically observed for a group of fifty definienda that good values range from 0.2 to 0.3. However, the probability of polluting the negative set with definitions rises as the threshold becomes more aggressive. For this reason, a conservative value was selected. Additionally, sentences within the negative examples matching the following eight definition patterns were re-tagged as “unlabelled" (a small regex sketch of this matching follows the list):

1. (is|are|has been|have been|was|were) (a|the|an) ⇒ “AA Roadwatch is a service offered by the New Zealand Automobile Association, which provides a guide to travel and motoring in New Zealand, with traffic alerts, car parking."

2. , (a|an|the) (,|.) ⇒ “IMSA, the Insurance Marketplace Standards Association, is a voluntary, non-profit organization founded in 1996 to strengthen consumer trust and confidence in the life insurance, long ..."

3. , or ⇒ “Instituto de los Mexicanos en el Exterior, or IME ..."

4. (|,) (|also|is|are) (called|named|nicknamed|known as) ⇒ “Norma Jean, a Country music singer, nicknamed ’Pretty Miss Norma Jean’ (or was it, as she sometimes wrote it, Norma Jeane) Baker."

5. () ⇒ “Indian Institute of Management (IIM) - Kolkata."

6. (become|became|becomes) ⇒ “AFC Bournemouth became the first community run club in 1997 after fans stepped in to prevent the club going into liquidation."

7. (|,) (which|that|who) ⇒ “Motlatsi Molapisi, who heads the Botswana People’s Party, urged his government to erect an electric fence along the common border, while the Botswana Congress Party urged the..."

8. (was born) ⇒ “Attac-France was born in 1999 and other Attac groups emerged soon afterwards in many other European countries."

In this scheme, a strict rule matching of the definiendum is taken into consideration; this means the sentence must start with the placeholder of the definiendum. In order to remove additional mislabelled negative examples, a list of common descriptive phrases across all Wikipedia abstracts is constructed. These phrases are derived from sentences aligning the first pattern. From these matching sentences, the description is extracted. Next, this definition is trimmed at the first verb or punctuation sign by means of simple heuristics. Only phrases consisting of more than one word are kept. For the purpose of getting reliable hits, these phrases must occur at least three times. As a result, about 46,000 distinct descriptive phrases were discovered, including:

Phrases                     Phrases                  Phrases
1920 drama film             Croatian novelist        Nigerian author
2nd album                   Czech chemist            anti-war activist
6th century manuscript      Debut album              charity organisation
8-bit character encoding    Democratic Senator       domestic airport
African American film       Evangelical Christian    herbaceous perennial plant
Albanian football club      General Secretary        independent film
Australian journalist       Gothic castle            leading mathematical physicist
Brazilian supermodel        High Priest              membrane protein
British ecologist           Inuit artist             newspaper journalist
Cameroonian football club   Labor candidate          prolific striker
Chinese actor               Liberal MP               second head coach

Table 7.12: Descriptive phrases used for reverting labels.

If any of these phrases appears within a window of five words to the left or to the right of the placeholder of the definiendum, then the label of the respective negative sentence is reverted to "unlabelled".
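The following is a small, hedged sketch of this label-reverting step; the whitespace tokenisation and the example phrase list are assumptions made only for illustration.

def revert_label(sentence, phrases, window=5, placeholder="CONCEPT"):
    # move a sentence back from "negative" to "unlabelled" if a descriptive phrase
    # occurs within five words to the left or right of the CONCEPT placeholder
    toks = sentence.split()
    positions = [i for i, t in enumerate(toks) if t == placeholder]
    for pos in positions:
        left = " ".join(toks[max(0, pos - window):pos])
        right = " ".join(toks[pos + 1:pos + 1 + window])
        if any(p in left or p in right for p in phrases):
            return "unlabelled"
    return "negative"

# Example with two of the phrases listed in table 7.12:
phrases = {"Croatian novelist", "charity organisation"}
print(revert_label("The Croatian novelist CONCEPT published two plays .", phrases))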

Building a Balanced Training Set

Unfortunately, the training sets built in the previous steps are imbalanced, that is, the number of positives and negatives markedly differs. Balanced training sets are necessary to avoid biases, prevent overfitting, and learn discriminative feature distributions. In the literature, several methods have been conceived to tackle this problem head-on; a survey can be found in [He and Garcia, 2009]. Alternatively, some ad-hoc purpose-built procedures can still be utilised in order to exploit some unique characteristics of a particular classification problem. Since these heuristics are easier to implement and the real benefit that can be reaped from more complex strategies is hitherto unknown, heuristics (or manual annotations) have been preferred so far (see section 7.2 and table 7.9).

Accordingly, a balanced training set is constructed by singling out approximately the same amount of positive and negative examples per definiendum. First, a base number of sentences k is stipulated as the smaller size of the two sets, and it is computed per definiendum. Second, the first k negative and positive sentences per definiendum are singled out for training. The order in the positive set is given by the article in Wikipedia, while in the negative set, by the similarity to the centroid vector. In the positive set, sentences are chosen from

the abstract of the article first, then the official abstract, and when needed, from abstracts in an extra snapshot of Wikipedia, whereas in the negative set, sentences with higher cosine similarity are preferred. Third, since some definienda provide more negative training data, while others yield more positive examples, a set was allowed to exceed the other by a maximum of 10% of k. This way, more training material is obtained and a balance is kept across definienda.

On the whole, this procedure supplied about 707,000 positive and about 736,000 negative examples. In order to check the degree of noise in the negative set, 1,000 randomly-selected sentences were manually inspected. Out of these, 13.5% were actual definitions, i.e., positive examples incorrectly labeled as negative. Typical examples in this group are sentences expressing a descriptive phrase not captured by the descriptive phrase acquisition algorithm, e.g., "Congresswoman CONCEPT". Conversely, 12.2% of 1,000 randomly chosen positive examples were non-definitions. In this array, some domains, such as music and movies, yielded more misleading sentences than other domains, typically due to Uniform Resource Locator (URL) names such as "Official site" and "IMDB". Another cause of mislabelling is wrong inferences during co-reference resolution, and, more fundamentally, some sentences were definitions, but they described another concept related to the definiendum.
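A sketch of the balancing heuristic, under the simplifying assumption that the positive and negative sentences of each definiendum are already ordered as described above (by their origin in Wikipedia and by centroid similarity, respectively):

def balance(positives, negatives, slack=0.10):
    # k is the size of the smaller set; the larger set may exceed k by at most 10% of k
    k = min(len(positives), len(negatives))
    limit = int(k * (1 + slack))
    return positives[:limit], negatives[:limit]

def build_training_set(per_definiendum):
    # per_definiendum: dict mapping definiendum -> (ordered positives, ordered negatives)
    pos_out, neg_out = [], []
    for definiendum, (pos, neg) in per_definiendum.items():
        p, n = balance(pos, neg)
        pos_out.extend(p)
        neg_out.extend(n)
    return pos_out, neg_out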

7.6.2 Sentence Level Features

Since the intention is to design strategies that work only at the sentence level, only attributes that can be extracted from each isolated sentence were considered for scoring candidate sentences. As discussed earlier, this kind of constraint is relevant to definition QA systems edging towards discovering answers within any kind of text shingle, document surrogate or full document. The list of ingredients distilled from the acquired training material is as follows:

Uppercased Words include terms that start with a capital letter or a number as attributes. Each word is associated with its frequency in the answer candidate when building its feature vector. During training, only elements with a frequency higher than two (across the training sets) are taken into consideration. Stop-words and one-character-length tokens are also removed.

Lowercased Words treat terms that start with a lowercase letter as features. Similarly to the previous attribute, each word is associated with its frequency within the putative answer when forming its respective feature vector. Likewise, only elements with a frequency higher than two were considered when training, and stop-words and one-character-length tokens were also removed. The motivation behind separating words that start with a capital letter or a digit is that such words commonly refer to the structure of the page rather than its content (e.g., "About", "Home", "Search", and "More").

Semantic Classes benefit from the SuperSense Tagger of [Ciaramita and Altun, 2006]1 for replacing sequences of terms with their corresponding semantic class. This tagger was configured with classes from the BBN Wall Street Journal (WSJ) Entity Corpus2. This comprises 105 distinct categories for entities, nominal concepts and numerical types. Some classes include:

1http://web.net/projects/supersensetag
2LDC catalog number LDC2005T33

Models                         I II III IV V VI VII VIII IX X Xa XI XIa XII XIIa XIII XIV XV
Uppercased words               x x x x x x x x
Lowercased words               x x x x x x x x x
Semantic classes               x x x x x x x x x x x x x
Concept positions              x x x x x x
Definition patterns            x x x x
Number of tokens + determiner  x x
Syntactic chunks               x x x x x x x x x
Selected substitution          x x x x x x x
Concept positions (chunk)      x x x x x x x
Words in concept (chunks)      x x x x x x x
Shallow-syntax LMs             x * x x x
Verb position (chunk)          x x x
Number of chunks               x x
Biterms in concept (chunks)    x

Table 7.13: Summary of models.

The motivation behind profiting from a semantic tagger is that it alleviates data sparseness by replacing uncommon names/words with their semantic category. An example sentence labeled by this tagger is as follows:

CONCEPT GPE:COUNTRY ORGANIZATION:CORPORATION

As with the other two ingredients, only classes with a frequency higher than two across the training set are considered.

Concept positions enrich the feature vector with information about the position of the definiendum in the candidate sentence. An element of the form "DefiniendumPosition=i" was added every time token i carried the placeholder "CONCEPT". For instance, the following sentence will generate the components "DefiniendumPosition=1" and "DefiniendumPosition=19":

CONCEPT

CONCEPT

The addition of these positional attributes can help draw a clearer distinction of some positional regularities across negative and positive examples.
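To make the first three attribute groups concrete, the following is an illustrative sketch of how uppercased words, lowercased words and concept positions could be turned into a feature vector; the helper name, the feature keys and the toy stop-word list are hypothetical, not taken from the thesis code.

from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and"}   # placeholder list

def sentence_features(tokens):
    feats = Counter()
    for i, tok in enumerate(tokens, start=1):
        if tok == "CONCEPT":
            feats["DefiniendumPosition=%d" % i] = 1
            continue
        if len(tok) < 2 or tok.lower() in STOPWORDS:
            continue
        if tok[0].isupper() or tok[0].isdigit():
            feats["UP_" + tok] += 1      # uppercased-word attribute with its frequency
        else:
            feats["LOW_" + tok] += 1     # lowercased-word attribute with its frequency
    return feats

print(sentence_features("CONCEPT is a science writer and lecturer".split()))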

Definition Patterns incorporate features that indicate whether or not wide-spread definition patterns align candidate sentences. Analogously to [Miliaraki and Androutsopoulos, 2004, Androutsopoulos and Galanis, 2005], eleven boolean attributes were added in consonance with eleven definition patterns, which were derived from the unification of [Xu et al., 2005, Cui et al., 2007] (see sections 7.3 and 4.2). To control spurious matches, an alignment is considered valid if and only if a maximum of three words exists between the beginning of the sentence and the placeholder of the definiendum.
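A simplified sketch of these boolean attributes is shown below; the two regular expressions only approximate two of the eleven patterns and are not the exact pattern set used in the thesis.

import re

PATTERNS = [
    re.compile(r"CONCEPT (is|are|was|were) (a|an|the) "),
    re.compile(r"CONCEPT , (a|an|the) "),
]

def pattern_features(sentence):
    # a match is only valid when at most three words precede the CONCEPT placeholder
    prefix = sentence.split("CONCEPT")[0].split()
    valid_position = len(prefix) <= 3
    return {("Pattern=%d" % i): int(valid_position and bool(p.search(sentence)))
            for i, p in enumerate(PATTERNS)}

print(pattern_features("CONCEPT is a British ecologist ."))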

Determiner+Number Of Tokens add two of the ingredients mixed by [Xu et al., 2005] (see section 7.3). The first element symbolises whether or not a determiner precedes the placeholder of the definiendum (feature (b) in section 7.3). The second attribute indicates the number of tokens in the sentence (attribute (j) in section 7.3).

Syntactic Chunks are recognised by means of MontyLingua3. Each chunk is modelled as a concatenation of its tokens, and is associated with its frequency within the sentence. Chunks consisting solely of punctuation and stop-words, as well as chunks with a frequency lower than three across the training set, were left unconsidered.

CONCEPT

Selected Substitutions (POS) replace words of the following Part-of-Speech (POS) categories with their corresponding POS tags: DT, CC, PRP, RB, MD, PDT, RBR, RBS, PRP$ and CD. In addition, a placeholder (VERB) was utilised for verbs typically seen in definitions: is/VBP, is/VBZ, are/VBP, were/VBD, was/VBD, become/VB, becomes/VBZ, became/VBD, have/VB, had/VBD, have/VBP and has/VBZ. The following sentence corresponds to the previous illustrative example with these replacements:

CONCEPT VERB DT

These replacements share the same spirit as the selective substitutions used by [Cui et al., 2004a] (see table 4.12 on page 93). These generalisations help to cope with data sparseness by abstracting over the training data.

Concept Position (chunks) equips the feature vectors with positional information at the chunk level. This is done by adding "DefiniendumChunkPosition=i" attributes indicating the chunk positions of definiendum placeholders. To illustrate, "DefiniendumChunkPosition=1" and "DefiniendumChunkPosition=6" are extracted from the next sentence:

CONCEPT CONCEPT

Words in Concept Chunks add words within definiendum chunks (e.g., "striking" and "appearance" in the previous working example) as features. Punctuation and terms co-occurring with CONCEPT fewer than three times across the training set are excluded. Whenever these ingredients are taken into account, words in other chunks are left unconsidered.

3For chunking and Part-of-Speech (POS) tagging, MontyLingua was used: http://web.media.mit.edu/~hugo/montylingua/

Shallow-syntax LMs include information from shallow-syntax language models (LMs) acquired from the training corpus. LM statistics are gathered as follows: first, sequences of seven chunks in length are identified across the training set. Only sequences having a middle chunk containing the placeholder of the definiendum are preserved. Subsequently, the three chunks to the left and to the right are counted across the training set. Some illustrative chunk pairs discovered are:

< CONCEPT > < CONCEPT >

<CONCEPT > < CONCEPT >

Only chunk patterns seen more than three times in training were considered. The difference between MODELS XII and XIIA (signalled with a star in Table 7.13) is that MODEL XIIA reduces each chunk to its last word. For instance, is seen as .
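A rough sketch of how these shallow-syntax LM statistics could be gathered is given below; the chunk representation (tuples of tokens) and the function name are assumptions, not the thesis implementation.

from collections import Counter

def chunk_lm_counts(chunked_sentences, reduce_to_last_word=False):
    # slide a window of seven chunks, keep windows whose middle chunk contains the
    # CONCEPT placeholder, and count the three chunks to its left and right
    counts = Counter()
    for chunks in chunked_sentences:          # each sentence: list of token tuples
        for i in range(3, len(chunks) - 3):
            if "CONCEPT" not in chunks[i]:
                continue
            context = chunks[i - 3:i] + chunks[i + 1:i + 4]
            if reduce_to_last_word:
                # the MODEL XIIa variant reduces each chunk to its last word
                context = [(c[-1],) for c in context]
            for chunk in context:
                counts[chunk] += 1
    # only chunk patterns seen more than three times are kept as features
    return Counter({c: n for c, n in counts.items() if n > 3})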

Verb position (chunk) incorporates positional attributes for definitional verbs. More precisely, the "VerbPosition=i" feature is added whenever chunk i carries the placeholder VERB. For example, the following sentence produces "VerbPosition=2":

CONCEPT VERB DTCD DATE:DATE

It is worth noting that this attribute is put together with selective substitutions, which give the placeholder VERB.

Number of Chunks adds a feature storing the amount of chunks within the sentence. For instance, “NumberOfChunks=6” in the previous example.

Biterms in Concept Chunk enriches the feature vector with bi-terms co-occurring with CONCEPT in the same chunk. Here, only bi-terms seen more than three times across the training set are taken into account. A final remark on attributes concerns their order of computation: if the model accounts for semantic classes and/or selective substitutions, these are computed first, and the other features are then extracted from the modified sentences.

7.6.3 Testing Sets and Baseline

Since the experiments carried out by [Fahmi and Bouma, 2006] showed that ME models are inclined to provide a better performance in a similar task, the implementation of these models supplied by OpenNLP was used for the next experiments. Accordingly, two test sets and a baseline were taken into account for assessing the models:

Set A (in-domain) consists of 5,064 sentences. This group was constructed akin to the training set, but for a different array of definienda. Like the training set, these sentences were automatically annotated with the same strategy (i.e., positive examples from Wikipedia and negative examples from the Web): 2,360 (46.60%) positive and 2,704 (53.39%) negative examples. This corpus is expected to have the same lexico-syntactic properties as the training dataset.

Set Baseline I I II III IV V VI VII A 44.77/44.23/44.17 71.64 64.42 71.15 73.10 74.22 74.58 73.40* B 56.26/57.77/56.73 60.87 51.66 62.13 57.68 58.59* 58.41* 57.00

Set VIII IX X Xa XI XIa XII XIIa XIII XIV XV A 74.86 73.87 73.40 72.17 76.83 75.31 76.85* 76.36 77.30* 77.46 77.30* B 58.49* 57.42* 58.56* 58.44* 58.38* 58.15 58.99 57.40 59.10 59.31 59.41

Table 7.14: Accuracy (%) accomplished by the different models and the baseline. For BASE- LINE I, three scores are shown pertaining to three distinct threshold values: 0.1, 0.2, and 0.3 (in order). Note that ME models were utilised as the ranker.

Set B (out-of-domain) comprises 5,165 sentences taken from the unlabelled set, that is, sentences harvested from the Internet, for which there was uncertainty as to whether or not they were negative examples. Accordingly, they were manually tagged: 2,345 (45.40%) positive and 2,820 (54.60%) negative examples. Because these sentences come from different resources than the training corpus (the Web only, instead of Wikipedia and the Internet) and the acquisition process was different, the characteristics of this corpus (both lexical and syntactic) are expected to be significantly dissimilar from the training dataset.

Baseline I (centroid vector) discriminates candidate sentences on the grounds of their similarity to the centroid vector of the respective definiendum (see section 4.8 on page 91). More precisely, since implementations differ slightly from one another, the blueprint presented in [Chen et al., 2006] was utilised for its construction. This centroid vector was built for each definiendum from a maximum of 330 web snippets, which were fetched by means of Yahoo! Search. As a search strategy, multiple queries per definiendum were submitted: (a) one query containing the definiendum and three task-specific cues: "biography", "dictionary" and "encyclopedia"; (b) ten additional queries conforming to the structures listed in section 2.5 (page 36). Accordingly, hits from Wikipedia were removed as they can be included in test set A. Subsequently, co-references are resolved in congruence with the replacement method of [Keselj and Cox, 2004, Abou-Assaleh et al., 2005]. The centroid vector is then built from the snippet sentences that embody the definiendum. At evaluation time, sentences whose similarity to the centroid vector is lower than a threshold are labeled as negative; all others are marked as positive. Experimentally, it was observed that values of this threshold in the interval [0.1, 0.3] performed the best.

7.6.4 Results and Discussion

Table 7.14 underlines the outcomes achieved by the baseline and the models presented in this section. In this table, each result is statistically significantly better than the next ranked score, unless it is noted with a star. This significance was computed on 20 samples generated using bootstrap resampling together with a t-test considering p-value < 0.05. In light of the figures in table 7.14, the following observations can be pointed out:

1. Results show that lexicalized models are crucial for definition QA systems. MODEL I finished with significantly higher scores than the baseline, for both testing corpora. This signifies that when enough training material is available, lexicalized features bring out good discriminative models.

Model           1       2       3       4       5       6       7       8       9       10
Baseline I(.1)  0.3925  0.4022  0.4153  0.4276  0.4425  0.4398  0.4388  0.4371  0.4389  0.4447
Baseline I(.2)  0.4032  0.394   0.4208  0.4347  0.4425  0.4418  0.4439  0.4425  0.4416  0.4454
Baseline I(.3)  0.3871  0.4049  0.4372  0.4418  0.4448  0.4458  0.4481  0.4449  0.4493  0.4454
Model I         0.9022  0.8729  0.858   0.8521  0.8329  0.8193  0.8151  0.7911  0.777   0.7603
Model II        0.7742  0.7556  0.7337  0.7274  0.7182  0.7097  0.7002  0.6966  0.6905  0.6905
Model III       0.8865  0.8583  0.8286  0.8323  0.8191  0.7953  0.7912  0.7806  0.7671  0.7581
Model IV        0.9027  0.9028* 0.8721  0.8598  0.83    0.81    0.7933  0.7722  0.7663  0.7607
Model V         0.9297* 0.9171  0.8928* 0.8735  0.842   0.8322  0.8125  0.7971  0.7869  0.7762
Model VI        0.9351  0.9116  0.8895  0.8735  0.841   0.83    0.8169  0.7984  0.7887  0.778
Model VII       0.9081  0.9028  0.869   0.8598  0.8329  0.8087  0.7943  0.7754  0.7634  0.7641
Model VIII      0.9297* 0.9144  0.8915* 0.8765  0.8475  0.8355  0.8232  0.8016  0.7896  0.7832
Model IX        0.8973  0.9     0.8728  0.8674  0.8367  0.8132  0.8014  0.7763  0.7707  0.7641
Model X         0.8602  0.8361  0.8363  0.8182  0.795   0.7958  0.7882  0.7779  0.7878  0.7754
Model Xa        0.8441  0.8169  0.8084  0.8076  0.8038  0.795   0.7867  0.7815  0.7769  0.7805
Model XI        0.9247  0.8852  0.8798  0.8772  0.8472  0.8377  0.8299  0.8205  0.8137  0.8056
Model XIa       0.9086  0.884   0.8508  0.8599  0.846   0.8366  0.8264  0.8152  0.8053  0.7935
Model XII       0.9086  0.888   0.8902  0.8869  0.8725* 0.8593* 0.8474* 0.8388  0.8376* 0.8273*
Model XIIa      0.9462* 0.8962  0.885   0.8668  0.8435  0.8366  0.8236  0.8103  0.7968  0.7935
Model XIII      0.914   0.888   0.8889  0.8931  0.872   0.8554* 0.8445  0.8388* 0.8393* 0.8234
Model XIV       0.914   0.8901  0.8927* 0.8931  0.8733* 0.8539  0.8494* 0.8406* 0.8372* 0.8293*
Model XV        0.9086  0.9011  0.8971* 0.8967* 0.8745* 0.8615* 0.8494* 0.8393* 0.8342  0.8254

Table 7.15: Average Precision at k = 1,..., 10 for each system and model (set A). For BASE- LINE I, three scores are shown corresponding to three different threshold values 0.1, 0.2, and 0.3. The best results for every k are shown in bold face (ME models).

The performance suffers a steep drop for MODEL II, which is proof that words that begin with a capital letter contain essential information, regardless of the potential noise.

2. The outcomes also reveal that selective substitutions help to accomplish better results. In particular, POS tagging substitutions (MODELS X and XI) performed better than their counterparts without POS tagging information (MODELS XA and XIA) in both corpora. This is proof that POS tags assist in tackling data sparseness, and since they boosted the performance in both datasets, they aid in raising the portability of discriminative models.

3. With regard to Set A, MODEL XIV is the best model taking advantage of chunking, and it outperforms the best model without syntactic information (MODEL VIII) by 2.60%. Contrarily, the best performance for Set B is achieved by MODEL III, which makes use of no syntactic information at all. Given this difference, it can be concluded that syntactic information provides strong hints in text that is syntactically clean, such as the content of KBs, but it is unreliable in (out-of-domain) web content.

4. Observation 3 can be extended to definition patterns: MODELS VI and VII have a better performance in Set A than their counterparts without definitional patterns (MODELS V and IV, respectively). The opposite is true in Set B, which is additional proof that (out-of-domain) web content is structured differently than texts extracted from KBs.

5. As for Set B, MODEL III finished with the highest accuracy. This model generalises word sequences to their semantic class. As expected, semantic analysis cushions data sparsity out-of-domain by replacing infrequent words with their semantic class and correctly learning that some semantic classes, e.g., DATE:DATE, are correlated with definitions. For example, 168 sentences bearing DATE:DATE entities turned from wrongly to correctly classified in MODEL III, while only 113 did the opposite.

Model           1       2       3       4       5       6       7       8       9       10
Baseline I(.1)  0.5066  0.5385  0.5664  0.5629  0.5597  0.5497  0.5362  0.5221  0.5267  0.5306
Baseline I(.2)  0.5066  0.5405  0.5707  0.5644  0.5682  0.5607  0.5479  0.5443  0.5478  0.5503
Baseline I(.3)  0.5081  0.5445  0.5724  0.5727  0.5816  0.5724  0.5584  0.5527  0.5581  0.5624
Model I         0.6392  0.6526  0.6453  0.6304  0.6333  0.6596* 0.6349* 0.6463* 0.608*  0.6226*
Model II        0.6481  0.6398  0.6457* 0.6164  0.6354* 0.6117  0.6133  0.6083  0.6063  0.6359*
Model III       0.6418  0.6617  0.6357  0.6511* 0.65*   0.6691* 0.6626  0.6587* 0.6593* 0.6179*
Model IV        0.6477  0.6342  0.5833  0.5509  0.5938  0.5379  0.5625  0.5694  0.5802  0.46
Model V         0.6748* 0.6656  0.6106  0.5833  0.6244  0.6369  0.5786  0.625   0.619*  0.5875
Model VI        0.674*  0.6667* 0.6087  0.5755  0.627   0.6319* 0.5882  0.5769  0.5397  0.5429
Model VII       0.6341  0.6213  0.5733  0.5469  0.5481  0.4907  0.4805  0.4643  0.3889  0.2333
Model VIII      0.6563  0.6526  0.6392  0.6154  0.6333  0.6304  0.6111  0.6083  0.5432  0.5556
Model IX        0.6464  0.6076  0.594   0.587   0.5931  0.553   0.6154* 0.5417  0.463   0.4
Model X         0.6009  0.6189  0.6333  0.628   0.6232  0.6059  0.5956  0.5521  0.5613  0.5438
Model Xa        0.6207  0.6369  0.6182  0.6184  0.5929  0.5931  0.6022  0.6281  0.6493* 0.6286*
Model XI        0.6214  0.6425  0.6439  0.6263  0.6086  0.5774  0.5649  0.5368  0.5897  0.5429
Model XIa       0.6105  0.652   0.6038  0.5955  0.597   0.6064  0.5897  0.5847  0.5822  0.4929
Model XII       0.6639  0.6895* 0.6378  0.6224  0.6328  0.5761  0.5878  0.5968  0.5509  0.5947
Model XIIa      0.5926  0.6181  0.6366  0.6083  0.5941  0.5767  0.5615  0.5857  0.5299  0.5111
Model XIII      0.6734* 0.6694* 0.6527* 0.6432* 0.6349* 0.5833  0.6134  0.63*   0.5789  0.5647
Model XIV       0.6742* 0.6772* 0.6477* 0.6436* 0.6317  0.5814  0.5952  0.6343  0.5444  0.5765
Model XV        0.6618  0.6622  0.6478* 0.6374* 0.6486* 0.617   0.6163* 0.6078* 0.5397  0.5632

Table 7.16: Average Precision at k = 1,..., 10 for each system and model (set B). For BASE- LINE I, three scores are shown corresponding to three different threshold values 0.1, 0.2, and 0.3. The best results for every k are shown in bold face (ME models).

The same improvements were not seen on the in-domain corpus, where it can be conjectured that lexico-syntactic properties capture mostly the same signal as the semantic features due to the higher syntactic and lexical overlap with the training set. In other words, the generalisation of dates assists in alleviating the ambiguity of numbers and months when they are considered in isolation, as in MODEL I and Set B. This ambiguity is, conversely, mitigated by syntactic information in the case of Set A.

6. The best model in Set B scores over 15% lower than the best model in Set A. The examination of the corresponding confusion matrix for the results in Set B indicates that about 78% of errors are false negatives (definitions tagged as non-definitions). The reason for this is two-fold: (a) lack of coverage of the models, and (b) the structure of the text harvested from web content. For example, for MODEL III, most errors in Set B are generated by very short and very long sentences with little descriptive content. Specifically, in the case of positive sentences shorter than 200 characters, MODEL III labels 51.90% as negative, while this systematically increased to 62.86%, 65.67% and 73.41% when classifying sentences of 201–300, 301–400 and longer than 400 characters, respectively. In short, the training material does not supply enough samples with this syntactic property: sentences with only a small descriptive portion. This makes sense, because sentences across Wikipedia abstracts (positive samples) are likely to be entirely descriptive. Consequently, the large portion of non-descriptive content makes this sort of testing instance resemble the negative rather than the positive examples, which is why a negative label is misassigned to this type of concise description.

7. Surprisingly, the inclusion of the eleven definition patterns had a more significant repercussion on enhancing the recognition of answers that mismatch than of those that match these regularities. In the case of Set A, eight candidates that observe definition patterns

shifted from being wrongly to correctly labelled, while seven candidates did the opposite (MODELS IV and VII). This means only one (8−7) out of the sixteen overall improvements comes from answer candidates that align the definition patterns, whereas the remaining fifteen originate from other classes of sentences. The same picture is seen if MODELS V and VI are juxtaposed: nine answer candidates that observe definition patterns turned from wrongly to correctly labelled, while only four did the contrary. That is, solely five (9−4) out of the nineteen overall betterments emanate from putative answers that match definition clauses, while the other fourteen come from other kinds of candidates. In light of these figures, it can be concluded that definition cues were more likely to ameliorate the detection of in-domain answers that mismatch definition patterns. In the case of Set B, definition patterns tended to deteriorate the performance: eleven candidates that observe definition patterns went from correctly to wrongly labelled, whereas only one did the contrary (MODELS V and VI). When MODELS IV and VII are contrasted, twelve candidates turned from correctly to wrongly tagged, while only five did the opposite. This certainly shows that definition clauses negatively affect the performance for out-of-domain test cases. It is important to underscore that this counting also made allowances for those few sentences with more than three words between the placeholder of the definiendum and the beginning of the sentence, which is a restriction when aligning definition cues.

8. The best configuration for Set A capitalises on semantic classes and selected substitutions as well as several syntactic features: syntactic chunks, verb and concept positions, words correlated with the definiendum in the same chunk, shallow syntax LMs, and number of chunks. These elements assisted in boosting the accuracy from 71.64% to 77.46% in relation to the basic configuration encompassing only lexical features (MODEL I). This is evidence pointing to the positive contribution of the proposed attributes when dealing with in-domain test cases.

In addition, tables 7.15 and 7.16 yield another view of the achieved results. Both tables present the outcomes in terms of precision at k. In these tables, significance tests were performed utilising a two-tailed paired t-test at a 95% confidence interval on twenty samples of the corresponding test corpus. Here, each sample is obtained using bootstrap resampling, and has the same size as the original test corpus. That is to say, statistical significances were computed at twenty times the size of the test corpus; in the particular case of Set A, on 20 × 5,064 = 101,280 sentences. To neatly illustrate, the statistical significance implies that the precision at 1 of MODEL XIIA is not statistically different from the next ranked model, MODEL VI (Set A). Some remarks concerning the outcomes in tables 7.15 and 7.16:

1. At various ranking levels k, MODEL I significantly outperforms BASELINE I in terms of precision for both data-sets. This outcome reaffirms that training discriminative models on the acquired corpus is beneficial for ranking answer candidates to definition questions in general, particularly given the fact that simple attributes were used by MODEL I.

2. In both testing corpora, models intermixing more complex attributes outperformed MODEL I, corroborating that the feature study for definition QA systems is promising.

3. At most ranking levels k, MODELS XV and XIV give the best performance for Set A. Both models are equipped with most of the presented attributes, hence confirming

their usefulness even though they operate on automatically acquired corpora. Conversely, MODEL III reaped the best results for k = 4 . . . 9 for Set B, whereas models that account for more syntactic information were inclined to accomplish better results solely for the top three ranking positions. This is due to the bias in favour of the few answer candidates in Set B that bear syntactic similarities with the training corpus.

4. The best outcomes for the top-two ranked answers are those of MODELS V and XIIA (Set A). The fact that both models are heavily lexicalised and profit from a very limited amount of syntax leads to the conclusion that top-ranked positions can be assigned by means of shallower features, while improving the ranking at lower positions (greater values of k) requires deeper syntactic information. This is in direct contradiction to the results obtained for out-of-domain testing cases.

Distinctly, MODEL V takes advantage of all terms as attributes in conjunction with the position of the definiendum within the sentence. This position plays the role of a very simple detector of the syntactic role of the definiendum. On the other hand, MODEL XIIA eliminates some sparsity caused by words through substitution with semantic classes or POS tags. This model also incorporates extra targeted lexicalisation through lexical patterns detected by reduced shallow syntactic LMs. Here, shallower is understood with respect to MODEL XII, which profits from full chunks. In a nutshell, in order to enhance the performance at lower ranking levels, deeper syntactic similarities with the training corpus are necessary.

5. Results achieved by MODEL I suggest that learning term distributions from large training corpora brings about better results than from localised and contextualised web snippets (as BASELINE I does), because a significant amount of redundancy is necessary to discover sentences that delineate the numerous facets of the definiendum. MODEL I, for instance, amalgamates the evidence yielded by all training definienda, thereby increasing the chances of distinguishing descriptions that are verbalised with words not found, or barely found, within web snippets. Consequently, this combination of evidence assists in reducing the lack of coverage, which usually markedly diminishes the performance of lexicalized definition QA systems [Zhang et al., 2005, Han et al., 2006].

In order to obtain a general view of the outcomes in terms of precision at k, the average precision was calculated as the average value over the k levels (k = 1,..., 10). Accordingly, figure 7.1 highlights the accomplished average precisions. Broadly speaking, these results substantiate the previous observations:

(a) In domain, the best models (MODELS XIII, XIV, and XV) take advantage of most of the presented features. These models almost double the performance reached by BASELINE I. On the other hand, a significant negative impact eventuates from the removal of words that start with a capital letter (MODEL II).

(b) Out of domain, MODEL III finishes with the best performance, suggesting the perti- nence of semantic class abstractions as a sparsity reduction technique.

On the whole, the inferences drawn from the training material did not fully port to the corpus extracted from the Internet. This means there are additional regularities that express descriptions, which are not typically found across KBs.

Figure 7.1: Overall comparison between the results obtained by the different models in terms of average precision.

7.6.5 Features Based on Dependency Trees

The previous study dissected the influence of assorted features on discriminant learning models. Specifically, these properties were inferred from chunks, named entities, POS tags, and surface information. In contrast, the next analysis extends the exploitation of dependency trees as a source for acquiring attributes. In detail, the following are the new ingredients distilled from the lexicalised dependency graph representation of the sentences contained in the training material:

Bigrams incorporate pairs of ordered connected words in lexicalised dependency trees into feature vectors. The order of these pairs is given by their hierarchy gov→dep. Lexicalised dependency graphs are obtained by means of the Stanford Dependency Parser4. For instance, some illustrative attributes taken from the sentence "CONCEPT is a science writer and lecturer on environmental matters." are as follows (see also figure 7.2):

ROOT→→ ROOT→→

ROOT→→CONCEPT

START→→ START→→

Since nodes closer to the root usually convey the general gist of the sentence, a placeholder was introduced (ROOT or START) as a means to discriminate paths starting at the root from paths starting at any other part of the tree. Each path was associated with its frequency in the sentence.

Trigrams enrich the feature vector with triplets of ordered connected words. Analogously to bigram attributes, these ingredients are extracted from lexicalised dependency paths, and correspondingly, they are represented in the same manner as bigrams. Some illustrative attributes belonging to the previous working sentence are as follows (see figure 7.2):

4This parser is available at: http://nlp.stanford.edu/software/lex-parser.shtml

Figure 7.2: Lexicalised dependency tree for the description "CONCEPT is a science writer and lecturer on environmental matters."

ROOT→→ →

START→→ →

Tetragrams equip the feature vector with tuples of four ordered and connected words. Analogously to trigram attributes, these ingredients are taken from lexicalised dependency paths, and accordingly, they are modelled akin to trigrams. An attribute with relation to the working sentence in figure 7.2 is ROOT→→ → → .
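The following hedged sketch illustrates how such dependency-path n-grams (bigrams, trigrams, tetragrams) could be extracted from a child-to-governor map; it is an approximation of the described attributes rather than the thesis code, which works directly on the Stanford parser output.

from collections import Counter

def path_ngrams(parent, n):
    """parent: dict mapping each word to its governor (the root maps to None)."""
    children = {}
    for child, gov in parent.items():
        children.setdefault(gov, []).append(child)

    feats = Counter()

    def walk(node, path):
        # enumerate all downward paths of exactly n connected words
        path = path + [node]
        if len(path) == n:
            prefix = "ROOT" if parent[path[0]] is None else "START"
            feats["->".join([prefix] + path)] += 1
        else:
            for c in children.get(node, []):
                walk(c, path)

    for node in parent:
        walk(node, [])
    return feats

# Toy tree for "CONCEPT is a science writer": writer is the root.
tree = {"writer": None, "CONCEPT": "writer", "is": "writer", "a": "writer", "science": "writer"}
print(path_ngrams(tree, 2))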

Root to definiendum path adds to the feature vector all dependency paths that go from the root node to all nodes carrying the placeholder of the definiendum. For instance, figure 7.2 provides the path ROOT→→CONCEPT, whereas figure 7.3 produces ROOT→ → →CONCEPT.

Definiendum-rooted subtrees augment the feature vector with two distinct attributes extracted from subtrees rooted at the definiendum. The first property conforms to paths from the placeholder to each sub-node. To exemplify, the following paths in figure 7.4 would be included: CONCEPT→ → and CONCEPT→ → . The second component regards extra bigrams in these subtrees such as → .

Root node inserts an attribute into the feature vector symbolising the word that occupies the root node role in the dependency tree (e.g., “RootNode=writer”, “RootNode=Signature” and “RootNode=popularize”).

Root-definiendum distance adds an element “ConceptLevelX=1” into the feature vector, where X denotes the number of nodes from the root to each instance of the definiendum.

Figure 7.3: Lexicalised dependency tree for the sentence “Sample Autograph Signature CONCEPT Turkish-born American novelist and screenwriter".

Compactness acts as an indicator of the global properties of the graph. This is an attribute of the form “GraphCompactness=X”, where X is worked out in concert with the following equation [Navigli and Lapata, 2007]:

Compactness(G) = \frac{Max - \sum_{u \in V} \sum_{v \in V} d(u, v)}{Max - Min}

In this formula, V stands for the set of nodes in the dependency graph G. Min stands for the minimum compactness, given by |V| * (|V| - 1), whereas Max denotes the maximum compactness, K * |V| * (|V| - 1). Here, K = |V| and d(u, v) is the length of the shortest path between the nodes u and v.
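A direct implementation sketch of this attribute is shown below; the graph is assumed to be given as an adjacency map, and unreachable pairs receive the maximum distance K, in line with the normalisation constants of the formula.

from collections import deque

def compactness(adjacency):
    """adjacency: dict mapping each node to an iterable of neighbouring nodes."""
    nodes = list(adjacency)
    n = len(nodes)
    if n < 2:
        return 1.0
    K = n                                   # unreachable pairs get the maximum distance K
    total = 0
    for source in nodes:
        # breadth-first search for shortest-path lengths from `source`
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.get(v, K) for v in nodes if v != source)
    max_c = K * n * (n - 1)
    min_c = n * (n - 1)
    return (max_c - total) / (max_c - min_c)

# Toy undirected graph for the tree in figure 7.2 (edges listed in both directions):
graph = {"writer": ["CONCEPT", "is", "a", "science"],
         "CONCEPT": ["writer"], "is": ["writer"], "a": ["writer"], "science": ["writer"]}
print(compactness(graph))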

Children of the root node paths put attributes into the feature vector signalling the path from the root to each of its children, that is, the first level of the dependency tree. This level is deemed to convey most of the gist of the sentence. In figure 7.4, this attribute adds: ROOT→ → and ROOT→ → .

Shallow predicates are similar to the previous attribute, but this one also considers the labels of the paths, and it also substitutes the root of the tree with a placeholder. This is aimed essentially at being a shallow predicate analysis of the general structure of the sentence, and the idea behind the replacement of the root node is to detect regularities that hold across numerous roots. In figure 7.4, this feature adds: ROOT→advmod→ and ROOT→csubj→ .

Syntactic rewritings equip the feature vector with some syntactic inferences about definitions bearing certain characteristics. These inferences are premised on rewritings of copular structures that can also entail descriptions (see, for example, figure 7.2). In order to check whether or not the definition observes a copular structure, the following two conditions are checked: (1) the existence of a node connected to the root that observes a cop dependency type (e.g., ROOT→→cop→ in figure 7.2); and (2) the definiendum is directly connected to the root node (e.g., ROOT→→nsubj→CONCEPT in figure 7.2). Note that the relation type nsubj could be enforced, but it is not always the case that this is the label given by the parser (see [Marneffe et al., 2006] for an exhaustive list of the possible labels). Whenever both conditions are fulfilled, the following elements are added:

Figure 7.4: Lexicalised dependency tree for the sentence "As a lecturer and writer CONCEPT helped popularize astronomy".

1. In the event of an amod relation between the root node and any of its children, the next feature is added: CONCEPT→ .

2. If there is any nn relation between the root and one of its children, then the following attributes are added: CONCEPT→ and →CONCEPT. For instance, figure 7.2 yields the path ROOT→→nn→ . In this case, the next two paths are inserted into the feature vector: CONCEPT→ and →CONCEPT.

3. The root node is inverted with the placeholder of the definiendum. To illustrate, the path →CONCEPT in figure 7.2 implies CONCEPT→writer. This, for example, would lead to augmenting the similarity to the following description: "American writer CONCEPT published his first novel X in 1978".

Root to semantic classes paths enrich the feature vector with the paths from the root to every placeholder pertaining to a semantic class. These semantic classes were annotated in concert with SuperSense Tagger [Ciaramita and Altun, 2006]5.

Root-definition verb distance is a property of the form "VerbLevel_Y=X", where Y corresponds to the position of a definition verb in the sentence, and X to the distance of the respective node from the root of the dependency tree. It is worth recalling that definition verbs are those replaced with a placeholder (VERB) when performing selected substitutions. This element can be used together with these replacements or independently.

Root position inserts an attribute of the form "RootPosition=X" into the feature vector, where X coincides with the position of the root node in the sentence. For example, in the sentence shown in figure 7.4, the corresponding attribute is "RootPosition=8".
5http://web.net/projects/supersensetag

Figure 7.5: Lexicalised dependency tree for the sentence “CONCEPT was a GPE:COUNTRY single-seat attack bomber of DATE:DATE, DATE:DATE and DATE:DATE".

Betweenness is based on a simplification of the notion explicated by [Navigli and Lapata, 2007]. This enriches the feature vector with an attribute of the shape "B_Y=X", where Y stands for a word within the sentence and X for the number of different directed paths between two distinct nodes that go through it. To neatly illustrate, consider the word "novelist" within the sentence in figure 7.3. This example includes four different directed paths going through the term "novelist":

→ → →

→ → →

→→

→ →

More exactly, the resulting property is as follows: "B_novelist=4". It is worth underlining that only paths that do not start or end with Y are taken into account when counting X. In short, more crucial nodes in the dependency graph are assumed to be involved in a higher number of paths.
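For a directed dependency tree, the count described above can be computed as the number of proper ancestors times the number of proper descendants of a node, since the unique directed path between two nodes passes through it exactly in that case; the sketch below relies on this simplification, and the function name and feature keys are hypothetical.

def betweenness_features(parent):
    """parent: dict mapping each word to its governor (the root maps to None)."""
    children = {}
    for child, gov in parent.items():
        children.setdefault(gov, []).append(child)

    def descendants(node):
        # number of nodes strictly below `node`
        return sum(1 + descendants(c) for c in children.get(node, []))

    def ancestors(node):
        # number of nodes strictly above `node`
        count, gov = 0, parent[node]
        while gov is not None:
            count, gov = count + 1, parent[gov]
        return count

    # attribute of the shape "B_Y=X", as in the text
    return {"B_%s=%d" % (node, ancestors(node) * descendants(node)): 1
            for node in parent}

tree = {"writer": None, "CONCEPT": "writer", "is": "writer",
        "novelist": "writer", "American": "novelist"}
print(betweenness_features(tree))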

Shallow predicate II (only labels) is an attribute that concatenates the root node with the sorts of relations with all its children. In this concatenation, the order of the labels is pre- served. For instance, the resulting features with relation to the sentences in figures 7.2 and 7.4 are: “writer_nsubj_cop_det_nn_cc_conj_prep” and “Signature_nn_nn_dep”. For a list of the potential types, the reader can refer to [Marneffe et al., 2006].

Verb Positions II incorporates positional attributes for definition verbs. This element is of the form "VerbPosition2=X", where X is the position in the sentence. An NLP-free version of this ingredient can be obtained by means of a tokeniser based on simple regular expressions, namely, on the occurrences of blank spaces within the sentence.

Features (without NLP)        Accuracy    Features (with NLP)              Accuracy
unigrams                      78.36%      bigrams                          78.67%
+definition patterns          78.87%      +unigrams                        81.48%
+verb positions II            79.11%      +shallow predicates              82.25%
+number of tokens             79.52%      +number of tokens                82.58%
+determiner                   79.56%      +shallow predicates II           82.86%
                                          +definiendum rooted subtrees     83.08%
                                          +root node                       83.20%

Table 7.17: Best feature configurations for Set A (with/without NLP-based attributes).


7.6.6 Results: Features Extracted from Lexicalised Dependency Graphs

By and large, the previous models integrate various lexical, semantic and syntactic features; in particular, syntactic attributes extracted from the structures produced by chunking. However, an alternative source of syntactic knowledge is dependency trees, which, in essence, produce a labelled graph representation of a sentence. Principally, the succeeding experiments explore the effects of features coming from this graph representation of the training and evaluation corpora. Note that the following assessments disregard attributes from chunking, because preliminary experiments showed that simple models equipped with bigram features from dependency graphs outperform the best configurations designed in the previous evaluation, and because of the memory constraints imposed by the servers where the experiments were carried out.

Another vital difference between the preceding and the following assessments regards the way attributes are selected. In the previous study, the combinations of features were manually chosen, whereas in the following experiments, the configurations are automatically formed via the greedy selection algorithm proposed by [Surdeanu et al., 2008]. The reason for this change lies in the substantial growth in the number of properties, which makes a manual procedure more time-consuming, whilst an automatic procedure is naturally preferable.

To be more specific, this procedure incrementally selects features without pre-assuming any of them as fixed. At each iteration, this algorithm tests all attributes individually, and singles out the one that yields the highest increment in performance. Once a property is chosen, it is kept fixed, while each of the remaining features is tried in conjunction with the fixed attributes. When no addition brings about an enhancement, the selection procedure stops, and the set of fixed properties is returned as the best set.

As a matter of fact, there are two extra remarks here: (a) computationally speaking, this selection process demands enormous resources, thus accuracy was the only metric considered when seeking the optimal group of attributes (potentially, one could also optimise for precision at k); and (b) the figures in relation to the same configuration may slightly vary with respect to the previous study, because surface features were inferred from the tokenisation returned by the dependency parser instead of the previous output given by OpenNLP. In substance, the underlying idea is to account solely for the linguistic information yielded by one tool, because processing each sentence with each NLP tool signifies a rise in the processing time and in memory usage (due to the distinct views of the training material). This is also a technical reason why features deduced from chunking were not included in this evaluation.
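A compact sketch of this greedy forward selection loop is given below; evaluate is a hypothetical callback that trains and scores a model (here, by accuracy) for a given feature configuration.

def greedy_selection(all_features, evaluate):
    selected = []
    best_score = evaluate(selected)          # score of the empty configuration
    while True:
        candidate, candidate_score = None, best_score
        # try every remaining feature together with the already-fixed ones
        for feature in all_features:
            if feature in selected:
                continue
            score = evaluate(selected + [feature])
            if score > candidate_score:
                candidate, candidate_score = feature, score
        if candidate is None:                # no addition improves the score: stop
            return selected, best_score
        selected.append(candidate)
        best_score = candidate_score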

Features (without NLP)        Accuracy    Features (with NLP)              Accuracy
unigrams                      60.52%      unigrams                         60.52%
+number of tokens             60.76%      +semantic classes                61.18%
+selective substitutions      60.94%      +selective substitutions         62.09%

Table 7.18: Best feature configurations for Set B (with/without NLP-based attributes).

In addition to Sets A and B, this new assessment capitalises on three extra arrays of testing sentences: A', TREC 2003 and TREC 2004. The first, in-domain, group was constructed analogously to Set A, but for a different battery of definienda. This pack is comprised of 5,982 sentences: 2,888 positive and 3,094 negative. Contrarily, both out-of-domain TREC sets were acquired from the AQUAINT corpus as follows. In the first place, this collection was indexed by Lucene, and the top one hundred documents were fetched by querying the definiendum to this IR engine. In the second place, these retrieved documents were split into paragraphs in conformity with the structure provided by the documents in the collection. In the third place, paragraphs containing a query term, excluding stop-words, were singled out, and co-references were accordingly resolved within these chosen paragraphs by means of JavaRap. In the fourth place, each selected paragraph was divided into sentences in concert with OpenNLP. Last, sentences carrying a query term, excluding stop-words, were returned as answer candidates. At this point, duplicates were eliminated by means of string comparisons, and some samples were missed because the dependency parser did not produce an output. Subsequently, sentences were manually annotated in consonance with the TREC ground truth. In detail, the TREC 2003 testing set encompassed 7,968 sentences: 1,091 positive and 6,877 negative, whereas TREC 2004 consisted of 10,565 instances: 1,011 positive and 9,554 negative.

Tables 7.17 to 7.21 underline the group of features discovered for each array of testing sentences. Each table is composed of two columns: one listing the set of properties obtained when considering all features, and the other when taking advantage solely of attributes at the surface level. For the sake of clarity, by surface features it is meant those properties that do not exploit dependency graphs, although, for the reasons already stated, the respective tokenisation was used for extracting these attributes (a very lightweight tokeniser/heuristic could be a proper replacement).

A bird's eye view of the figures considering NLP features points out three attributes that seem to be the most salient, hence discriminative, ones: unigrams, trigrams and the root node. To put it more accurately, these features were selected in three distinct configurations. Even though all three were not chosen together, the three potential pairs did appear in at least one set. Roughly speaking, one could conceive the triplet determined for Set A (i.e., unigrams, bigrams and root node) as an approximation of the combination of these three essential properties. The bottom line is that these are the most instrumental attributes, that is to say, they are the most powerful (of the list of tested features) for separating the wheat from the chaff. Further, given that lexical features are still picked together with NLP-oriented attributes, it can be concluded that these properties remain suitable, presumably, for enhancing the recognition of some borderline examples, ergo for tackling data sparseness.

In a special manner, an examination of the most frequent values for the attribute root node across the positive class reveals the following frequency distribution: CONCEPT (22,465), born (9,562), known (8,049), played (6,944), one (6,842), is (6,404), has (6,261), made (5,488), was (5,453), and had (5,305).
In the opposite category, one can observe the next frequencies: CONCEPT (58,967), is (7,517), said (7,466), are (3,376), was (3,047), found (2,888), have (2,881), had (2,612), made (2,504), and include (2,478). Simply put, half of the top ten most frequent

Features (without NLP)        Accuracy    Features (with NLP)              Accuracy
unigrams                      79.12%      bigrams                          82.68%
+concept positions            80.89%      +shallow predicates              84.72%
+number of tokens             81.13%      +unigrams                        85.64%
+definition patterns          81.83%      +trigrams                        85.87%
+verb positions II            81.85%      +definition patterns             85.99%
+determiner                   81.86%      +number of tokens                86.14%
                                          +root position                   86.38%
                                          +shallow predicates II           86.48%
                                          +syntactic rewritings            86.53%
                                          +compactness                     86.64%
                                          +root to definiendum distance    86.71%
                                          +determiner                      86.74%

Table 7.19: Best feature configurations for Set A’ (with/without NLP-based attributes).

values for this attribute permeate both classes, but their frequency distributions substantially differ. In effect, since both sets are balanced in terms of the number of examples, this difference proves that the negative group is more diverse in nature. As a means of verifying this, the number of distinct values for this attribute was counted: 84,387 and 32,667 for the negative and positive class, respectively. On top of that, values such as "said" and "include" can be found in the negative category, ratifying the observation of [Schlaefer et al., 2006] (see section 4.5 on page 87). More importantly, the five values that show up in both categories can be utilised for examining the trigrams that typify each group. Interestingly enough, some regularities emerge as a result of this inspection:

• Some prominent trigrams subsumed in the negative class that appear in tandem with a value of "was" for the root node element are listed below:

ROOT→ → →CONCEPT

ROOT→ →CONCEPT→

START→→ →

ROOT→ → →CONCEPT

START→ → →

These paths show prepositional phrases compounded by the placeholder of their respective training concepts. Accordingly, one can deem that this regularity symbolises sentences that do not convey a description, in general, especially when this is the sole instance of the placeholder within the sentence. To illustrate, consider the next examples:

CONCEPT

CONCEPT

In truth, these paths can also be utilised for expressing descriptive information, normally in the form of a brief description (one to three words) embodied in the same

Features (without NLP)        Accuracy    Features (with NLP)              Accuracy
lowercased terms              47.25%      betweenness                      53.21%
+selective substitutions      56.06%      +trigrams                        55.11%
+definition patterns          57.53%      +root node                       56.23%
+number of tokens             60.26%      +tetragrams                      56.73%
                                          +definition patterns             56.98%
                                          +compactness                     57.38%
                                          +root position                   57.87%

Table 7.20: Best feature configurations for TREC 2003 (with/without NLP-oriented at- tributes).

prepositional phrase (e.g., in newspapers). To exemplify, one can think about the case of descriptive adjectives preceding the placeholder. When the placeholder is isolated in the prepositional phrase, it is more likely to materialise non-descriptive contexts or a sentence that talks about a different topic. Additionally, the second example suggests that blogs/forums are potential sources of negative examples, as bloggers typically rename themselves after people they admire, or they simply bear the same name (also check the introductory sample). Still, definitions can also be found across this sort of web document. Other trigrams also signal some interesting findings; for instance, the expressions "Link on page" and "accused of murdering" are strong indicators of negative examples. Note that these paths do not start at the root of the lexicalised dependency tree. In the positive category, trigrams including ROOT→ → → , ROOT→ → →CONCEPT, ROOT→ → → can be found.

• In juxtaposition to the value "was", it can be observed that many genuine descriptions automatically labelled as negative are embodied in the value "made". Take, for instance, the sentence below and the next group of trigrams that are prevalent in a setting with root node equal to this value: ROOT→→ → , ROOT→ → → , ROOT→→ →CONCEPT.

CONCEPT

These properties are exploited in conjunction with other attributes that might be more delineative of the positive category, therefore decreasing the impact of these misannotations.

A second tier of features is given by those showing up in two distinct sets: bigrams, shallow predicates, number of tokens, shallow predicates II, definition patterns, root position, and betweenness. Interestingly enough, two groups of glued (correlating) attributes can be distinguished in this group. There is a first array of four properties that are conspicuous in in-domain testing sets: bigrams, shallow predicates, number of tokens and shallow predicates II. A second group of two features is visible in both in-domain and out-of-domain sets: root position and definition patterns. This cohesive set of properties reveals that (a) definition patterns and the root position are discriminative features in the in-domain case, but, more

Features (without NLP)        Accuracy    Features (with NLP)              Accuracy
lowercased terms              44.26%      betweenness                      50.62%
+selective substitutions      54.21%      +root node                       53.21%
+definition patterns          55.47%      +trigrams                        54.62%
+number of tokens             57.14%      +determiner                      57.25%

Table 7.21: Best feature configurations for TREC 2004 (with/without NLP-based attributes).

fundamentally, (b) they are pertinent for tackling sentences taken from the TREC corpus. This last observation coheres with the utilisation of these rules by the vast majority of systems taking part in the definition subtask of the TREC challenge (see section 4.2 on page 76). Another interesting finding is the portability of the betweenness attribute. Plainly speaking, the value of this property gets higher according to the number of paths that go through the corresponding node (word). That is to say, nodes with a high value can be perceived as the neural centre of the sentence, as they are indispensable to establish a relation between numerous pairs of words contained therein. On a side note, it is worth recalling here the Shortest Path Principle [Bunescu and Mooney, 2005]. From another viewpoint, regularities in terms of pairs, encompassing both a word and its numeric value, encode a synthesis of the shallow syntactic and semantic properties of the respective nodes. Hence, one can affirm that these shallow semantic and syntactic structures port to out-of-domain data. On a different note, two additional observations are: (a) the essentiality of the number of words can be deduced from the configurations obtained both with and without NLP; and (b) in the event of Set B, the two settings differ in a sole property: semantic classes are replaced by the number of tokens, bringing about a drop in performance from 62.09% to 60.94%. In brief, no syntax ported to Set B, whereas some syntactic knowledge did port to the testing sets emanating from the AQUAINT corpus.

To return to the point of the betweenness, there is a salient triplet of properties discovered for TREC testing sentences: trigrams, root node, and betweenness. The relation between root node and trigrams was dissected earlier. The connection of the root node to the betweenness of some sentence-level terms also produces very interesting findings:

1. Broadly speaking, sharp differences were observed between the pairs of values across both categories. Some conspicuous regularities across the positive category are:

(a) In many cases, when the root node corresponds to a sort of person, the word “born” gets a betweenness value of six: actor (793), player (751), politician (568), actress (439), footballer (299), singer (271), author (258), writer (234), runner (134), journalist (114) and member (99). Note that the number in parentheses indicates the frequency of the respective combination across the positive set. These co-occurrence patterns were not significantly observed in the negative class. Other noticeable values of the betweenness that materialise this kind of distribution are 10 and 12.


In short, these regularities solidify how the betweenness encapsulates some essential syntactic regularities. A case in point: one of the factors that increases the betweenness is the number of nodes in the sub-tree rooted at the word for which the betweenness is calculated. Ergo, a very frequent number of nodes (words) in the sub-tree signifies a characteristic (shallow) syntactic structure. One can notice, consequently, that a value of six symbolises succinct dates (e.g., “born November 17 1949”), while a value of 10 stands for lengthier information (e.g., “born November 11, 1960 in France”), and a value of 12 implies a wordier nugget (e.g., “born Jerrold David Friedman (January 24, 1944)”).

(b) In the same way, the word “based” strongly correlates with a betweenness of three when, at the same time, the value of the root node is directed at organisations or groups of people. One can empirically observe the following frequency distribution: team (85), club (68), company (62), band (44), airline (44) and newspaper (23). Also in this case, a value of four for the betweenness is very prominent.

To sum up, these regularities signal that the shallow syntactic information encapsulated by the betweenness feature is also applicable to other semantic relationships.

(c) From another angle, with a fixed value for the root node, some contextual regularities occur with fixed values for the betweenness. Consider the case of “species” and a value of eight: family (1836), bird (554), plant (358), fish (104), legume (97), frog (93), native (85), toad (74), rodent (53) and conifer (50). Note that the numbers in parentheses represent the corresponding frequencies across the positive training material.

(d) More relevant are the words “single” and “song”, which share a relation to the word “album” together with the following array of values: 6, 9, and 12. However, a more essential characteristic arises when examining terms such as “written” and “released”: the first is connected with “song” and the other with “single”. Both pairs bear the prominent values of three and six, showing their syntactic and semantic similarities across definitions.

(e) Superlatives, including “largest”, “smallest” and “worst”, as well as “best”, normally exhibit a betweenness of zero across both categories, but connected with different values of the root node, and with markedly different frequencies. In the case of adjectives like “big” and “large”, the same observation holds. This signals the inherent relation between both properties.

2. Essentially, this alliance between the values of the root node and the betweenness of some crucial words is weaker or more disperse in the negative class. Some relatively predominant tuples include: “found” with people (3) and helpful (3), “come” with work (14) and sources (3), and “feel” with article (4). Note that the numbers in parentheses refer to the betweenness, and the respective frequencies of these tuples range from 600 to 1,132.

3. Evidently, some root nodes only present noisy relations, some of which can be attributed to data sparseness.
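To make the betweenness attribute more tangible, the sketch below computes, for every word of a dependency tree, the number of word pairs whose (unique) connecting path passes through that word. This is only an illustration under the assumption that the thesis measure behaves like a raw (unnormalised) tree betweenness; the parse representation and the function names are invented for the example.

```python
# A minimal sketch of a tree betweenness measure: for each node, the number of
# word pairs whose (unique) connecting path in the dependency tree passes
# through that node. The exact formulation used in the thesis may differ;
# this only illustrates why larger governed sub-trees push the value up.

from collections import defaultdict

def betweenness(parse):
    """parse: list of {id, form, head}; head 0 marks the root."""
    children = defaultdict(list)
    for tok in parse:
        if tok["head"] != 0:
            children[tok["head"]].append(tok["id"])
    n = len(parse)

    def subtree_size(tid):
        return 1 + sum(subtree_size(c) for c in children[tid])

    scores = {}
    for tok in parse:
        tid = tok["id"]
        # Removing tid splits the tree into its child sub-trees plus the rest.
        parts = [subtree_size(c) for c in children[tid]]
        parts.append(n - 1 - sum(parts))          # component containing the head
        pairs_total = (n - 1) * (n - 2) // 2
        pairs_not_through = sum(s * (s - 1) // 2 for s in parts)
        scores[tok["form"]] = pairs_total - pairs_not_through
    return scores

# Toy usage: "CONCEPT is a French actor born in 1949" (heads are illustrative).
parse = [
    {"id": 1, "form": "CONCEPT", "head": 2},
    {"id": 2, "form": "is",      "head": 0},
    {"id": 3, "form": "a",       "head": 5},
    {"id": 4, "form": "French",  "head": 5},
    {"id": 5, "form": "actor",   "head": 2},
    {"id": 6, "form": "born",    "head": 5},
    {"id": 7, "form": "in",      "head": 6},
    {"id": 8, "form": "1949",    "head": 7},
]
print(betweenness(parse))   # leaf words such as "French" or "1949" score 0
```

Under this reading, leaf words (for instance, adjectives and superlatives such as those in item (e)) receive a value of zero, while the value of a word grows with the size of the sub-tree it governs, which coheres with the interpretation of the values 6, 10 and 12 given above.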

Type of Relations    Shallow Predicates
nn                   CONCEPT (5,481), football (3,531), rock (1,640), film (1,640).
partmod              known (2,922), located (2,573), born (1,710), playing (1,365), used (1,255), written (1,122), released (1,029), directed (945), based (826), working (765), starring (737), published (729), serving (715), founded (678), formed (678), making (634).
dobj                 CONCEPT (5,186), career (2,330), debut (1,247), number (1,043).
rcmod                played (1,571), plays (1,389), competed (545), served (503), known (463), won (425), born (351), worked (350), appeared (313).

Table 7.22: Most prominent shallow predicates with respect to some interesting dependency types (positive class).

In addition, another combination of properties that was found to be fruitful for recognising descriptions consists of bigrams and shallow predicates. The former encodes local lexico-syntactic relations between pairs of words, while the latter attempts to capture some regularities across second-level semantic relationships that are found under distinct root nodes (first level). Tables 7.22 and 7.23 parallel some prevalent features extracted by shallow predicates in agreement with four types of dependency relations: “nn”, “partmod”, “dobj” and “rcmod”. From the point of view of definition QA systems, these four relations seem to be the most interesting as they carry some semantic meaning. Above all, the contrast shows that the distributions in both categories are clearly different. Still, one can observe the term “playing” in both classes carrying the same relation type. Distinctly, this term is useful for understanding how bigram features interact with shallow predicates.

To be more precise, the shallow predicate anchored on “playing” simultaneously manifests with a group of bigrams that combine “playing” with a placeholder term, where this placeholder stands for five prepositions: “for”, “with”, “in” and “at” as well as “on”. This last observation holds for both categories, but in the positive class, one can also find some prominent relationships with “currently/professionally/mainly”, “as/from/before”, “guitar/piano/music” and “role” as well as “baseball/football”. In the opposite class, one cannot notice predominant substitutes for the placeholder other than the prepositions listed earlier. In light of this experimental observation, one can understand the synergy between both features when they are synthesised as a means to distinguish descriptions. (An illustrative sketch of these two feature families is given after Table 7.23.)

Type of Relations    Shallow Predicates
nn                   CONCEPT (13,084), Search (1,250), Home (1,139), New (875).
partmod              blog (825), using (378), featuring (285), making (203), playing (168), working (144), related (131), showing (119).
dobj                 CONCEPT (14,051), material (982), it (854), book (831).
rcmod                CONCEPT (503), feature (265), include (265), to (253), used (146), is (120), Buy (94), made (92), said (81).

Table 7.23: Most prominent shallow predicates with respect to some interesting dependency types (negative class).
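As a concrete, hedged illustration of how the two feature families contrasted in Tables 7.22 and 7.23 could be collected, the sketch below counts shallow predicates (a dependency relation of interest paired with the dependent word) and word bigrams over a parsed sentence. The parse format, the chosen relation set and all function names are assumptions for the example, not the thesis implementation.

```python
# A hedged sketch of how shallow-predicate and bigram features could be
# gathered; the thesis's exact feature definitions may differ. A shallow
# predicate here pairs a dependency relation of interest with the dependent's
# word, with the definiendum masked as CONCEPT.

from collections import Counter

RELATIONS_OF_INTEREST = {"nn", "partmod", "dobj", "rcmod"}

def shallow_predicates(parse, concept_tokens):
    """parse: list of {id, form, head, rel} as in the earlier sketches."""
    feats = Counter()
    for tok in parse:
        if tok["rel"] in RELATIONS_OF_INTEREST:
            form = "CONCEPT" if tok["form"] in concept_tokens else tok["form"]
            feats[(tok["rel"], form)] += 1
    return feats

def bigrams(tokens, concept_tokens):
    """tokens: the surface word sequence of the sentence."""
    masked = ["CONCEPT" if t in concept_tokens else t.lower() for t in tokens]
    padded = ["START"] + masked + ["END"]
    return Counter(zip(padded, padded[1:]))
```

Aggregating these counters per class over the training material yields frequency profiles in the spirit of Tables 7.22 and 7.23.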

As for NLP-free configurations, eight attributes proved to be useful; in particular, the number of tokens proved to be the most instrumental. This feature is embodied in all NLP-free settings, and further, it was chosen in conjunction with NLP-based properties. From another standpoint, this attribute indicates versatility, as it was incorporated into both in-domain and out-of-domain configurations. Figure 7.6 plots the distribution of the value of this property in both categories. Broadly speaking, this graph reveals that short sentences are more likely to be descriptive, while longer ones tend to be non-descriptive, with 35 tokens marking the turning point. Another vital attribute to be emphasised is the determiner, which was singled out in four out of the five testing cases. In actuality, [Xu et al., 2005] also benefited from two features homologous to the determiner and the number of tokens (see section 7.3). Given this evidence, it can be concluded that both properties are instrumental for rating definitions.

Incidentally, when features in Set A, A’ and B are compared with attributes in both TREC sets, it can be noticed that the first group capitalised on all lexical items, while the TREC sets relied solely on lowercased6 items. On this account, one can affirm that these uppercased items are essential for identifying descriptions across web pages, but they do not port to news articles. One can consequently postulate that capitalised words collected from the Internet (e.g., “Search”, “Home” and “Link”) do not significantly occur across news articles, ergo being of little use when tackling the AQUAINT corpus.

It can be further observed that definition patterns aided in the same four sets as the determiner, thus denoting that both features are complementary. Furthermore, the positions of the placeholder of the definiendum and of describing verbs proved to be of less relevance, and selective substitutions were fundamental only in out-of-domain sets. Substantially, these replacements share the same spirit as the semantic classes, which were useful solely for Set B. Simply put, the best NLP-free features coincide with the attributes typically exploited by QA systems in the definition subtask of the TREC challenge.

At any rate, while strong relations between properties, found across the distinct testing sets, indicate meaningful outcomes, it is also true that the final figures accomplished by those models are not. The underlying reason for this is that the greedy algorithm profits from the labels when automatically selecting the most propitious properties.

6 For the sake of simplicity, by uppercased/lowercased terms it is meant tokens that start with a capital/lowercase letter.

Figure 7.6: Number of samples vs. Number of tokens (reference: training material).

In a realistic scenario, this does not happen. As a means of drawing meaningful conclusions in terms of quantifying performance, additional experiments must be carried out. For this purpose, the group of features automatically determined for one set was employed to rate the testing instances embodied in the other four sets. This concerns both types of arrays of attributes, that is, with and without NLP-based properties.

In general terms, the figures are quite motivating (see table 7.24): both configurations acquired for Set A (i.e., with/without NLP) finished with the best outcome when tackling Set A’. The contrary also holds: the best performance for Set A was reached by means of the properties obtained for Set A’. Additionally, the features stipulated for TREC 2003 (with NLP) accomplished a competitive performance for both Set A and A’. Interestingly enough, the accuracy declined about 5% with respect to the top scores when profiting solely from NLP-free attributes. In light of this gap, one can quantify the trade-off between performance and computational speed.

On the other hand, one can find the diametrical opposite for out-of-domain data: NLP-free models finished with the best results in all three out-of-domain cases (see table 7.25). In the case of Set B, there is a slight variation in accuracy between NLP-free and NLP-oriented configurations. Conversely, a more noticeable detriment was manifested in the case of TREC 2003, more precisely, almost 4%. As a rule, analogously to the in-domain arrays of sentences, the best outcomes are substantiated by the group of attributes specified for the counterpart group of instances (i.e., TREC 2003 ↔ TREC 2004). This entails the dissonance amongst the lexico-syntactic properties of the testing sets, and it can also be corroborated by contrasting the different lists of features. All in all, the five models combined took advantage of about twenty of the proposed features, thus signifying the effectiveness of these properties. Of course, unselected attributes might aid in coping with testing sentences of another nature.

For the most part, an inspection of the outcomes with respect to the TREC sets unveils that errors chiefly emanated from negative instances tagged as positive.

Best Attributes Found for      Applied to Set A’    Applied to Set A
                               Accuracy             Accuracy
Set A without NLP              81.16%
Set A with NLP                 85.94%
Set A’ without NLP                                  78.28%
Set A’ with NLP                                     83.04%
Set B without NLP              76.86%               75.50%
Set B with NLP                 63.19%               74.61%
TREC 2003 without NLP          71.63%               69.22%
TREC 2003 with NLP             79.89%               76.29%
TREC 2004 without NLP          71.80%               69.18%
TREC 2004 with NLP             78.25%               74.77%
Baseline I (0.3)               54.13%               44.17%
Baseline I (0.2)               54.63%               44.23%
Baseline I (0.1)               56.25%               44.77%

Table 7.24: Accuracy obtained when applying the arrays of best features to in-domain sentences: Set A and A’.

In this respect, the confusion matrix reveals that 21.88% of the instances were negatives perceived as positive by the models, 6.33% were positives seen as negative, and 11.60% were unlabelled testing examples. These numbers regard the best outcome for the TREC 2003 set, and unlabelled samples refer to those that were attributed an equal probability for both categories. A sound and plausible reason for the large amount of false positives is given by [Hildebrandt et al., 2004]: many good descriptive sentences can still be outputted by the system, but since they are not embraced in the ground truth, they worsen the performance. By all means, the quantification of this diminishment relies on the criteria of the assessor(s) for singling out the true positives.

As to TREC 2004, one can observe in table 7.25 that the figures consistently lowered when applying the same model to both TREC sets. A reason for this consistent worsening is given by the context questions introduced in the 2004 subtask. These queries consist of a sequence of enquiries about a definiendum, and their types embrace factoid and list. As [Han et al., 2006] mentioned, the truth of the matter is that the answers to context questions can also be perceived as part of the ground truth of the corresponding definition query. In practice, the critical issue here is that definition QA systems are compelled to eliminate answers to the other context queries from the final response to the definition question. For this reason, these answers are left out of the pertaining ground truth. Since it is unfair to account for these right answers in the gold standard, given that TREC systems do not know them, the responses to these context queries were not taken into consideration in the assessment. Categorically, the influence of these answers is certainly detrimental, as some of these nuggets are identified (tagged as positive by the models), but their corresponding annotations in the ground truth are negative. In this regard, the confusion matrix confirms this situation: 25.33% of the negative testing examples were labelled as positive (an increment of 3.45% in relation to TREC 2003), while 4.62% of the positives were labelled as negative (a diminution of 1.71%). Here, it is also worth highlighting that 12.89% of the sentences were left unlabelled (a growth of 1.29%), that is, both classes were assigned the same probability.

With respect to BASELINE-I, this system slightly outperformed the best configurations for TREC 2003 and TREC 2004 by 0.71% and 0.99%, respectively.

Best Attributes Found for      Applied to Set B    Applied to TREC 2003    Applied to TREC 2004
                               Accuracy            Accuracy                Accuracy
Set A without NLP              59.86%              47.53%                  44.26%
Set A with NLP                 58.71%              49.57%                  49.27%
Set A’ without NLP             58.25%              35.79%                  30.53%
Set A’ with NLP                58.44%              51.64%                  50.00%
Set B without NLP                                  47.50%                  44.69%
Set B with NLP                                     45.68%                  40.89%
TREC 2003 without NLP          60.09%                                      57.15%
TREC 2003 with NLP             59.11%                                      56.04%
TREC 2004 without NLP          60.05%              60.14%
TREC 2004 with NLP             59.35%              56.24%
Baseline I (0.3)               56.26%              60.85%                  58.14%
Baseline I (0.2)               57.77%              51.69%                  46.34%
Baseline I (0.1)               56.73%              42.20%                  37.28%

Table 7.25: Accuracy obtained when applying the arrays of best features to out-of-domain sentences: Set B and TREC 2003 as well as TREC 2004.

On the surface, these figures appear to be modest outcomes, but one has to keep in mind that both testing sets are out-of-domain. More specifically, this implies that these results could be substantially improved by replacing the training material with some corpus acquired from the AQUAINT collection. Certainly, this will also involve the determination of the respective arrays of properties afterwards.
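The assessment just described boils down to a three-way decision per candidate sentence. The sketch below is a hedged illustration of that protocol, not the thesis code: a candidate is labelled according to the higher ME probability, left unlabelled when both classes receive the same probability, and accuracy plus a small confusion summary are reported. The function and variable names are invented for the example.

```python
# A minimal sketch of the evaluation protocol described above: each candidate
# sentence is labelled positive or negative according to the higher ME
# probability, left "unlabelled" when both classes receive the same
# probability, and accuracy plus a small confusion summary are computed.

def evaluate(probabilities, gold):
    """probabilities: list of (p_pos, p_neg); gold: list of 'pos'/'neg' labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0, "unlabelled": 0}
    for (p_pos, p_neg), label in zip(probabilities, gold):
        if p_pos == p_neg:
            counts["unlabelled"] += 1       # equal probability for both classes
            continue
        pred = "pos" if p_pos > p_neg else "neg"
        if pred == "pos":
            counts["tp" if label == "pos" else "fp"] += 1
        else:
            counts["tn" if label == "neg" else "fn"] += 1
    total = len(gold)
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return accuracy, counts
```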

7.7 Conclusions and Further Work

Generally speaking, this chapter discusses discriminant models for scoring answers to definition questions. There are four major challenges posed by the construction of this class of ranker: answer granularity, corpus acquisition, feature selection and the learning machinery. Indeed, each of these factors plays a pivotal role in the performance of answer extractors predicated on this sort of approach.

In the first place, this chapter stresses a strategy designed to tag text fragments of 250 characters in length, centred at the definiendum, as a definition or as general text. This ranker is premised on an SVM with surface features and manually annotated training material coming from the Internet. Later, this technique was revisited, and the manual annotation was substituted with an automatic labelling of positive and negative samples. These annotations conform to the similarity and dissimilarity of text fragments to the battery of descriptions encircled by KBs about the pertaining definiendum. Their list of features includes: term correlation, hypernyms, n-gram phrasal regularities, definition patterns and the ranking returned by the IR engine.

In the second place, this chapter touches on a methodology that capitalises on two distinct SVM rankers for rating answer candidates to definition queries in the technical domain. Notably, this study analyses two distinct levels of answer granularity: paragraph and sentence. More accurately, their putative answers match a pre-specified array of definition patterns, and in the event of paragraphs, the first sentence must align with these patterns. In a special manner, in some configurations, this approach benefits from three labels for the training instances: positive, negative and indifferent. As for the features of their classifiers, this method profits from determiners, pronouns, patterns, the number of sentences, words and adjectives, and frequent words after the definiendum in a window of fixed length. Their figures showed that three classes tend to enhance the quality of the final output, and their outcomes additionally suggest that rating sentences is a much more appropriate level of granularity than paragraphs, and that adjectives are inclined to hurt the performance.

In the third place, this chapter expands on the influence of various discriminative learning models on the ranking of copula sentences in Dutch. In this study, the primary goal is to discern definitions in the medical domain. For this reason, they harvested the medical index of Wikipedia, and manually annotated some samples. These models accounted for assorted properties including bag-of-words, bigrams, root forms, named entities in the subject, the position in the document, and some syntactic features derived from the dependency tree representation of the sentences in the corpus. Their figures show that Naïve Bayes and ME produced the best results. As to the ingredients of these classifiers, they found that root forms did not materialise an improvement, while named entities and an indicator of the position of the sentence did.

Lastly, this chapter turns to an approach targeted at recognising open-domain, pattern-independent answers (sentences) to definition questions. To put it more precisely, this strategy scores putative answers by means of ME models built on top of a host of automatically annotated examples. This corpus was obtained by balancing positive examples originating from Wikipedia and negative instances from the Internet.
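As a rough illustration of this last recipe, and only as an illustration, the following sketch trains a maximum-entropy classifier (here scikit-learn's logistic regression) on automatically labelled sentences with the definiendum masked as CONCEPT. The toy sentences are invented, scikit-learn is assumed to be available, and plain word n-grams stand in for the far richer attribute families studied in this chapter.

```python
# A hedged, minimal sketch of the general recipe: train a maximum-entropy
# (logistic regression) classifier on automatically labelled sentences,
# positives drawn from an encyclopaedic source and negatives from ordinary
# web text. The feature set here (word n-grams) is a simplification of the
# attributes analysed in this chapter.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

positives = ["CONCEPT was a Polish composer and virtuoso pianist ."]   # e.g. Wikipedia
negatives = ["Buy CONCEPT online with free shipping today ."]          # e.g. web snippets

texts = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)

vectoriser = CountVectorizer(ngram_range=(1, 3), lowercase=True)
X = vectoriser.fit_transform(texts)

ranker = LogisticRegression(max_iter=1000)
ranker.fit(X, labels)

# Candidate answers are then ordered by the probability of the positive class.
candidates = ["CONCEPT is a species of frog native to Chile ."]
scores = ranker.predict_proba(vectoriser.transform(candidates))[:, 1]
```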
An exciting aspect of this study regards an exhaustive inspection of numerous features dealing with testing sets of various natures. This analysis is divided into two parts:

1. The first part examines the performance of a set of configurations formed by attributes derived from chunking, named entities, POS tags, and other surface properties. Many configurations were tried as a means to rank answer candidates in two distinct groups: one in-domain and one out-of-domain. The former bears strong lexico-syntactic similarities to the training material, while the latter is quite dissimilar. In terms of features, some findings are as follows:

(a) Given the fact that the absence of words that start with a capital letter causes a steep decline in performance, one can conclude that they are pivotal for building efficient discriminant models.

(b) By the same token, experiments reveal that selective substitutions work well in accompanying syntactic knowledge deduced from chunked sentences. These replacements help to ameliorate portability and to combat data sparseness by generalising chunks.

(c) Syntactic information is fruitful for rating answer candidates that share the same lexico-syntactic properties with the training data.

(d) Conversely, semantic classes port to out-of-domain candidates, whereas syntactic information did not manifest an improvement when tackling out-of-domain candidate answers.

(e) Of the various kinds of named entities, the replacement of dates by a placeholder is the most beneficial. Presumably, this substitution allays the ambiguity of numbers.

(f) When scoring out-of-domain putative answers, the accuracy systematically deteriorated in tandem with the length of the candidates. To state it more exactly, the models failed to correctly recognise out-of-domain sentences with only a small descriptive portion.

(g) Unexpectedly, definition patterns had a greater impact on the recognition of answers that mismatch these regularities than on those providing alignment.

(h) In terms of ranking order, experiments indicate that syntactic knowledge betters the precision of the top positions of the ranking. More crucial is the fact that this was observed in both kinds of sets, but for the out-of-domain test set, only in a very slight way.

2. The second part of this study investigates the effects on ME models of numerous features emanating from the surface and the dependency tree views of the corpus. Contrary to the first part, this analysis profits from a greedy algorithm for automatically configuring the models. In a nutshell, this automatic configuration consists in systematically singling out the most effective attributes (a minimal sketch of this selection loop is given after the list below). This study also extends the number of arrays of sentences to five, which are utilised for assessing the impact of these features (two in-domain and three out-of-domain). A summary of the main conclusions is as follows:

(a) Overall, the obtained models show that three attributes proved to be the most instrumental: unigrams, trigrams and the root node. Further, the models unveil that these features are complementary, that is, they can be members of the same configuration. Note that the repeated use of unigrams is apparently a product of data sparseness.

(b) The values exhibited by the root node indicate considerably different frequency distributions across the positive and negative sets; therein lies the importance of this attribute. Plainly speaking, the root node often yields a substantial part of the semantics of the sentence.

(c) Exceptionally, if the definiendum is contained in isolation in a prepositional phrase and the semantics of the sentence is given by a root node equal to “was”, then one can observe that it is improbable that this sentence delineates a pertinent facet of the definiendum.

(d) Results point to a combination of definition patterns and the position of the root node as part of a battery of attributes effective in ranking in-domain answer candidates, and in tackling out-of-domain sentences taken from collections of news articles.

(e) By the same token, the obtained configurations show that the following attributes are also good for ranking in-domain answer candidates: bigrams, shallow predicates, number of tokens, and shallow predicates II.

(f) More interestingly, the root node and the betweenness were utilised for rating out-of-domain testing instances collected from news articles. This empirical fact signifies that the shallow semantic and syntactic structures encapsulated by both properties port to this type of collection.

(g) Further, the outcomes show that the integration of shallow predicates and bigrams helps to recognise descriptions. As a rule, the acquired models highlight the importance of n-gram properties. In this respect, they are very likely to be chosen; especially trigrams proved to be effective (discriminative) in ranking in-domain candidate sentences and putative answers originating from the collection of news documents. However, distinct configurations can take advantage of n-grams of different lengths.

(h) Above all, experiments confirm that the number of tokens is a leading feature. Particularly, this attribute proves to be instrumental in both NLP-free and NLP-oriented models, and it also ratified its versatility by being selected for assorted arrays of testing sentences (in-domain and all three out-of-domain).

(i) In summary, NLP-free configurations outperformed NLP-based configurations when coping with out-of-domain data, while the opposite holds for in-domain testing instances.

(j) Contrary to chunking, selective substitutions did not cooperate with properties taken from dependency trees.

(k) From a different viewpoint, the figures specify the gap (5%) between the performance of the best models with and without NLP-based features. This outcome is particularly relevant as it measures the trade-off between quality and speed. As a logical conclusion, it is plausible to build competitive systems capable of processing massive collections of documents by capitalising solely on surface features. In a real situation, the output of this class of system can be exploited for creating a purpose-built index, wherewith definition QA systems can fetch the top hits, and process them with more effective configurations afterwards.
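The greedy configuration procedure mentioned in item 2 above can be pictured as a simple forward-selection loop. The sketch below is only an illustration under the stated assumption that feature groups are added one at a time while held-out accuracy keeps improving; train_and_score is a hypothetical stand-in for training an ME model on the chosen groups and measuring its accuracy on a development set, not thesis code.

```python
# A minimal sketch of greedy forward feature-group selection: starting from
# the empty configuration, the feature group that most improves held-out
# accuracy is added at each step until no remaining group helps.
# train_and_score(groups) -> accuracy is assumed to be supplied by the caller.

def greedy_selection(all_groups, train_and_score):
    chosen, best_score = [], 0.0
    remaining = list(all_groups)
    while remaining:
        scored = [(train_and_score(chosen + [g]), g) for g in remaining]
        score, group = max(scored)
        if score <= best_score:
            break                      # no remaining group improves accuracy
        chosen.append(group)
        remaining.remove(group)
        best_score = score
    return chosen, best_score

# Example call (train_and_score must be provided by the experimenter):
# groups = ["unigrams", "bigrams", "trigrams", "root node", "betweenness"]
# best_groups, dev_accuracy = greedy_selection(groups, train_and_score)
```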

As future work, one can envisage more powerful approaches to acquiring balanced training material. As a matter of fact, a larger corpus is necessary to build context-specific discriminant models (e.g., for persons, organisations, and diseases). One can hypothesise that this level of abstraction can bring about an enhancement in terms of accuracy, and presumably also in terms of ranking order.

Glossary

Altavista A web search engine owned by Yahoo!, which provides web and newsgroup search, as well as paid submission services. Homepage: www..com, pp. 29, 193

answers.com A website that delivers answers by combining two answering approaches: (a) aggregations from various sources including dictionaries, encyclopedias, and atlases; and (b) a QA platform powered by the collaborative efforts of a global knowledge community. Homepage: wiki.answers.com, pp. 2, 30, 31

AQUAINT This corpus consists of newswire text data in English, drawn from three sources: the Xinhua News Service (People's Republic of China), the New York Times News Service, and the Associated Press Worldstream News Service. Homepage: www.ldc.upenn.edu/Catalog/docs/LDC2002T31/, pp. 5, 17, 18, 20, 28, 31, 32, 46, 55, 59, 75, 79, 81, 90, 91, 96, 101, 103–106, 108–110, 112, 115, 116, 122, 124, 125, 133, 143, 144, 149–151, 161, 166–168, 180, 181, 185, 222, 226, 229, 231

Britannica Encyclopedia It is a general English-language encyclopaedia, and it is the oldest encyclopaedia still in print in this language. Homepage: www.britannica.com, pp. 30, 32, 103, 244

Columbia Encyclopedia It is an electronic version of the encyclopedia produced by Columbia University Press, and it is available and licensed by several different companies for use over the World Wide Web. Homepage: www.bartleby.com, pp. 30, 31

EFE The EFE corpus of Spanish contains about 450,000 news articles from the EFE agency corresponding to the year 1994. The corpus is made available by the research group TALP of the Universitat Politécnica de Catalunya. Homepage: www.talp.upc.es, page 51


Excite An Internet portal, which was once one of the most recognised brands on the Internet. Nowadays, it provides a variety of services, including search, news, email, personals, and portfolio tracking. Homepage: www.excite.com, page 29

FreeLing It is an open source language analysis tool that can deal with numerous languages. FreeLing is designed to be used as an external library from any application, and it currently supports Spanish, Catalan, Galician, Italian, English, Welsh, Portuguese, and Asturian. Homepage: www.lsi.upc.edu/∼nlp/freeling/, pp. 136, 139

gold standard A group of ranked answers to an array of pre-specified definition questions. This is typically utilised for assessing different answering strategies, pp. 5, 15, 16, 18–22, 44, 52, 85, 124, 127, 128, 133, 134, 137, 141, 161, 182, 231

Google A web search engine owned by Google Inc. Nowadays, it is the most-used search engine on the Web, and it provides more than 22 special features beyond the original word-search capability. These include synonyms, maps, earthquake data, and sports scores. Homepage: www.google.com, pp. 26, 28–32, 35–37, 40, 41, 47, 78, 91, 103, 108, 148, 183, 244

Google n-grams It is a collection of sequences of words supplied by Google. These sequences include strings comprising from one to five tokens that are commonly found across the Internet. Homepage: googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html, pp. 40, 41, 44, 53, 55, 98, 129, 247

Google Timeline A feature of the Google search engine that produces a timeline of events related to a given input (e.g., a concept or definiendum). Homepage: www.google.com/views?q=Iran+view:timeline, pp. 29, 30, 32, 111, 244

ground truth See “Gold Standard”, pp. 5, 15, 16, 18–22, 51, 88, 124, 128, 133, 134, 137, 141, 148, 181, 182, 222, 230, 231

JavaRap An implementation of the Resolution of Anaphora Procedure (RAP) in the Java programming language. It resolves lexical anaphors, third person and pleonastic pronouns. Homepage: wing.comp.nus.edu.sg/∼qiu/NLPTools/JavaRAP.html, pp. 153, 159, 181, 182, 222

Lucene An open source text search engine library written in Java, suitable for nearly any ap- plication that requires full-text search. Homepage: lucene.apache.org, pp. 26, 31, 199

Merriam-Webster dictionary An online version of the Merriam-Webster's Collegiate Dictionary (11th Edition). It includes a main A-Z listing of terms, abbreviations, and biographical names. Homepage: www.merriam-webster.com, pp. 30–32, 109

MSN Search The current web search engine from Microsoft (a.k.a. Bing). It is the third largest by query volume. Homepage: www.live.com, pp. 26, 29, 37, 161

nugget A relevant fact or piece of information that is perceived as an answer to a definition question, pp. 5, 15–18, 20–22, 28, 38, 43, 44, 52, 82, 86, 88, 100, 101, 106, 107, 123, 127, 128, 141, 203, 227

OpenNLP A battery of NLP tools implemented in Java, which perform tasks such as sentence detection, tokenisation, POS-tagging, chunking, parsing, named-entity recognition, and co-reference resolution. Homepage: opennlp.sourceforge.net, pp. 181, 204, 211, 222

Oxford Dictionary A dictionary of English created by A. S. Hornby. Homepage: www.oup.com, pp. 1, 2, 11–13

Who2 An online database of brief biographies and vital statistics pertaining to celebrities, historical figures, and famous people. Homepage: www.who2.com, page 30

Wikipedia It is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Homepage: en.wikipedia.org, pp. VI, 2, 5, 10, 20, 21, 26, 28, 30–34, 40, 43–47, 50, 53, 55, 66–68, 70, 72, 84, 88, 91, 96, 103, 110, 111, 128, 129, 132, 133, 136, 138, 139, 142, 144, 153, 154, 160–162, 164, 166, 168, 169, 172, 184, 185, 197, 203–205, 207, 211, 212, 214, 233, 244, 247

WordNet A free and public lexical database for English. It clusters words into sets of synonyms, provides brief definitions, and records assorted semantic relations between these synonym sets. Homepage: wordnet.princeton.edu, pp. 28, 30–32, 46, 55, 59, 81, 91, 96, 106, 107, 141, 148, 159, 177–179, 185, 200, 244, 246, 247

Yahoo Search A web search engine, owned by Yahoo! Inc. It is the 2nd largest search engine on the web by query volume. Homepage: www.yahoo.com, pp. 26, 29, 34, 38, 42, 58, 204, 212

List of Acronyms

Application Programming Interface (API) An interface provided by a software program, which allows other applications to interact with it.

British National Corpus (BNC) A collection of 100 million words found in written and spoken language. These terms were harvested from a wide range of sources. Homepage: www.natcorp.ox.ac.uk

Cross Language Evaluation Forum (CLEF) A forum that promotes research and development in multilingual information access. It develops an infrastructure for the testing of different information retrieval systems operating on European languages in both monolingual and cross-language contexts. Homepage: www.clef-campaign.org

Expectation Maximisation (EM) In statistics, this algorithm is used for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. It is an iterative method which alternates between performing an expectation (E) step, which computes an expectation of the log likelihood with respect to the current estimate of the distribution for the latent variables, and a maximisation (M) step, which computes the parameters that maximise the expected log likelihood found in the E step. These parameters are then used to determine the distribution of the latent variables in the next E step. Source: Wikipedia article retrieved 16th Nov. 2009

HyperText Markup Language (HTML) It is the predominant markup language for web pages.

Inverse Document Frequency (IDF) It is a weight used in information retrieval and text mining, which measures how important a word is.

Information Retrieval (IR) It is the field that studies techniques for searching information within documents.


Knowledge Base (KB) It is an authoritative resource of descriptive information. Usually, it is specific, online and accessible by any system. Some examples are online encyclopedias and dictionaries.

Language Model (LM) It is a statistical model that assigns a probability distribution to sequences of words.

Latent Semantic Analysis (LSA) A technique in natural language processing for analysing relations between documents and their terms.

Mean Average Precision (MAP) It is a measure for assessing the ranking order of a given output (see details in section 1.7).

Maximum Entropy (ME) A categorisation framework for synthesising information from numerous heterogeneous sources.

Multipurpose Internet Mail Extensions (MIME) An internet standard.

Maximum Likelihood Estimate (MLE) It is a statistical method utilised for fitting a statistical model to data, and providing estimates for its parameters.

Named Entity Recogniser (NER) A tool for identifying words (normally sequences of words) that belong to some predetermined class.

National Institute of Standards and Technology (NIST) It is a measurement standards laboratory.

Natural Language Processing (NLP) The area, involving computer science and linguistics, directed at studying the relationship between computers and human languages.

Portable Document Format (PDF) A file format created by Adobe Systems for document exchange.

Profile Hidden Markov Model (PHMM) Statistical tools that can model the regularities across sequences. They allow position dependent insertion and deletion penalties.

Part-of-Speech (POS) A linguistic category of words.

Question Answering (QA) Abbreviation.

Singular Value Decomposition (SVD) A factorization of a matrix.

Support Vector Machine (SVM) A supervised learning method utilised predominantly for classification and regression, consisting basically of the construction of a hyperplane in a high-dimensional space.

Term Frequency-Inverse Document Frequency (TF-IDF) A weight often used in information retrieval and text mining.

Text REtrieval Conference (TREC) It is a conference that supports research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. According to: trec.nist.gov

Uniform Resource Locator (URL) It is a subset of the Uniform Resource Identifier. This specifies the location of an identified resource and the mechanism for retrieving it.

Wall Street Journal (WSJ) Abbreviation.

Index

answer sources web miner, 5 blogs, 4 coverage, 81 definitions documents, 3, 30 TREC, 14 web size, 52 characteristics forums, 4, 225 antonym, 179 homepages, 2 arbitrariness, 13 knowledge bases, 2, 30, 81, 111 context shift, 61 Britannica Encyclopedia, 104 examples, 13 Google Timeline, 29, 30, 111 hypernymy, 12, 175, 177 Google, 28, 30, 91, 96, 98, 102–104, meronymy, 12, 175, 178 108 paraphrase, 12 Wikipedia, 91, 96, 103, 110, 111 pertainym, 179 WordNet, 28, 30, 46, 91, 96, 106, 107 synonyms, 175 archives of events, 29 time dependency, 12 coverage, 32, 123, 149, 151 timeless, 11 domain specific, 34, 81 collective nouns, 40 essential definition set, 81 context, 10 impact, 32, 81, 91, 92, 96, 123, 149 accumulative meanings, 11 multi-linguality, 128 reference, 11 vs. documents, 33, 149 scope, 11 vs. web-snippets, 33, 123 markers, 84 newspapers, 2 definiendum-type, 84 web-snippets, 3, 30, 96, 102, 103, 125, adjectives, 195, 196 128, 161, 181, 190 direct speech, 88 advantages, 35 enumerations, 88 noise, 34 indirect speech, 88 optimal number, 36 numerals, 84, 85 relevance, 35 pronouns, 87, 195 truncations, 34, 164 superlatives, 84–86 verbs, 88 components news articles answer candidate selector, 5 first mention, 78 answer ranker, 5 ranking information retrieval engine, 5 markers, 88, 103, 165 knowledge base miner, 4 reasons, 10 query analysis, 4 topic shift, 39, 59, 64, 124 summariser, 5 hypernymy, 60

244 INDEX 245 evaluation betweenness, 220, 226 CLEF, 5 biterms, 131, 133, 145, 151, 161, 162, TREC, 5 211 extraction chunking, 210, 211 definiendum compactness, 218 first mention in news, 78 context indicator, 152–154, 159, 166 language portability, 63 definition patterns, 191, 195, 209, type, 81, 108, 148, 149, 152, 153 214, 225 definiendum matching, 87, 89 dependency relations, 144 F(β)-Score, 110 dependency trees, 152–154, 156, 158, Jaccard measure, 64, 154 181, 197, 198, 217, 223, 228 antecedent noun phrase, 61 determiner, 195, 198, 210, 229 base noun phrases, 62 entities, 83, 84, 88, 104, 122, 129, 132, definition markers, 67, 87 144, 153, 172, 197–199, 208, 213 definition phrases, 68 first mention in news, 78 hypernymy, 60 head word, 105 learning next sentences, 70 hypernym, 191 next pronouns, 69 length, 87, 102, 103, 107, 195, 210, next subject, 69 211, 214, 229 opinions and advertisements, 61, 64, morphological analysis, 84, 104, 107, 87 145, 161, 191, 198 sorts of next sentences, 71 n-grams, 102, 129, 136, 137, 145, 151, task-focused, 70 153, 156, 158, 181, 191, 198, 217, topic LM, 62 223, 228 coreference resolution negative markers, 87, 88, 105, 195 learning next sentences, 70 page layout, 107 next pronouns, 69 pairs, 131, 133, 136, 137, 145, 151, next subject, 69 161, 162 sorts of next sentences, 71 positive markers, 83–87, 103, 165 task-focused, 70 potential sense markers, 122 definition patterns predicates, 106, 107 POS-based, 79, 93, 104 propositions, 107 definiendum-type, 81 retrieval ranking, 83, 109, 145, 191 accuracy, 77, 80, 165 selective substitutions, 93, 104, 169, Dutch, 197 210, 213, 229 entity, 104 sentence order, 78, 107, 191, 194, 197, learning, 82, 93, 97, 98, 100, 104 198 lexico-syntactic, 63, 76, 80, 93, 98, tables, 107 116, 117, 153, 159, 191, 206 templates, 129 parse trees, 100, 105, 107 term co-occurrence, 77, 81, 88, 90, 92, soft patterns, 92, 93, 96 101, 102, 104, 105, 107–109, 121, soft vs. hard patterns, 93 129, 195, 210, 211 Spanish, 65, 127 term redundancy, 77, 81, 88, 96, 101, vs. collection size, 78 102, 104, 105, 107, 121, 129, 133, weights, 82, 95, 96, 100, 105 191 features triplet redundancy, 88 LM, 144, 210 triplets, 131, 133, 136, 137, 162 definiendum matching, 82, 104, 110, web snippets, 102 195, 200, 209, 210 word association norms, 169 adjectives, 195, 196 grammatical number 246 INDEX

effect, 41, 138 propositions, 107 inference, 40, 61 radial basis function, 111 morphological complexity, 53, 138 Ranking SVM, 194 knowledge bases, 81 relativity, 83 definiendum-type, 81, 108 retrieval, 83, 191 essential definition set, 81 topic model, 146 multi-linguality web snippets, 90, 96, 103, 123 knowledge bases, 128 word association norms, 131, 133, redundancy, 116, 117 136, 137, 162 ranking, 78, 105, 107, 108 relational database, 26 F(β)-Score, 110 ambiguity, 28 IDF, 92, 102, 109–111, 193, 199 extraction, 109 LM, 96, 102, 144–146, 153, 157, 161 pros and cons, 27 LSA, 122 training ME, 198, 201 IDF, 193 PHMM, 98 LM, 97, 144–146, 148, 153, 156 SVM, 111, 190, 194, 198 TF-IDF, 90 TF-IDF, 81, 102, 108 automatic labelling, 193, 204, 205, Jaccard measure, 64, 107 207 definiendum, 82, 104, 110 context models, 156 definiendum-type, 106–108 Dutch, 197 min-Adhoc, 145 greedy feature selection, 222 WordNet distance, 107 imbalance, 190 BLEU, 110 indifferent samples, 194 brevity penalty, 146 negative samples, 89, 106, 194, 207 centroid vector, 91, 96, 108, 161, 205, positive samples, 89, 93, 97, 98, 104, 212 129, 131, 148, 153, 154, 168, 172, context indicator, 159, 166 194, 197, 199, 200, 203–205 context models, 149, 153, 154, 157, Spanish, 90, 138 174, 181 word association norms, 131, 136, decision trees, 111 137 definition model, 146 definition patterns, 87, 92, 95, 96, information retrieval engine 101, 104, 107, 111, 191, 195, 206 scatter matches, 58, 59, 88 general texts, 106 WordNet synsets, 59 head word, 105 insertions and deletions, 59 knowledge bases, 81, 90, 92, 96, 103, permutation rules, 59 104, 106–111, 144, 148, 193 token deletion, 58 logistic regression, 111 token insertion, 58 markers, 83–85, 87, 165 expansion terms, 83, 90 mutual information, 90, 131, 145, 173 metrics Naïve Bayes, 198 R-precision, 23 negative samples, 190 F(β)-Score, 15 ordered centroid vector, 145 additional senses, 44 ordinal regression, 194 automatic semantic matching, 21 portability, 196, 213, 216, 225, 229, corpus consistency, 20, 182 230 coverage, 18, 20, 124, 133, 182 positive samples, 128, 190 inconsistency, 18 predicates, 106 nugget weights, 22, 43 INDEX 247

zero recall, 17, 43 Wikipedia first lines, 45 accuracy, 23 Wikipedia misspellings, 45 mean average precision, 23 Wikipedia redirections, 45 precision at k, 22 Wikipedia translations, 46 redundancy, 127 WordNet synsets, 46, 59 selection, 47 query analysis answer length abbreviations, 9 English, 38, 123 colloquial queries, 9 Spanish, 52, 127 contextual queries, 9, 18, 182 German, 53 explicit coverage, 7 grammatical number indirect requests, 7 effect, 41 misspellings, 7 inference, 40, 61 multiple definienda, 6 morphological complexity, 53 multiple clauses, 7 multiple search engines, 42 short queries, 8 rewriting ungrammaticality, 7 Google n-grams, 47 verbose, 8 aliases, 37 appositives, 37 sense discrimination copular, 36 LSA, 119 Spanish translations, 51 SVD, 119 synonyms, 37 TF-IDF, 119 was-born, 37 additional senses, 118 Spanish, 50 impact, 44, 60, 124, 125 specific keywords, 35 aliases, 120 learning, 35, 145 birth dates, 37 context indicators, 176 context models, 176 entity correlation, 121 semantic neighbors, 119 cosine similarity, 119 orthonormal basis, 120 potential sense markers, 120, 176 term correlation, 120 summariser redundancy metric, 127 paraphrases, 89, 135, 136, 168 types of definitions lexical, 13 operational, 14 persuasive, 14 precising, 13 recursive, 14 stipulative, 13 theoretical, 13 web miner alias resolution Index of Citations

Abou-Assaleh et al. [2005], 69, 70, 204, 212 Dempster et al. [1977], 98, 162 Ahn et al. [2004], 106, 107 Denicia-Carral et al. [2006], 50, 51 Ahn et al. [2005], 31, 32, 34 Echihabi et al. [2003], 31 Androutsopoulos and Galanis [2005], 123, Fahmi and Bouma [2006], 197–199, 201, 193–195, 197, 198, 201–203, 210 202, 211 Belkin and Goldsmith [2002], 146 Fernandes [2004], 26, 80, 109, 116, 150, 167 Bikel et al. [1999], 149 Figueroa and Atkinson [2009], 22, 64, 152– Bilenko et al. [2003], 45 154, 158, 159, 164, 165, 169 Blair-Goldensohn et al. [2003], 44, 108, 150 Figueroa and Atkinson [2010], 168 Boni and Manandhar [2003], 107 Figueroa and Neumann [2007], 16, 17, 36– Brown et al. [1991], 28 38, 51, 52, 60, 64–66, 69, 70, 117– Bunescu and Mooney [2005], 155, 226 129, 151, 164 Burger and Bayer [2005], 200, 201 Figueroa and Neumann [2008], 69 Burger [2003], 199 Figueroa et al. [2009], 36, 37, 117–126 Burger [2006], 201 Figueroa [2008a], 28, 45, 47, 174 Carbonell and Goldstein [1998], 91 Figueroa [2008b], 22, 52, 129–131, 134, 135 Chali and Joty [2007], 31, 32 Figueroa [2008c], 17, 39, 41, 42, 44 Chen and Goodman [1996], 157 Figueroa [2009], 63, 132, 136–140 Chen and Goodman [1998], 169 Firth [1957], 92, 103, 117, 191 Chen et al. [2006], 31, 35–37, 42, 55, 69, 70, Gaizauskas et al. [2003], 31, 36, 38 91, 126, 144–146, 151–154, 159, 161, Gaizauskas et al. [2004], 31, 32, 104, 105, 162, 177, 180–182, 212 122, 133 Cheng et al. [2009], 153 Gale et al. [1992], 28 Chiu et al. [2007], 168, 186 Goldstein et al. [2000], 91 Church and Hanks [1990], 131, 162 Goodman [2001], 157 Ciaramita and Altun [2006], 208, 220 H. Joho and M. Sanderson [2000], 37, 76– Cohen et al. [2003], 45 79, 82, 107, 109, 117, 122, 124, 127, Cui et al. [2004a], 92–96, 117, 170, 210 150, 153, 191 Cui et al. [2004b], 29, 91–93, 117, 161, 181 H. Joho and M. Sanderson [2001], 37, 78, Cui et al. [2004c], 31, 32, 36, 38, 82, 96, 117, 109, 117, 122, 124, 127, 150, 154 123, 141, 161 Han et al. [2004], 31, 69, 105, 106, 109, 111, Cui et al. [2005], 93, 96–98, 100, 117 117 Cui et al. [2007], 80, 93, 96–100, 143, 150, Han et al. [2005], 63, 69, 148 170, 210 Han et al. [2006], 15, 19, 63, 117, 146–154, Dagan et al. [1991], 28 159, 161, 176, 179–184, 216, 231 Dang et al. [2006], 83, 84, 102 Harabagiu et al. [2003], 108 Dang et al. [2007], 84, 102 Harman and Voorhees [2005], 15 Deerwester et al. [1990], 119 Harris [1954], 92, 103, 117, 191


He and Garcia [2009], 207 Rose and Levinson [2004], 6 Hickl et al. [2006], 31, 32 Roussinov et al. [2004], 88, 117, 122, 133 Hickl et al. [2007], 31, 32 Roussinov et al. [2005], 88, 89, 117, 122, 133 Hildebrandt et al. [2004], 18, 26, 28, 30–32, Sacaleanu et al. [2007], 53, 142 44, 50, 62, 79, 80, 109, 116, 122–124, Sacaleanu et al. [2008], 53, 142 127, 153, 159, 167, 231 Salton and Buckley [1988], 81 Jaccard [1912], 107 Salton and McGill [1983], 36 Jijkoun et al. [2003], 30, 31, 36 Savary and Jacquemin [2003], 158 Joachims [2002], 194 Scheible [2007], 86, 165 Juárez-Gonzalez et al. [2006], 51 Schlaefer et al. [2006], 88, 224 Kaisser and Becker [2004], 102 Schlaefer et al. [2007], 31–33, 47, 88, 103, Kaisser et al. [2006], 86, 88, 102, 103, 121, 121, 160 122, 160, 161, 165 Schölkopf and Smola [2002], 190 Katz et al. [2004], 26, 28, 31, 59 Shen et al. [2006], 31, 32 Katz et al. [2005], 109, 110, 116 Shen et al. [2007], 31, 32 Katz et al. [2006], 110, 116 Soubbotin [2001], 59, 153 Katz et al. [2007], 29, 31, 32, 110, 111, 116 Srikanth and Srihari [2002], 145 Keselj and Cox [2004], 69, 70, 204, 212 Sun et al. [2005], 30–32, 100, 117 Kikuchi et al. [2003], 101 Surdeanu et al. [2008], 222 Kil et al. [2005], 87, 88, 105 Surowiecki [2004], 54 Kintsch [1998], 119 Swartz [1997], 10, 13, 14 Kosseim et al. [2006], 26, 31–33, 83, 103 Vallin et al. [2005], 50, 51, 128 Lin and Demner-Fushman [2005], 21 Voorhees and Dang [2005], 82, 89, 100, 102 Lin and Demner-Fushman [2006], 17, 18, Voorhees [2003], 1, 15, 91, 124, 138, 169, 182 134, 161 Voorhees [2004], 81, 92, 182 Lin and Pantel [2001], 107, 155 Whittaker et al. [2005], 101 Lin [1994], 106 Whittaker et al. [2006], 102 Liu [2006], 205 Whittaker et al. [2007], 102 Lupsa and Tatar [2005], 28 Wiemer-Hastings and Zipitria [2001], 119 MacKay and Peto [1994], 148 Wittgenstein [1953], 1, 10, 11, 13 Magnini et al. [2006], 50 Wu et al. [2004], 28, 31, 32, 46, 59, 81, 117 Manning and Schütze [1999], 23, 99, 166 Wu et al. [2005a], 31, 32, 81, 82, 117 Marneffe et al. [2006], 220, 221 Wu et al. [2005b], 100, 101 Martínez-González et al. [2008], 129 Wu et al. [2007], 101 Miliaraki and Androutsopoulos [2004], Xu et al. [2003], 31, 35, 58, 69, 91, 107–109, 123, 190–195, 197, 201, 210 116, 117, 181 Montes-y-Gómez et al. [2005], 50, 51, 127, Xu et al. [2004], 31, 33, 35 128 Xu et al. [2005], 29, 36, 62, 80, 88, 194–197, Navigli and Lapata [2007], 219, 221 201, 210, 229 Papineni et al. [2002], 110, 111 Yang and Pedersen [1997], 192 Pa¸sca [2008], 70 Yang et al. [2003], 89, 90, 161 Prager et al. [2001], 192 Zhai and Lafferty [2004], 143, 148, 157 Prager et al. [2002], 192 Zhang et al. [2005], 31–33, 35, 81, 117, 149, Prager et al. [2003], 31 150, 161, 166, 181, 183, 184, 216 Purandare and Pedersen [2004], 28 Zhou et al. [2006], 31, 32, 82, 83, 117 Qiu et al. [2007], 26, 31, 32, 38, 144, 150 de Pablo-Sánchez et al. [2006], 129 Ravichandran and Hovy [2002], 35 de Pablo-Sánchez et al. [2007], 129 Razmara and Kosseim [2007], 84–86, 103, 165 Razmara et al. [2007], 31, 32, 84

Bibliography

T. Abou-Assaleh, N. Cercone, J. Doyle, V. Keselj, and C. Whidden. DalTREC 2005 QA System Jellyfish: Mark-and-Match Approach to Question Answering. In Proceedings of TREC 2005. NIST, 2005.

D. Ahn, V. Jijkoun, G. Mishne, K. Müller, M. de Rijke, and S. Schlobach. Using Wikipedia at the TREC QA Track. In Proceedings of TREC 2004. NIST, 2004.

K. Ahn, J. Bos, J. R. Curran, D. Kor, M. Nissim, and Bonnie Webber. Question Answering with QED at TREC–2005. In Proceedings of TREC 2005. NIST, 2005.

I. Androutsopoulos and D. Galanis. A practically Unsupervised Learning Method to Identify Single-Snippet Answers to Definition Questions on the web. In HLT/EMNLP, pages 323–330, 2005.

M. Belkin and J. Goldsmith. Using eigenvectors of the bigram graph to infer grammatical features and categories. In Proceedings of the Morphology/Phonology Learning Workshop of ACL-02, 2002.

D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what’s in a name. Machine Learning, 34(1-3):211–231, 1999.

M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, September/October 2003.

S. Blair-Goldensohn, K. R. McKeown, and A. H. Schlaikjer. A Hybrid Approach for QA Track Definitional Questions. In Proceedings of TREC 2003, pages 185–192. NIST, 2003.

M. D. Boni and S. Manandhar. The use of Sentence Similarity as a Semantic Relevance Metric for Question Answering. In Proceedings of the AAAI Symposium on New Directions in Question Answering, 2003.

P.F. Brown, S. A. D. Pietra, V.J. D. Pietra, and R. L. Mercer. Word-sense disambiguation using statistical methods. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 264–270, 1991.

R. Bunescu and R. J. Mooney. A Shortest Path Dependency Kernel for Relation Extraction. In Proceedings of HLT/EMNLP, 2005.

J. D. Burger. MITRE’s Qanda at TREC-12. In Proceedings of TREC 2003, pages 436–440. NIST, 2003.


J. D. Burger. MITRE’s Qanda at TREC-15. In Proceedings of TREC 2006. NIST, 2006.

J. D. Burger and S. Bayer. MITRE’s Qanda at TREC-14. In Proceedings of TREC 2005. NIST, 2005.

J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 335–336, 1998.

Y. Chali and S. R. Joty. University of Lethbridge's Participation in TREC–2007 QA Track. In Proceedings of TREC 2007. NIST, 2007.

S. Chen and J. Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310–318, 1996.

S. Chen and J. Goodman. An Empirical Study of Smoothing Techniques for Language Modeling. Technical report, Computer Science Group, Harvard University (TR-10-98), 1998.

Y. Chen, M. Zhou, and S. Wang. Reranking answers for definitional QA using language modeling. In Coling/ACL-2006, pages 1081–1088, 2006.

X. Cheng, P. Adolphs, F. Xu, H. Uszkoreit, and H. Li. Gossip galore: A self-learning agent for exchanging pop trivia. In Proceedings of the Demonstrations Session at EACL 2009. Association for Computational Linguistics, 3 2009.

A. Chiu, P. Poupart, and C. DiMarco. Generating Lexical Analogies Using Dependency Relations. In Proceedings of the 2007 Joint Conference on EMNLP and Computational Natural Language Learning, pages 561–570, 2007.

K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.

M. Ciaramita and Y. Altun. Broad Coverage Sense Disambiguation and Information Extraction with a Supersense Sequence Tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 594–602, 2006.

W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string metrics for matching names and records. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 73–78, 2003.

H. Cui, M.-Y. Kan, and T.-S. Chua. Unsupervised learning of soft patterns for definitional question answering. In Proceedings of the Thirteenth World Wide Web Conference (WWW 2004), pages 90–99, 2004a.

H. Cui, K. Li, R. Sun, T.-S. Chua, and M.-Y. Kan. National University of Singapore at the TREC 13 Question Answering Main Task. In Proceedings of TREC 2004. NIST, 2004b.

H. Cui, M. Kan, and T. Chua. Generic soft pattern models for definitional question answering. In Proceedings of SIGIR 2005, pages 384–391, 2005.

H. Cui, M.-Y. Kan, and T.-S. Chua. Soft pattern matching models for definitional question answering. ACM Trans. Inf. Syst., 25(2), 2007.

T. Cui, M. Kan, and J. Xiao. A comparative study on sentence retrieval for definitional question answering. In SIGIR Workshop on Information Retrieval for Question Answering (IR4QA), pages 383–390, 2004c.

I. Dagan, A. Itai, and U. Schwall. Two languages are more informative than one. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 130–137, 1991.

H. T. Dang, J. Lin, and D. Kelly. Overview of the TREC 2006 Question Answering Track. In Proceedings of TREC 2006. NIST, 2006.

H. T. Dang, D. Kelly, and J. Lin. Overview of the TREC 2007 Question Answering Track. In Proceedings of TREC 2007. NIST, 2007.

C. de Pablo-Sánchez, A. González-Ledesma, A. Moreno-Sandoval, and M. T. Vicente-Díez. MIRACLE Experiments in QA@CLEF 2006 in Spanish: Main Task, Real-Time QA and Exploratory QA Using Wikipedia (WiQA). In CLEF, pages 463–472, 2006.

C. de Pablo-Sánchez, J. L. Martínez-Fernández, A. González-Ledesma, D. Samy, P. Martínez, A. Moreno-Sandoval, and H. T. Al-Jumaily. Combining Wikipedia and Newswire Texts for Question Answering in Spanish. In CLEF, pages 352–355, 2007.

S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41:391–407, 1990.

A. Dempster, N. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.

C. Denicia-Carral, M. Montes-y-Gómez, L. Villaseñor-Pineda, and R. Hernández. A Text Mining Approach for Definition Question Answering. In FinTAL, pages 76–86, 2006.

A. Echihabi, U. Hermjakob, E. Hovy, D. Marcu, E. Melz, and D. Ravichandran. Multiple-Engine Question Answering in TextMap. In Proceedings of TREC 2003, pages 772–781. NIST, 2003.

I. Fahmi and G. Bouma. Learning to Identify Definitions using Syntactic Features. In Proceedings of the Workshop on Learning Structured Information in Natural Language Applications, 2006.

A. Fernandes. Answering definitional questions before they are asked. Master’s thesis, Massachusetts Institute of Technology, 2004.

A. Figueroa. Finding Answers to Definition Questions across the Spanish Web. In WWW ’09: WWW in Iberoamerica, an Alternate Track, 2009.

A. Figueroa. Mining Wikipedia Resources for Discovering Answers to List Questions in Web Snippets. In 4th International Conference on Semantics, Knowledge and Grid, pages 133–140, 2008a.

A. Figueroa. Mining Wikipedia for Discovering Multilingual Definitions on the Web. In 4th International Conference on Semantics, Knowledge and Grid, pages 125–132, 2008b.

A. Figueroa. Boosting the recall of descriptive phrases in web snippets. In LangTech 2008, 2008c.

A. Figueroa and J. Atkinson. Answering Definition Questions: Dealing with Data Sparseness in Lexicalised Dependency Trees-Based Language Models. In LNBIP 45, pages 297–310, 2010.

A. Figueroa and J. Atkinson. Using Dependency Paths For Answering Definition Questions on The Web. In 5th International Conference on Web Information Systems and Technologies, pages 643–650, 2009.

A. Figueroa and G. Neumann. A Multilingual Framework for Searching Definitions on Web Snippets. In KI, pages 144–159, 2007.

A. Figueroa and G. Neumann. Finding distinct answers in web snippets. In WEBIST (2), pages 26–33, 2008.

A. Figueroa, J. Atkinson, and G. Neumann. Searching for definitional answers on the web using surface patterns. IEEE Computer, 42(4):68–76, April 2009.

J. R. Firth. A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, pages 1–32, 1957.

R. Gaizauskas, M. A. Greenwood, M. Hepple, I. Roberts, H. Saggion, and M. Sargaison. The University of Sheffield’s TREC 2003 Q&A Experiments. In Proceedings of TREC 2003, pages 782–790. NIST, 2003.

R. Gaizauskas, M. A. Greenwood, M. Hepple, I. Roberts, and H. Saggion. The University of Sheffield’s TREC 2004 Q&A Experiments. In Proceedings of TREC 2004. NIST, 2004.

W. A. Gale, K. W. Church, and D. Yarowsky. One Sense Per Discourse. In Proceedings of the DARPA Speech and Natural Language Workshop, 1992.

J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop on Automatic summarization, pages 40–48, 2000.

J. T. Goodman. A bit of progress in language modeling. Computer Speech and Language, 15: 403–434, 2001.

H. Joho and M. Sanderson. Retrieving Descriptive Phrases from Large Amounts of Free Text. In 9th ACM conference on Information and Knowledge Management, pages 180–186, 2000.

H. Joho and M. Sanderson. Large Scale Testing of a Descriptive Phrase Finder. In 1st Human Language Technology Conference, pages 219–221, 2001.

K. Han, Y. Song, and H. Rim. Probabilistic model for definitional question answering. In Proceedings of SIGIR 2006, pages 212–219, 2006.

K.-S. Han, H. Chung, S.-B. Kim, Y.-I. Song, J.-Y. Lee, and H.-C. Rim. Korea University Question Answering System at TREC 2004. In Proceedings of TREC 2004. NIST, 2004.

K.-S. Han, Y. Song, S. Kim, and H. Rim. Phrase-based definitional question answering using definition terminology. In AIRS 2005, LNCS 3689, pages 246–259, 2005.

S. M. Harabagiu, D. I. Moldovan, C. Clark, M. Bowden, J. Williams, and J. Bensley. Answer mining by combining extraction techniques with abductive reasoning. In TREC, pages 375–382. NIST, 2003.

D. K. Harman and E. Voorhees. TREC: Experiment and Evaluation in Information Retrieval. MIT Press, 2005.

Z. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.

H. He and E. A. Garcia. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, September 2009.

A. Hickl, J. Williams, J. Bensley, K. Roberts, Y. Shi, and B. Rink. Question Answering with LCC’s CHAUCER at TREC 2006. In Proceedings of TREC 2006. NIST, 2006.

A. Hickl, K. Roberts, B. Rink, J. Bensley, T. Jungen, Y. Shi, and J. Williams. Question Answering with LCC’s CHAUCER-2 at TREC 2007. In Proceedings of TREC 2007. NIST, 2007.

W. Hildebrandt, B. Katz, and J. Lin. Answering Definition Questions Using Multiple Knowledge Sources. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 49–56, 2004.

P. Jaccard. The distribution of the flora of the alpine zone. New Phytologist, 11:37–50, 1912.

V. Jijkoun, G. Mishne, C. Monz, M. de Rijke, S. Schlobach, and O. Tsur. The University of Amsterdam at the TREC 2003 Question Answering Track. In Proceedings of TREC 2003, pages 586–593. NIST, 2003.

T. Joachims. Optimizing Search Engines Using Click-through Data. In 8th ACM Conference on Knowledge Discovery and Data Mining, 2002.

A. Juárez-Gonzalez, A. Téllez-Valero, C. Denicia-Carral, M. Montes-y-Gómez, and L. Villaseñor-Pineda. INAOE at CLEF 2006: Experiments in Spanish Question Answering. In Working Notes for the CLEF 2006 Workshop, 2006.

M. Kaisser and T. Becker. Question Answering by Searching Large Corpora with Linguistic Methods. In Proceedings of TREC 2004. NIST, 2004.

M. Kaisser, S. Scheible, and B. Webber. Experiments at the University of Edinburgh for the TREC 2006 QA Track. In Proceedings of TREC 2006. NIST, 2006.

B. Katz, M. Bilotti, S. Felshin, A. Fernandes, W. Hildebrandt, R. Katzir, J. Lin, D. Loreto, G. Marton, F. Mora, and O. Uzuner. Answering multiple questions on a topic from heterogeneous resources. In Proceedings of TREC 2004. NIST, 2004.

B. Katz, G. Marton, G. C. Borchardt, A. Brownell, S. Felshin, D. Loreto, J. Louis-Rosenberg, B. Lu, F. Mora, S. Stiller, Ö. Uzuner, and A. Wilcox. External knowledge sources for question answering. In Proceedings of TREC 2005. NIST, 2005.

B. Katz, G. Marton, S. Felshin, D. Loreto, B. Lu, F. Mora, Ö. Uzuner, M. McGraw-Herdeg, N. Cheung, A. Radul, Y. K. Shen, Y. Luo, and G. Zaccak. Question answering experiments and resources. In Proceedings of TREC 2006. NIST, 2006.

B. Katz, S. Felshin, G. Marton, F. Mora, Y. K. Shen, G. Zaccak, A. Ammar, E. Eisner, A. Turgut, and L. B. Westrick. CSAIL at TREC 2007 Question Answering. In Proceedings of TREC 2007. NIST, 2007.

V. Keselj and A. Cox. DalTREC 2004: Question Answering Using Regular Expression Rewriting. In Proceedings of TREC 2004. NIST, 2004.

T. Kikuchi, S. Furui, and C. Hori. Automatic speech summarization based on sentence extraction and compaction. In Proceedings of ICASSP, 2003.

J. H. Kil, L. Lloyd, and S. Skiena. Question Answering with Lydia (TREC 2005 QA track). In Proceedings of TREC 2005. NIST, 2005.

W. Kintsch. Predication. Cognitive Science, 25:173–202, 1998.

L. Kosseim, A. Beaudoin, A. Keighbadi, and M. Razmara. Concordia University at the TREC-15 QA track. In Proceedings of TREC 2006. NIST, 2006.

D. Lin. Principar – an efficient, broad-coverage, principle-based parser. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pages 484–488, 1994.

D. Lin and P. Pantel. Discovery of Inference Rules for Question Answering. Natural Language Engineering, 7:343–360, 2001.

J. Lin and D. Demner-Fushman. Automatically evaluating answers to definition questions. In Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), pages 931–938, October 2005.

J. Lin and D. Demner-Fushman. Will pyramids built of nuggets topple over? In Proceedings of the 2006 Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL 2006), pages 383–390, June 2006.

B. Liu. Web Data Mining. Springer, December 2006.

D. A. Lupsa and D. Tatar. Some Remarks about Feature Selection in Word Sense Discrimination for Romanian Language. Studia Univ. Babes-Bolyai, Informatica, L(2), 2005.

D. J. C. MacKay and L. C. B. Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(3):1–19, 1994.

B. Magnini, D. Giampiccolo, P. Forner, C. Ayache, P. Osenova, A. Peñas, V. Jijkoun, B. Sacaleanu, P. Rocha, and R. Sutcliffe. Overview of the CLEF 2006 Multilingual Question Answering Track. In Working Notes for the CLEF 2006 Workshop, 2006.

C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.

M.-C. de Marneffe, B. MacCartney, and C. D. Manning. Generating typed dependency parses from phrase structure parses. In LREC 2006, 2006.

Á. Martínez-González, C. de Pablo-Sánchez, C. Polo-Bayo, M. T. Vicente-Díez, P. Martínez-Fernández, and J. L. Martínez-Fernández. The MIRACLE Team at the CLEF 2008 Multilingual Question Answering Track. In CLEF, 2008.

S. Miliaraki and I. Androutsopoulos. Learning to identify single-snippet answers to definition questions. In Proceedings of the 20th International Conference on Computational Linguistics, pages 1360–1366, 2004.

M. Montes-y-Gómez, L. Villaseñor-Pineda, M. Pérez-Coutiño, J. M. Gómez-Soriano, E. Sanchis-Arnal, and P. Rosso. INAOE-UPV Joint Participation in CLEF 2005: Experiments in Monolingual Question Answering. In Working Notes for the CLEF 2005 Workshop, 2005.

R. Navigli and M. Lapata. Graph connectivity measures for unsupervised word sense disambiguation. In IJCAI-07, pages 1683–1688, 2007.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL2002), pages 311–318, 2002.

M. Paşca. Answering Definition Questions via Temporally-Anchored Text Snippets. In Proceedings of IJCNLP, 2008.

J. Prager, D. Radev, and K. Czuba. Answering What-Is Questions by Virtual Annotation. In 1st Human Language Technology Conference, pages 26–30, 2001.

J. Prager, J. Chu-Carroll, and K. Czuba. Use of WordNet Hypernyms for Answering What-Is Questions. In TREC-2001, 2002.

J. Prager, J. Chu-Carroll, K. Czuba, C. Welty, A. Ittycheriah, and R. Mahindru. IBM’s PIQUANT in TREC2003. In Proceedings of TREC 2003, pages 283–292. NIST, 2003.

A. Purandare and T. Pedersen. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. In HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), pages 41–48, 2004.

X. Qiu, B. Li, C. Shen, L. Wu, X. Huang, and Y. Zhou. FDUQA on TREC2007 QA Track. In Proceedings of TREC 2007. NIST, 2007.

D. Ravichandran and E. Hovy. Learning Surface Text Patterns for a Question Answering System. In Proceedings of the 40th Annual Meeting of the ACL, pages 41–47, 2002.

M. Razmara and L. Kosseim. A little known fact is . . . answering other questions using interest-markers. In Proceedings of the 8th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2007), pages 518–529, 2007.

M. Razmara, A. Fee, and L. Kosseim. Concordia University at the TREC 2007 QA track. In Proceedings of TREC 2007. NIST, 2007.

D. E. Rose and D. Levinson. Understanding user goals in web search. In WWW ’04: Proceedings of the 13th international conference on World Wide Web, pages 13–19, 2004.

D. Roussinov, Y. Ding, and J. A. Robles-Flores. Experiments with web QA system and TREC 2004 questions. In Proceedings of TREC 2004. NIST, 2004.

D. Roussinov, E. Filatova, M. Chau, and J. Robles-Flores. Building on Redundancy: Factoid Question Answering, Robust Retrieval and the “Other”. In Proceedings of TREC 2005. NIST, 2005.

B. Sacaleanu, G. Neumann, and C. Spurk. DFKI-LT at QA@CLEF 2007. In Working Notes for the CLEF 2007 Workshop, 2007.

B. Sacaleanu, G. Neumann, and C. Spurk. DFKI-LT at QA@CLEF 2008. In Working Notes for the CLEF 2008 Workshop, 2008.

G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

A. Savary and C. Jacquemin. Reducing Information Variation in Text. In Text- and Speech-Triggered Information Access, LNCS 2705/2003, pages 145–181, 2003.

S. Scheible. A Computational Treatment of Superlatives. PhD thesis, University of Edinburgh, 2007.

N. Schlaefer, P. Gieselmann, and G. Sautter. The Ephyra QA System at TREC 2006. In Proceedings of TREC 2006. NIST, 2006.

N. Schlaefer, J. Ko, J. Betteridge, G. Sautter, M. Pathak, and E. Nyberg. Semantic Extensions of the Ephyra QA System for TREC 2007. In Proceedings of TREC 2007. NIST, 2007.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.

D. Shen, J. L. Leidner, A. Merkel, and D. Klakow. The Alyssa System at TREC 2006: A Statistically-Inspired Question Answering System. In Proceedings of TREC 2006. NIST, 2006.

D. Shen, M. Wiegand, A. Merkel, S. Kazalski, S. Hunsicker, J. L. Leidner, and D. Klakow. The Alyssa System at TREC QA 2007: Do We Need Blog06? In Proceedings of TREC 2007. NIST, 2007.

M. M. Soubbotin. Patterns of Potential Answer Expressions as Clues to the Right Answers. In Proceedings of the TREC-10 Conference, pages 293–302. NIST, 2001.

M. Srikanth and R. Srihari. Biterm language models for document retrieval. In Proceedings of the 2002 ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.

R. Sun, J. Jiang, Y. Tan, H. Cui, T.-S. Chua, and M.-Y. Kan. Using Syntactic and Semantic Relation Analysis in Question Answering. In Proceedings of TREC 2005. NIST, 2005.

M. Surdeanu, M. Ciaramita, and H. Zaragoza. Learning to Rank Answers on Large Online QA Collections. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 719–727, 2008.

J. Surowiecki. The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Little, Brown, 2004.

N. Swartz. Definitions, Dictionaries, and Meanings. Available online http://www.sfu.ca/philosophy/swartz/definitions.htm, 1997.

A. Vallin, D. Giampiccolo, L. Aunimo, C. Ayache, P. Osenova, A. Peñas, M. de Rijke, B. Sacaleanu, D. Santos, and R. Sutcliffe. Overview of the CLEF 2005 Multilingual Question Answering Track. In Working Notes for the CLEF 2005 Workshop, 2005.

E. M. Voorhees. Evaluating answers to definition questions. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 109–111, 2003.

E. M. Voorhees. Overview of the TREC 2004 Question Answering Track. In Proceedings of TREC 2004. NIST, 2004.

E. M. Voorhees and H. Dang. Overview of the TREC 2005 Question Answering Track. In Proceedings of TREC 2005. NIST, 2005.

E. Whittaker, P. Chatain, S. Furui, and D. Klakow. TREC 2005 Question Answering Experiments at Tokyo Institute of Technology. In Proceedings of TREC 2005. NIST, 2005.

E. Whittaker, J. Novak, P. Chatain, and S. Furui. TREC 2006 Question Answering Experiments at Tokyo Institute of Technology. In Proceedings of TREC 2006. NIST, 2006.

E. W. D. Whittaker, M. H. Heie, J. R. Novak, and S. Furui. TREC 2007 Question Answering Experiments at Tokyo Institute of Technology. In Proceedings of TREC 2007. NIST, 2007.

P. Wiemer-Hastings and I. Zipitria. Rules for Syntax, Vectors for Semantics. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, 2001.

L. Wittgenstein. Philosophical Investigations. Blackwell, 1953.

L. Wu, X. Huang, L. You, Z. Zhang, X. Li, and Y. Zhou. FDUQA on TREC2004 QA Track. In Proceedings of TREC 2004. NIST, 2004.

L. Wu, X. Huang, Y. Zhou, Z. Zhang, and F. Lin. FDUQA on TREC2005 QA Track. In Proceedings of TREC 2005. NIST, 2005a.

M. Wu, M. Duan, S. Shaikh, S. Small, and T. Strzalkowski. ILQUA–An IE-Driven Question Answering System. In Proceedings of TREC 2005. NIST, 2005b.

M. Wu, C. Song, Y. Zhan, and T. Strzalkowski. UAlbany’s ILQUA at TREC 2007. In Proceed- ings of TREC 2007. NIST, 2007.

J. Xu, A. Licuanan, and R. Weischedel. TREC2003 QA at BBN: Answering Definitional Questions. In Proceedings of TREC 2003, pages 98–106. NIST, 2003.

J. Xu, Y. Cao, H. Li, and M. Zhao. Ranking definitions with supervised learning methods. In WWW2005, pages 811–819, 2005.

J. Xu, R. Weischedel, and A. Licuanan. Evaluation of an Extraction-Based Approach to Answering Definitional Questions. In SIGIR’04, pages 418–424, 2004.

H. Yang, H. Cui, M. Maslennikov, L. Qiu, M.-Y. Kan, and T. Chua. QUALIFIER In TREC-12 QA Main Task. In Proceedings of TREC 2003, pages 480–488. NIST, 2003.

Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In 14th International Conference on Machine Learning, pages 412–420, 1997.

C. Zhai and J. Lafferty. A Study of Smoothing Methods for Language Models Applied to Information Retrieval. ACM Transactions on Information Systems, 22(2):179–214, 2004.

Z. Zhang, Y. Zhou, X. Huang, and L. Wu. Answering Definition Questions Using Web Knowledge Bases. In Proceedings of IJCNLP 2005, pages 498–506, 2005.

Y. Zhou, X. Yuan, J. Cao, X. Huang, and L. Wu. FDUQA on TREC2006 QA Track. In Proceedings of TREC 2006. NIST, 2006.