Using Linguistically Motivated Features in Document Retrieval for Question Answering

Luiz Augusto Sangoi Pizzato
M.Sc., B.CompSc.

This thesis is presented for the degree of Doctor of Philosophy
Department of Computing
Faculty of Science
Macquarie University
July 2009

Contents

Abstract
Publications Arising from this Work
List of Acronyms
1 Introduction
  1.1 Information Retrieval (IR)
  1.2 Question Answering (QA)
  1.3 Research Question
  1.4 Thesis Organisation
2 Literature Review
  2.1 Information Retrieval
    2.1.1 Retrieval Models
      2.1.1.1 Vector Space Model
    2.1.2 Evaluation
      2.1.2.1 Evaluation Framework
      2.1.2.2 Metrics
  2.2 Question Answering
    2.2.1 QA Tracks at NIST
    2.2.2 Evaluation Metrics
    2.2.3 QA Framework
      2.2.3.1 Question Analysis
      2.2.3.2 IR and Passage Retrieval
      2.2.3.3 Answer Extraction
    2.2.4 QA and Linguistic Resources
  2.3 Information Retrieval for Question Answering
    2.3.1 IR as Offline QA
    2.3.2 Query Modification
    2.3.3 IR Models for QA
    2.3.4 Evaluation Metrics
  2.4 Related Work
    2.4.1 Linguistically Motivated Retrieval for QA
      2.4.1.1 Dependency Relations
      2.4.1.2 Semantic Roles
    2.4.2 Semantic Role Labelling
      2.4.2.1 Semantic Triples
    2.4.3 Structured Retrieval
      2.4.3.1 Index Representation
      2.4.3.2 Vector Space Extension
  2.5 Concluding Remarks
3 Pseudo-Relevance Feedback
  3.1 Trie Classifier
  3.2 Relevance Feedback Using Named Entities
    3.2.1 Implementation
  3.3 Experiments and Evaluation
    3.3.1 Results
  3.4 Concluding Remarks
4 Linguistically Motivated Indices
  4.1 Representing Linguistic Information in IR
  4.2 Inverted Files
  4.3 Multi-layer Inverted File
    4.3.1 Layer Representation
    4.3.2 Representing Different Index Types
  4.4 Using the Multi-layer Index
    4.4.1 Vector Space Ranking
    4.4.2 Triple-Vector Space Ranking
  4.5 Performance Evaluation
    4.5.1 Disk Usage
    4.5.2 Speed Performance
  4.6 Concluding Remarks
5 Question Prediction Language Model (QPLM)
  5.1 Model Definition
    5.1.1 Directional Semantic Relations
    5.1.2 Natural Language Question Generation
    5.1.3 Comparison with Existing Work
  5.2 Using QPLM
    5.2.1 A Partial QA System
      5.2.1.1 QPLM for Question Analysis
      5.2.1.2 QPLM for Answer Extraction
      5.2.1.3 QPLM in IR
  5.3 Building QPLM
    5.3.1 From PropBank to QPLM
    5.3.2 Rule Learning
    5.3.3 Applying QPLM
  5.4 Concluding Remarks
6 Evaluation
  6.1 Experimental Setup
    6.1.1 Corpus and Question Set
      6.1.1.1 Issues with the Evaluation Patterns
      6.1.1.2 Towards a Better Evaluation
    6.1.2 Evaluation Metrics and Statistical Significance
    6.1.3 IR Framework
    6.1.4 QA Systems
    6.1.5 Overall Experiment Framework
  6.2 Results
    6.2.1 IR Results
    6.2.2 QA Results
    6.2.3 Model Comparison
      6.2.3.1 Comparison with Bag-of-Words
      6.2.3.2 Comparison with Syntactic Relations
      6.2.3.3 Comparison with Semantic Role Labelling
  6.3 Concluding Remarks
7 Final Remarks
  7.1 Future Work
  7.2 Thesis Contributions
A Tupi Framework
  A.1 Representing Relations in IR
  A.2 Tupi Framework
    A.2.1 RAMFile
    A.2.2 TrieFile
    A.2.3 IUnit
    A.2.4 InvertedFile
    A.2.5 DocumentList
    A.2.6 IUnitBuilder
    A.2.7 Indexer
    A.2.8 Retriever
  A.3 Jemu System
    A.3.1 Implementation
    A.3.2 Implementation of IUnit
    A.3.3 Implementation of IUnitBuilder
    A.3.4 Implementation of RankingBuilder
  A.4 Concluding Remarks
B Question Sets
  B.1 List of Trustable Questions
  B.2 List of Self-Contained Questions
Bibliography

Abstract

This thesis investigates the impact of using linguistic features in the Information Retrieval (IR) stage of Question Answering (QA) systems. We hypothesise that techniques that are commonly used in the final answer extraction stage can improve the overall results of a QA system when adopted in the earlier IR stage. In particular, we study the use of the following information: i) named entities in a pseudo-relevance feedback process; and ii) semantic relations between words of questions and text sentences.

The study of the use of named entities is inspired by the common practice of filtering out sentences that do not contain the expected answer type. We consequently introduce a pseudo-relevance feedback that inserts entities of the correct answer type in the original query. Our experiments show that this technique leads to a query drift and the final results do not improve with respect to a query without additional feedback.

To study the use of relational information, we design an IR framework that is more efficient (in both speed and memory consumption) than standard approaches based on relational databases and on the concatenation of word pairs at the indexing stage.
The resulting framework allows a multi-layer index that uses an extension to the standard vector space model as a ranking strategy. The resulting ranking strategy improves precision, without compromising the overall recall, by including linguistic word relations.

We present the QPLM, a model of relational information that borrows concepts from Semantic Role Labelling (SRL) but is designed for the fast generation of annotation and its use for indexing and retrieval. The results are of quality comparable to SRL and indicate that linguistic information encoded in the form of semantic relations does enhance the retrieval quality of text and the final accuracy of QA systems.

Statement of Candidate

I certify that the work in this thesis entitled “Using Linguistically Motivated Features in Document Retrieval for Question Answering” has not previously been submitted for a degree nor has it been submitted as part of requirements for a degree to any other university or institution other than Macquarie University.

I also certify that the thesis is an original piece of research and it has been written by me. Any help and assistance that I have received in my research work and the preparation of the thesis itself have been appropriately acknowledged.

In addition, I certify that all information sources and literature used are indicated in the thesis.

Signature: Luiz Augusto Sangoi Pizzato - 40452239
Sydney, 20 July, 2009

Acknowledgements

I would like to acknowledge that this thesis would not have been possible without the financial assistance of the iMURS scholarship from Macquarie University and CSIRO, and without the scholarship from the ICS department in this last year.

I would like to express my gratitude to my supervisor Dr. Diego Mollá for the guidance throughout all these years and for giving me the motivation needed to complete the hard work involved in a PhD thesis. I am highly indebted to him for his support and belief in me.
I would like to thank my co-supervisors Dr. Cécile Paris, who with her expertise provided me with invaluable feedback that helped to shape this research, and Dr. Rolf Schwitter, who supervised me during Diego’s sabbatical and gave me important suggestions for this final document.

I would like to say muito obrigado to my family, who, even though they were geographically distant for the whole duration of my studies, were never far away from my heart.

Finally, I would like to express my greatest thanks to my wife Joanne. She is the most important person in my life, whose love and support gave me the emotional balance so much needed during all the ups and downs of the PhD. Joanne is a superwoman, who not only worked full-time to support the two of us during this final year, but also made time to read and re-read all the thesis drafts for me. Joanne, if I could I would give you anything in the universe, but the only thing that I have is my love for you and this thesis, which I dedicate to you.

Publications Arising from this Work

The research described in this thesis produced a number of peer-reviewed publications. The following publications detail the trie-based classification method described in Chapter 3.

(i) Luiz Augusto Sangoi Pizzato. Using a trie-based structure for question analysis. In Proceedings of the Australasian Language Technology Workshop 2004, pages 25–31, Macquarie University, Sydney, Australia, 2004. ASSTA.

(ii) Menno van Zaanen, Luiz Augusto Pizzato, and Diego Mollá. Question classification by structure induction. In Proceedings of the International Joint Conferences on Artificial Intelligence, Edinburgh, UK, pages 1638–1639. 2005.

(iii) Menno van Zaanen, Luiz Pizzato, and Diego Mollá. Classifying sentences using induced structure. In Proceedings of the Twelfth Edition of the Symposium on String Processing and Information Retrieval (SPIRE2005), Buenos Aires, Argentina, pages 139–150. 2005.
The following publication describes the QA system named MetaQA, which is used for the evaluation of the retrieved set of documents in Chapter 6.

(iv) Luiz Augusto Pizzato and Diego Mollá. Extracting exact answers using a meta question answering system. In Proceedings of the Australasian Language Technology Workshop 2005, pages 105–112, Sydney, Australia, 2005.

The following publication describes the pseudo-relevance feedback technique detailed in Chapter 3.
