Arabic/English Cross-language Question Answering: Exact Set versus Snippets Set

A study submitted in partial fulfilment of the requirements for the degree of Master of Science in Information Management at THE UNIVERSITY OF SHEFFIELD

by Azzah Al-Maskari

September 2005

Abstract

Although existing information retrieval systems have succeeded in searching for relevant documents, there is a need for systems that allow users to ask a question in everyday language and receive an answer quickly and concisely. The aim of such systems is to extract a short paragraph or a sentence from a document that contains the answer to the question.

The purpose of this work is to: 1) investigate whether there is a measurable difference in task performance using exact and snippet sets in Cross-Language Question Answering (CLQA) systems; 2) judge whether there is a preference for exact over snippet retrieval or vice versa; 3) assess whether there is a difference in users' search behaviour, such as level of confidence, when they are searching exact rather than snippet sets; 4) examine the accuracy of the Systran machine translation (MT) system.

197 questions were compiled from the TREC QA track and run through the AnswerFinder question answering system. AnswerFinder was able to answer 42.6% of the questions correctly. These questions were broken down into three types: factoid, definition and list. The system answered factoid questions best (68.5%) and definition questions worst (12%). The questions were then translated from English into Arabic by the evaluator, translated back into English using Systran, and run through AnswerFinder again. AnswerFinder was able to answer 10.2% of the translated questions; the translation thus inhibited the effectiveness of AnswerFinder by 30.4%. Overall, the precision and recall of the multilingual task are 40.6% and 30%, respectively, below those of the monolingual task.

A user test experiment was conducted to investigate the participants' performance, feelings and opinions about QA tasks: 100 factoid questions were judged by 45 participants. The conclusion drawn from the test results was that participants preferred, and performed better when judging, the snippet set, even though their confidence in judging the snippet set was not much higher than for the exact set. The apparent reason for this preference is that snippets provide context, whereas the exact set provides only a single answer string.
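The evaluation pipeline summarised above can be expressed as a short sketch. The Python fragment below is a minimal illustration only, not the tooling used in the study: the helpers en_to_ar and ar_to_en stand in for the Systran round trip, and answer_is_correct stands in for judging AnswerFinder's output; all three names are hypothetical.

    from typing import Callable, Dict, List, Tuple

    def round_trip(question: str,
                   en_to_ar: Callable[[str], str],
                   ar_to_en: Callable[[str], str]) -> str:
        # English -> Arabic -> English, mirroring the translation round trip
        # performed before the questions were re-run through the QA system.
        return ar_to_en(en_to_ar(question))

    def accuracy_by_type(questions: List[Tuple[str, str]],
                         answer_is_correct: Callable[[str], bool]) -> Dict[str, float]:
        # Fraction of questions answered correctly, per question type
        # (factoid, definition, list).
        totals: Dict[str, int] = {}
        correct: Dict[str, int] = {}
        for question, qtype in questions:
            totals[qtype] = totals.get(qtype, 0) + 1
            if answer_is_correct(question):
                correct[qtype] = correct.get(qtype, 0) + 1
        return {t: correct.get(t, 0) / n for t, n in totals.items()}

On the monolingual run described above, such a computation would yield roughly 0.685 for factoid questions and 0.12 for definition questions, matching the figures reported in the abstract.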
Acknowledgements

I wish to thank Dr. Mark Sanderson for his supervision and encouragement throughout this project. I also wish to thank my family and friends for their support and encouragement. Thanks are also due to Mark Greenwood for the use of the AnswerFinder system, and to the students who undertook the user tests.

Table of Contents

CHAPTER 1 INTRODUCTION…………………………………… 8
1.1 Research Background…………………………………………. 8
1.2 Research Aims and Objectives……………………………….. 9
1.3 Research Barriers……………………………………………... 11
1.4 Dissertation Structure………………………………………… 12
CHAPTER 2 LITERATURE REVIEW…………………………... 13
2.1 Question Answering Architecture……………………………. 13
2.2 Evaluation Issues……………………………………………... 14
2.2.1 Measures for IR Systems…………………………………... 14
2.2.2 Measures for QA Systems…………………………………. 15
2.3 Interactive Cross-Language Question Answering System…… 15
2.4 AnswerFinder System………………………………………... 17
2.5 Systran Machine Translation System………………………… 18
2.6 Some Important Characteristics of the Arabic Language……. 19
2.7 Machine Translation………………………………………….. 20
2.7.1 Machine Translation Evaluation Effort…………………….. 21
2.7.2 Evaluation Methodology…………………………………… 22
2.7.2.1 Evaluation of the Lexicon………………………………... 22
2.7.2.2 Evaluation of Grammatical Correctness………………….. 23
2.7.2.3 Evaluation of Semantic Correctness……………………... 24
2.7.2.4 Evaluation of Pragmatic Correctness: Pronouns and Case Endings……………………………………………………………. 24
2.8 Summary……………………………………………………… 26
CHAPTER 3 METHODOLOGY…………………………………. 27
3.1 Experimental Procedure and Design of the User Tests………. 27
3.2 Experimental Procedure and Design to Evaluate Systran……. 28
CHAPTER 4 SYSTRAN EVALUATION……………………….. 31
4.1 Factoid Questions…………………………………………….. 31
4.2 Definition Questions…………………………………………. 32
4.3 List Questions………………………………………………… 33
4.5 Evaluation Scheme: Effectiveness of Systran………………... 34
4.5.1 Grammatical Coverage of Systran…………………………. 34
4.5.2 Lexical Semantics of Systran………………………………. 36
4.5.3 Resolution of Pronouns and Case Endings…………………. 36
CHAPTER 5 EVALUATION SCHEME: EFFECTIVENESS OF THE ANSWERFINDER QA SYSTEM…………………………... 39
5.1 Factoid Questions…………………………………………….. 39
5.2 Definition Questions…………………………………………. 42
5.3 List Questions………………………………………………… 43
5.4 Overall Performance of AnswerFinder………………………. 43
5.5 The Effectiveness of AnswerFinder Combined with Systran's Translation Results………………………………………………... 44
CHAPTER 6 USER TEST EXPERIMENT………………………. 48
6.1 Experimental Environments………………………………….. 48
6.2 Experimental Conditions……………………………………... 49
6.3 Participants' Profile…………………………………………… 51
6.4 Test Materials………………………………………………… 52
6.4.1 Background Questionnaire…………………………………. 53
6.4.2 Test Information and Test Instructions…………………….. 53
6.4.3 Orientation Questions………………………………………. 53
6.4.4 Task List……………………………………………………. 54
6.4.5 Post-Test Questionnaire…………………………………….. 55
6.4.6 Post-Test Interview…………………………………………. 55
6.5 Evaluator Role………………………………………………… 55
6.6 Pilot Test……………………………………………………… 56
6.7 Scenario for Operating the User Test………………………… 57
6.8 User Interface…………………………………………………. 58
6.9 Summary……………………………………………………… 59
CHAPTER 7 PRESENTATION AND ANALYSIS OF THE EXPERIMENTAL RESULTS…………………………………….. 60
7.1 Introduction…………………………………………………… 60
7.2 Evaluation Measures………………………………………….. 60
7.3 The Test Results………………………………………………. 61
7.3.1 Accuracy of the Participants in Identifying the Correct Answer……………………………………………………………. 61
7.3.2 Rapidity of the Judgments………………………………….. 64
7.3.3 Confidence of the Participants in Their Judgments………… 65
7.3.4 Effects of Question Familiarity and Cultural Bias on the Participants' Performance…………………………………………. 66
7.3.5 Aid of the Snippet in Identifying the Correct Answer……… 68
7.3.6 Number of Answers Preferred by the Participants…………. 70
7.3.7 Users' Familiarity with the Questions and Its Relevance to Their Judgments…………………………………………………… 71
7.3.8 Opinions and Comments of the Participants on Their Tasks……………………………………………………………… 73
7.3.9 Conclusions from the User Test Results……………………. 74
7.3.10 Summary…………………………………………………... 76
CHAPTER 8 CONCLUSION AND FUTURE RESEARCH……... 78
8.1 Conclusions…………………………………………………… 78
8.2 Objectives Achieved………………………………………….. 79
8.3 Future Research……………………………………………….. 82
BIBLIOGRAPHY…………………………………………………. 85
APPENDIX………………………………………………………... 94

CHAPTER 1 INTRODUCTION

1.1 Research Background

The increasing interest in providing various kinds of information on the Web has heightened the need for more sophisticated search tools to locate the desired information.
Although existing information retrieval (IR) systems have succeeded in searching for relevant documents, there is a need for systems that allow users to ask a question in everyday language and receive an answer quickly and concisely. The aim of such systems is to extract a short paragraph or a sentence from a document that contains the answer to the question.

Question Answering (QA) systems have been evaluated over the last six years at the Text Retrieval Conference (TREC) campaigns. The TREC QA tracks have steadily evolved over the years: not only factoid questions, but also new types of questions, such as how-questions, list questions and definition questions, have been considered. The ability of QA systems to return exact answers instead of longer text as output has also been addressed.

Maybury (2002) emphasizes that multilinguality represents a new area in QA research and a challenging issue in the development of more complex systems. The first multilingual QA track at CLEF took place in 2003. Multilinguality enables the user to pose a query in a language that is different from the language of the reference corpus.

This study focuses on passage and document retrieval. Many passage retrieval techniques have been described in the context of improving the performance of document retrieval, e.g. Salton (1993) and Callan (1994). Tellex et al. (2003) present a quantitative evaluation of various passage retrieval algorithms and explore their relationship to document retrieval. Passages form a natural unit of response for QA systems; Lin et al. (2003) show that users prefer a paragraph-sized text over an exact phrase as the answer to their questions, because paragraph-sized text provides context. Users' preference for paragraph-sized text is also due to the presence of pronouns, since sentences with pronouns taken out of context often cannot be meaningfully interpreted. Moreover, López-Ostenero et al. (2004) find that users prefer passage retrieval to document retrieval, because passage retrieval is faster and simpler, while translated documents are noisy and discouraging. However, document context is sometimes useful to verify the correctness of an answer. Tellex et al. (2003) assert that the quality of the passage retriever depends in part on the quality of the document retriever.

1.2 Research Aims and Objectives

Cross-Language Information Retrieval (CLIR) research in 2003 and 2004 was concerned mainly with European languages. For example, Figuerola et al. (2004) explore the effects of interaction with Spanish users in suggesting