Answer-Biased Summarization
Answer-biased Summarization

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

Evi Yulianti
Bachelor of Computer Science (Universitas Indonesia)
Master of Computer Science (Universitas Indonesia and RMIT University)

School of Science
College of Science, Engineering, and Health
RMIT University, Australia
June 2018

Declaration

I certify that except where due acknowledgement has been made, the work is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; any editorial work, paid or unpaid, carried out by a third party is acknowledged; and ethics procedures and guidelines have been followed.

Evi Yulianti
School of Science
RMIT University
Tuesday 26th June, 2018

Acknowledgments

First and foremost, I would like to thank Allah the Almighty, for all the blessings he has bestowed upon me so that I could complete this thesis.

Next, I would like to express my special thanks to both of my supervisors, Professor Mark Sanderson and Associate Professor Falk Scholer, for their guidance, support, and motivation throughout my study. It would not have been possible for me to complete this thesis without their kind supervision. I thank Professor Bruce Croft for his useful feedback on this work. I also thank Ruey-Cheng Chen, who has worked with me over the last two years; thank you for all your assistance and for the insightful discussions about this work.

I would like to thank my close friends who supported me during all the ups and downs in my study: Ramya Rachmawati, Shafiza Mohd Shariff, Fatimah Abdullah M Alqahtani, Ebtesam Alghamdi, Guangli Huang, Xi Chen, Husna Sarirah Husin, and Sharin Hazlin Huspi. I am truly blessed to have had all of you on the journey of my PhD study.
I also thank all of my friends who have helped me to complete this thesis.

I would like to acknowledge the support of the Indonesia Endowment Fund for Education (LPDP) for providing me with the scholarship for my PhD study. In addition, this work was also supported in part by the ARC Discovery Grant.

My deep gratitude goes to my family. To my mother, Sugiarti Johariyah; my father, Hartono; and my sisters, Novy Herviyani and Atika Indriani: thank you for your endless prayers and emotional support that kept me motivated to see this work through to the end. Finally, to my beloved husband, Yoppy Setyo Duto, thank you for your prayers, support, sacrifice, and patience in waiting for me to complete my study. All of you have motivated me to complete this work, and I dedicate this thesis to you.

Credits

Portions of the material in this thesis have previously appeared in the following publications (sorted from newest to oldest):

• Evi Yulianti, Ruey-Cheng Chen, Falk Scholer, W. Bruce Croft, and Mark Sanderson. "Ranking Documents by Answer Passage Quality." To appear in Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018. This paper proposes using quality features extracted from answer-biased summaries to improve document ranking. Some portions of this work are described in Chapter 5.

• Evi Yulianti, Ruey-Cheng Chen, Falk Scholer, W. Bruce Croft, and Mark Sanderson. "Document Summarization for Answering Non-Factoid Queries." IEEE Transactions on Knowledge and Data Engineering, 30(1), pp. 15-28, 2018. This paper proposes approaches that use related Community Question Answering (CQA) content to help with the extraction of answer-biased summaries from documents. This work forms the basis of Chapter 3.

• Evi Yulianti, Ruey-Cheng Chen, Falk Scholer, and Mark Sanderson. "Using Semantic and Context Features for Answer Summary Extraction." In Proceedings of the Australasian Document Computing Symposium, pp. 81-84, 2016. This paper presents an investigation of using semantic and context information to improve the extraction of answer-biased summaries from documents. This work is covered in Chapter 4.

• Rana Malhas, Marwan Torki, Rahma Ali, Evi Yulianti, and Tamer Elsayed. "Real, Live, and Concise: Answering Open-Domain Questions with Word Embedding and Summarization." In Proceedings of the Text Retrieval Conference, 2016. This paper presents our participation in the TREC Live Question Answering (LiveQA) track in 2016. This work is included in Chapter 6.

• Ruey-Cheng Chen, J. Shane Culpepper, Tadele Tadela Damessie, Timothy Jones, Ahmed Mourad, Kevin Ong, Falk Scholer, and Evi Yulianti. "RMIT at the TREC 2015 LiveQA Track." In Proceedings of the Text Retrieval Conference, 2015. This paper presents our participation in the TREC Live Question Answering (LiveQA) track in 2015. This work is included in Chapter 6.

• Evi Yulianti. "Finding Answers in Web Search." In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1069-1069, 2015. This paper contains the general idea and the methodology plan of using related Community Question Answering (CQA) content to generate answer-biased summaries from documents.

Other publications made during the candidature that are not directly related to this thesis (sorted from newest to oldest):

• Ruey-Cheng Chen, Evi Yulianti, and Mark Sanderson. "On the Benefit of Incorporating External Features in a Neural Architecture for Answer Sentence Selection." In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1017-1020, 2017.

• Evi Yulianti, Sharin Huspi, and Mark Sanderson. "Tweet-biased Summarization." Journal of the Association for Information Science & Technology, 67(6), pp. 1289-1300, 2016.
Contents

Contents v
List of Figures x
List of Tables xii
Abstract 1
1 Introduction 2
  1.1 Motivation 2
  1.2 Problem Statement 5
  1.3 Scope of the Thesis 7
  1.4 Research Objectives 7
  1.5 Research Questions 7
  1.6 Thesis Outline 9
  1.7 Thesis Structure 9
2 Related Work 11
  2.1 Answers in Search Results 11
  2.2 Factoid Question Answering 15
  2.3 Community Question Answering (CQA) 16
  2.4 Non-Factoid Question Answering 20
    2.4.1 Question and Answer Ranking in CQA Collections 21
    2.4.2 Answer Extraction from Web Documents 23
  2.5 Document Summarization 24
  2.6 Document Ranking 27
    2.6.1 Document-Based Scoring Using Local Collection 28
    2.6.2 Document-Based Scoring Using External Collection 29
    2.6.3 Passage-Based Scoring Using Local Collection 30
    2.6.4 Passage-Based Scoring Using External Collection 32
3 Using Community Question Answering and User Queries for Answer-Biased Summarization 33
  3.1 Introduction 33
  3.2 Data Collection 36
    3.2.1 Queries, Documents, and Ground Truth Answers 38
    3.2.2 Related CQA Content 40
    3.2.3 Quality Judgment of Related CQA Content 41
      3.2.3.1 CrowdFlower Design 41
      3.2.3.2 CrowdFlower Results 43
      3.2.3.3 Grouping Data According to the Quality of Related CQA Content 46
  3.3 Methods 48
    3.3.1 Optimization-based Methods 48
      3.3.1.1 QueryOpt (query-biased) 50
      3.3.1.2 AnswerOpt (CQA-answer-biased) 51
      3.3.1.3 ExpQueryOpt (expanded-query-biased) 52
    3.3.2 Learning-to-rank-based Method 52
    3.3.3 Baseline Methods 55
      3.3.3.1 Lead 55
      3.3.3.2 DocOpt 55
      3.3.3.3 MEAD 55
      3.3.3.4 RelSent 56
      3.3.3.5 LCA (Local Context Analysis) 57
      3.3.3.6 QL (Query Likelihood) 58
      3.3.3.7 MK 58
    3.3.4 Evaluation Metrics 59
  3.4 Experiments 59
    3.4.1 Extracting Summaries When the Related CQA Content is Available 60
      3.4.1.1 Using Mixed Quality of CQA Content 60
      3.4.1.2 Using the Individual Quality of CQA Content 63
      3.4.1.3 Document Summaries vs CQA Answers 71
      3.4.1.4 Example of Answer-biased Summaries 73
    3.4.2 Extracting Summaries When the Related CQA Content May be Unavailable 75
  3.5 Further Investigation 76
    3.5.1 Using the Key Concepts of a Query 76
    3.5.2 Using Another Dataset 78
      3.5.2.1 MSMARCO Dataset 78
      3.5.2.2 Experiments 80
      3.5.2.3 Analysis 81
  3.6 Discussion 83
  3.7 Chapter Summary 84
4 Using Semantic and Context Features for Answer-Biased Summarization 86
  4.1 Introduction 86
  4.2 Dataset 87
  4.3 Summarization Methods 88
  4.4 Experiments 92
    4.4.1 Baseline Methods 93
      4.4.1.1 Factoid QA method 93
      4.4.1.2 Query-biased Summarization method 94
    4.4.2 Evaluation Methods 94
  4.5 Result 95
    4.5.1 The Effectiveness of the Factoid QA Method 95
    4.5.2 The Effectiveness of Using Semantic and Context Features 96
    4.5.3 Correlation between Measures 97
  4.6 Analysis 99
    4.6.1 Effect of Semantic and Context Features 99
    4.6.2 Ablation Analysis 100
    4.6.3 Example of Answer-biased Summaries 101
  4.7 Further Investigation 104
  4.8 Chapter Summary 107
5 Re-ranking Documents using Answer-Biased Summaries 109
  5.1 Introduction 109
  5.2 Quality-Biased Ranking Using Document Summaries 111
    5.2.1 Retrieving Initial Document Ranking 111
    5.2.2 Generating Answer-Biased Summaries of Documents 113
    5.2.3 Extracting Quality Estimate Features from Document Summaries 114
    5.2.4 Ranking Documents using the Feature-based Linear Model 117
  5.3 Experiments 118
    5.3.1 Setup 118
      5.3.1.1 Test Collections