Content Selection in Multi-Document Summarization
Total Page:16
File Type:pdf, Size:1020Kb
University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations 2015 Content Selection in Multi-Document Summarization Kai Hong University of Pennsylvania, [email protected] Follow this and additional works at: https://repository.upenn.edu/edissertations Part of the Computer Sciences Commons Recommended Citation Hong, Kai, "Content Selection in Multi-Document Summarization" (2015). Publicly Accessible Penn Dissertations. 1765. https://repository.upenn.edu/edissertations/1765 This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/1765 For more information, please contact [email protected]. Content Selection in Multi-Document Summarization Abstract Automatic summarization has advanced greatly in the past few decades. However, there remains a huge gap between the content quality of human and machine summaries. There is also a large disparity between the performance of current systems and that of the best possible automatic systems. In this thesis, we explore how the content quality of machine summaries can be improved. First, we introduce a supervised model to predict the importance of words in the input sets, based on a rich set of features. Our model is superior to prior methods in identifying words used in human summaries (i.e., summary keywords). We show that a modular extractive summarizer using the estimates of word importance can generate summaries comparable to the state-of-the-art systems. Among the features we propose, we highlight global knowledge, which estimate word importance based on information independent of the input. In particular, we explore two kinds of global knowledge: (1) important categories mined from dictionaries, and (2) intrinsic importance of words. We show that global knowledge is very useful in identifying summary keywords that have low frequency in the input. Second, we present a new framework of system combination for multi-document summarization. This is motivated by our observation that different systems generate very different summaries. For each input set, we generate candidate summaries by combining whole sentences produced by different systems. We show that the oracle summary among these candidates is much better than the output from the systems that we have combined. We then introduce a support vector regression model to select among these candidates. The features we employ in this model capture the informativeness of a summary based on the input documents, the outputs of different systems, and global knowledge. Our model achieves considerable improvement over the systems that we have combined while generating summaries up to a certain length. Furthermore, we study what factors could affect the success of system combination. Experiments show that it is important for the systems combined to have a similar performance. Degree Type Dissertation Degree Name Doctor of Philosophy (PhD) Graduate Group Computer and Information Science First Advisor Ani Nenkova Second Advisor Mitchell P. Marcus Keywords computational linguistics, global knowledge, keyword identification, multi-document summarization, natural language processing, system combination Subject Categories Computer Sciences This dissertation is available at ScholarlyCommons: https://repository.upenn.edu/edissertations/1765 CONTENT SELECTION IN MULTI-DOCUMENT SUMMARIZATION Kai Hong A DISSERTATION in Computer and Information Science Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy 2015 Supervisor of Dissertation Co-Supervisor of Dissertation Ani Nenkova, Associate Professor, Mitchell P. Marcus, Professor, Computer and Information Science Computer and Information Science Graduate Group Chairperson Lyle Ungar, Professor, Computer and Information Science Dissertation Committee: John M. Conroy, IDA Center for Computing Sciences Sampath Kannan, Professor, Computer and Information Science Mark Liberman, Professor, Linguistics Lyle Ungar, Professor, Computer and Information Science CONTENT SELETION IN MULTI-DOCUMENT SUMMARIZATION COPYRIGHT Kai Hong 2015 To Yumeng, without whom I would never have completed this dissertation. iii Acknowledgments First, I would like to thank my advisors, Ani Nenkova and Mitch Marcus. Ani gave me innumerable feedbacks on formulating research problems, designing exper- iments, and interpreting results. Her detailed suggestions on how to write papers and give talks also shape me into as a better researcher. I owe a debt of gratitude to Mitch, who has advised me throughout my dissertation stage. His invaluable comments greatly improved the quality of the thesis, as well as my understanding of computational linguistics. Furthermore, whenever I encountered difficulties, he was always there to help. I will never forget his encouragement, guidance, integrity, and unwavering support. I would like to express my gratitude to John Conroy for being a wonderful collab- orator and teaching me about proportion test and Philadelphia pretzel; to Sampath Kannan and Mark Liberman for insightful questions that lead me to look into prob- lems from different perspectives; to Lyle Ungar for inspiring comments on research throughout my years at Penn. Chris Callison-Burch served on my WPE-II committee and gave me detailed feedback. I thank him for his encouragement and steadfast belief in my ability. I am lucky to be surrounded by many outstanding colleagues at the Penn NLP group: Houwei Cao, Anne Cocos, Mingkun Gao, Junyi Li, Constantine Lignos, Xi Lin, Annie Louis, Ellie Pavlick, Emily Pitler, Daniel Preotiuc, Andy Schwartz, Joao Sedoc, Wei Xu, Rui Yan, and Qiuye Zhao. Thanks also go to Mike Felker, Cheyrl Hickey, and other administration staff at CIS. iv I am indebted to many other people on my way of becoming a computer scientist. Eric Chang and Yan Xu were my mentors when I was doing internship at Microsoft Research Asia. I thank them for introducing me into natural language processing and collaborating on my first paper. I also enjoyed an unforgettable summer at Microsoft Research Silicon Valley, where my mentor, Dilek Hakkani-Tur, showed me what research is like in industry. I would also like to thank my other collaborators: Benoit Favre, Christian Kohler, Alex Kulesza, Hui Lin, Mary March, Amber Parker, Pengjun Pei, Junichi Tsujii, Ragini Verma, and Ye-Yi Wang. I would also like to thank my friends for their encouragements and companion. An incomplete list would be: Chen Chen, Xue Chen, Arthur Azevedo De Amorim, Chang Guo, Yu Hu, Shahin Jabbari, Chaoran Li, Wei Li, Yang Li, Sheng Mao, Salar Moarref, Hua Qiang, Mukund Raghothaman, Chen Sun, Xiaofan Tong, Fanxi Wang, Zhirui Wang, Dafeng Xu, Meng Xu, Hongbo Zhang, Mabel Zhang, and Nan Zheng. Finally, I would like to thank Yumeng Ou for being my significant other, encour- aging me through the most difficult periods during graduate school. I have been more than fortunate to share with her all the happiness and sorrow through these years, and look forward to the wonderful years that lie ahead of us. My deepest gratitude goes to my parents, Xiang Hong and Shuang Liu, who cultivated my in- terest of knowledge, nurtured my curiosity, taught me the values of determination, diligence, and integrity, and supported me no matter what happens. Without their unconditional love, I would never have become who I am today. v ABSTRACT CONTENT SELECTION IN MULTI-DOCUMENT SUMMARIZATION Kai Hong Ani Nenkova Mitchell P. Marcus Automatic summarization has advanced greatly in the past few decades. However, there remains a huge gap between the content quality of human and machine sum- maries. There is also a large disparity between the performance of current systems and that of the best possible automatic systems. In this thesis, we explore how the content quality of machine summaries can be improved. First, we introduce a supervised model to predict the importance of words in the input sets, based on a rich set of features. Our model is superior to prior methods in identifying words used in human summaries (i.e., summary keywords). We show that a modular extractive summarizer using the estimates of word importance can generate summaries comparable to the state-of-the-art systems. Among the features we propose, we highlight global knowledge, which estimate word importance based on information independent of the input. In particular, we explore two kinds of global knowledge: (1) important categories mined from dictionaries, and (2) intrinsic importance of words. We show that global knowledge is very useful in identifying summary keywords that have low frequency in the input. Second, we present a new framework of system combination for multi-document summarization. This is motivated by our observation that different systems generate very different summaries. For each input set, we generate candidate summaries by combining whole sentences produced by different systems. We show that the oracle summary among these candidates is much better than the output from the systems that we have combined. We then introduce a support vector regression model to select among these candidates. The features we employ in this model capture the vi informativeness of a summary based on the input documents, the outputs of differ- ent systems, and global knowledge. Our model achieves considerable improvement over the systems that we have combined while generating summaries up to a certain length. Furthermore, we study what factors could affect the success of system com- bination. Experiments