Enhancing Summaries with Conceptual Spaces

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (by Research) in Computer Science and Engineering

by

Jayant Gupta
200802018
[email protected]

Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
October 2013

Copyright © Jayant Gupta, 2013
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Enhancing Summaries with Conceptual Spaces” by Jayant Gupta, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                Adviser: Prof. Vasudeva Varma

To Curiosity

Acknowledgments

First and foremost, I wish to thank Prof. Vasudeva Varma for being my advisor and guide in my research work. His presence gave me support, his advice gave me direction and his belief gave me the motivation to pursue research with utmost dedication. I thank Sudheer Kovelamudi for advising me in the initial stages of my research work. I thank Aditya Mogdala, Kushal Dave, Sambhav Jain and Nikhil Priyatam for giving me valuable feedback and guidance during the most critical times of my research. I thank Riayz Ahmad Bhat, who helped me develop my writing skills. I thank Sarvesh Ranade, with whom I worked and had a great learning experience during the final stages of my research. I thank my batchmates for the great times spent together during the course of my stay at IIIT. I thank Akshay Mani Agarwal for being a brother when I needed one. I thank the Sports Fraternity of IIIT, especially Kamalakar sir, for helping me nurture my passion for sports. I thank all the members of the SIEL lab, especially Ajay Dubey and Harshit Jain, for making the period of research joyous and satisfying. Finally, I thank my parents for their support and their faith in me during my research. They gave me freedom and stood by my decisions. In the end, their patience with me helped me do justice to my research work and give it quality.

Abstract

Library science is the predecessor of present-day information retrieval (IR) technology. For decades, libraries were the source of information and knowledge for one and all. A library meant a mammoth structure, home to thousands of books and journals. People travelled from faraway lands to the great libraries, where they learned and later contributed, and in turn became sources of information for others. In present-day society we find a paradigm shift. Today, thousands of books can be stored on a handheld device or a personal computer, like our own personal library. Furthermore, the Internet has made knowledge easily accessible to every person. With time, Internet services have matured and people have become more comfortable sharing and contributing content through this medium. This has resulted in multiple sources of information on any single topic, with no restriction on the language of each source. So, we now have many sources in multiple languages.

This has led to a whole new set of problems that need to be solved by the IR community. The main focus of these problems is the management of vast amounts of information: figuring out what methods can be used to understand and impart structure to the information. Furthermore, individual needs play an important role in deciding information management strategies. The focus has thus shifted from getting information to getting the right information. This work is a step in that direction. We address the problem of Text Summarization and its multi-lingual solution. Although text summarization is a relatively old problem, the Internet age has given it a new direction and importance. Therefore, summarization methods need to be improved and novel solutions are needed.

In our methodology, we initially focus on moving from a heuristic based representation of text to a meaningful one. We have used Hyperspace Analogue to Language (HAL) to represent the text; it is a computational model based upon a cognitive model called Conceptual Spaces. The properties of conceptual spaces allow us to represent words and sentences in the same space, called the HAL space. We then model the problem of summarization as selecting the set of sentences which can represent the source text in the most meaningful manner. To handle redundancy in summaries we propose a novel mechanism which is effective in the HAL space. Our method is language independent, making it scalable over different languages. We provide useful insights into the formation of a conceptual space using textual examples, and into the behavior of our metrics using intrinsic experiments. Intrinsic experiments and extrinsic evaluation were conducted on the DUC 2001 and DUC 2002 datasets. The results of extrinsic evaluation show that the quality of summaries is preserved over summary size, and that the system outperforms previous state-of-the-art systems for longer summaries while being comparable for shorter summaries.

Multilingual summarization is a relatively new field in text summarization. We focused on studying two of its aspects: first, "added noisy information" (related to the number of languages of the source documents) and second, the suitability of monolingual summarizers in a multilingual domain. For our work, we use automatic translation systems along with four generic summarizer systems (including CMDS). These summarizers are used to generate monolingual summaries (separately) in different languages. The quality of a summary (for each language) is obtained by the Jensen-Shannon divergence between the summary distribution and the input distribution. To form a multilingual summary, weights proportional to the quality are used to combine the monolingual summaries. This work covers three languages, namely English, Hindi and Telugu. The experimental results are encouraging and show that as the number of interacting languages increases, the quality of multilingual summaries improves. We also find that, compared to structural methods, contextual methods are more suitable for the task of multilingual summarization.

Finally, to show that HAL features are effective for summarization tasks other than generic summarization, we use them as one of the key features to form summaries of on-line conversations in the domain of debates. The experimental results (ROUGE scores) show that our summaries are better than those of the previous state-of-the-art system. One major difference between our approach and the previous approach was the use of HAL features to create summaries. This shows that adding HAL features to sentiment related features helps to summarize sentiment rich text.

To conclude, we explain the need for a meaningful representation of text to improve summary quality. Our work establishes HAL as a quality representation of text, useful for the task of summarization. We also give a summary formation conjecture, and the summaries thus formed are highly effective, improving as the summary size increases. We also show that multilingual summarization is not only needed but is useful in solving the problem of information overload. Our work brings out various challenges involved in the task of multilingual summarization, especially the evaluation of multilingual summaries. This work adds the component of multilingual summarization to the solution of information overload.

Contents


1 Introduction
  1.1 Generic Text Summarization
  1.2 Multilingual Summarization
  1.3 Evaluation of Summaries
  1.4 Problem Description
  1.5 Overview of our approach
    1.5.1 Generic Summarization
    1.5.2 Multilingual Summarization
    1.5.3 Summarization of Online Conversations in the domain of Debates
  1.6 Contributions of this work
  1.7 Thesis Organization

2 Related Work
  2.1 Types of Summarization
  2.2 Generic Summarization
    2.2.1 Feature based methods
    2.2.2 Graph based methods
    2.2.3 Lexical chain based methods
    2.2.4 Other relevant methods
    2.2.5 HAL based methods
  2.3 Multilingual Summarization
  2.4 Summarization of Online Conversations in the domain of Debates
  2.5 Summary Evaluation
    2.5.1 ROUGE
      2.5.1.1 ROUGE-N
      2.5.1.2 ROUGE-L
      2.5.1.3 ROUGE-SU*
    2.5.2 Jensen-Shannon divergence
  2.6 Concluding Remarks

3 Multi-Document Summarization Using Conceptual Spaces
  3.1 Motivation of our Approach
  3.2 Text Representation Overview
  3.3 Conceptual spaces as a representative model
    3.3.1 Gärdenfors' Conceptual Spaces
    3.3.2 Forming Conceptual Spaces using HAL
    3.3.3 Sentences in Conceptual Space
  3.4 Conceptual Multi-Document Summarization (CMDS)
    3.4.1 Principle
    3.4.2 Metrics
    3.4.3 Redundancy Removal
  3.5 Experimental Setup
  3.6 Results and Discussion
    3.6.1 Intrinsic Experiments
      3.6.1.1 Effect of variable window size
      3.6.1.2 Effect of variable metrics
    3.6.2 Extrinsic Evaluation
  3.7 Summary and Conclusion

4 Multilingual Multidocument Text Summarization
  4.1 MultiLingual Summarization using Jensen-Shannon Divergence
    4.1.1 Translation
    4.1.2 Generic Summarizers
    4.1.3 Jensen-Shannon (JS) Divergence
    4.1.4 Final Summary
      4.1.4.1 Redundancy Removal
  4.2 Dataset and Evaluation Metric
  4.3 Experiments
  4.4 Results and discussion
  4.5 Conclusion and Future Work

5 Summarization of Online Conversations in the domain of Debates
  5.1 Approach Used
    5.1.1 Calculating Topic Relevance
      5.1.1.1 Topic Directed Sentiment Score
      5.1.1.2 Topic Co-occurrence Measure
    5.1.2 Calculating Document Relevance
    5.1.3 Calculating Sentiment Relevance
    5.1.4 Positional and Coverage Relevance
      5.1.4.1 Sentence Position
      5.1.4.2 Sentence Length
    5.1.5 Calculating Relevance of a Dialogue Act
  5.2 Experimental Setup
  5.3 Results and Discussion
  5.4 Conclusion and Future Work

6 Conclusions

Bibliography

List of Figures


3.1 Concept combination in a 3-dimensional conceptual space where the combined concept is more refined
3.2 Schematic overview of the complete system
3.3 Representation of documents and summary in a 3-dimensional conceptual space
3.4 Summary quality v/s window size (K)
3.5 Summary quality v/s xth root of rank metric
3.6 Summary quality v/s yth root of weight metric
3.7 Graphical representation of ROUGE scores for all the systems
3.8 Final CMDS Summary for DUC2001: d15c

4.1 Added Noisy Information
4.2 Architecture of the system

5.1 ROUGE-2 (Average F-measure) scores v/s Summary Size (in words)

List of Tables


2.1 Types of Summarization

3.1 Average F-measure (ROUGE-2) scores for various state-of-the-art systems

4.1 JS Divergence of monolingual summaries
4.2 JS Divergence of bilingual summaries
4.3 JS Divergence of trilingual summaries

5.1 Argument Structure Examples
5.2 List of Dependency Relations
5.3 Statistics of the dataset
5.4 ROUGE Scores (Average F-measure) of System Summaries (1000 words)

Chapter 1

Introduction

The Internet provides a pool of knowledge where information on any topic is present in abundance. With the evolution of the Internet, the accessibility of information has increased and people are becoming comfortable with digital information. They have started contributing by means of social blogs, articles and on-line social media. Moreover, the Internet has (virtually) dissolved the boundaries of nations and languages. People with varied linguistic preferences are accessing the web and contributing in their preferred language. This accessibility is leveraged by advancements in information retrieval techniques.

Easy access to such large amounts of information leads to the problem of information overload. According to Spier et al. [60], information overload occurs when the amount of input to a system exceeds its processing capacity. Decision makers have fairly limited cognitive processing capacity; consequently, when information overload occurs, a reduction in decision quality is likely. Therefore, quality text tools which can help in the management of information have become an important need of modern information retrieval systems. In real applications, Google News 1 shows small snippets that help readers decide whether a news item is important enough to read. Modern search engines like Google and Bing 2 also show info-boxes, which are a small window onto the complete search results. For many users the info-box may suffice, fulfilling their information need with less information. This also means less unwanted information for the user. Thus, any tool which can give a shorter yet accurate description helps to manage information.

According to Wikipedia 3, India ranks 3rd in the number of Internet users, after the United States (2nd) and China (1st). Ironically, India ranks 164th in Internet penetration (12.6%), way behind the United States (28th) and China (102nd). Presently there are 22 languages recognized by the Constitution of India. According to the 2001 Census of India, 10.35% of the total Indian population were English speakers. The 2005 India Human Development Survey (from surveyed households) reported that among men 72% do not speak English, 28% speak at

1 http://www.news.google.com
2 http://www.bing.com
3 http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

least some English, and 5% are fluent. Among women, the corresponding proportions were 83%, 17% and 3%. Low penetration combined with a large number of users suggests vast future opportunities for web based services in India. These statistics show that an increase in Internet penetration is required. The physical requirements (infrastructure, fiber cables, wireless connectivity and devices) for this penetration are outside the scope of the Information Retrieval community. However, our major concern is the services which can be provided once connectivity is improved in remote areas. Web based services which cater to the local linguistic requirements of these people will be in demand, and tools which are scalable over multiple languages will be required for information management purposes. The use of multiple languages adds the multi-lingual aspect to the problem of information overload. This work focuses on the field of Text Summarization and its multi-lingual solution. Summarization is one of the key fields in the information retrieval domain and is used to manage large sets of information.

1.1 Generic Text Summarization

Generic text summarization refers to the computational generation of summaries which cover the most important points of the source document(s). An efficient summary gives a succinct, non-redundant overview of the documents without expanding on specific details. Automatic text summarization is a complex and challenging area and significant research has been done in it. Previous work can be categorized into different types depending on the way the summaries are generated. Some of these include extractive vs. abstractive, single document vs. multi-document, language specific vs. multi-lingual, query dependent vs. query independent, supervised vs. unsupervised, etc. [43]. Principally, unsupervised summarization approaches are broadly classified into graph based, feature based and lexical chain based approaches [25]. Graph based approaches [25, 61, 11] depend upon the rationale that similar sentences should contain identical words. Feature based approaches [10, 32] depend upon all the characteristics which can be used to distinguish two textual entities. Lexical chain based approaches [3, 67, 34] create lexical chains using available knowledge sources (like WordNet [13]).

Over the years, the problem has been modeled in various forms, resulting in different methods to solve it. Initial approaches were based upon sentence extraction; later approaches incorporated various language specific features, which made the summaries more robust. Advancements in natural language generation also allowed automatic sentence creation, which led to abstractive summarization techniques. Witbrock et al. [64] use extraction to obtain important summary words and then use a bi-gram language model to form sentences. Other approaches shorten sentences using sentence reduction rules. Knight et al. [30] use expectation maximization to compress the syntactic parse tree of a sentence; the tree is used to produce a shorter but grammatically consistent version of summary sentences.

During this whole period, the means of accessing information changed with the introduction of the world wide web in the early 90's. This led to a renewed emphasis on the problem of text summarization to tackle

the problems of information overload. With the web came different variants (social media based, etc.) of use-cases where a summarization system could be employed. Accordingly, summaries were created and summarization methods evolved. A series of highly successful summarization meetings have been held in the past. Amongst them, TAC 4 (Text Analysis Conference) has been the main evaluation forum for research in text summarization. It was previously known as the Document Understanding Conference (DUC) and began in the year 2000. Various summarization tasks, ranging from non-extractive summarization, spoken language (including dialogue) summarization, language modeling for text and speech summarization, multi-document and multilingual summarization, integration of question answering and text summarization, and web-based summarization, to evaluation of summarization systems, were worked upon during the course of the DUC/TAC workshops. This resulted in a wide range of high quality generation and evaluation methods. The datasets used to evaluate the systems are often used as benchmarks to evaluate any given summarization system.

1.2 Multilingual Summarization

The Internet is accessible to all, irrespective of language. This has resulted in an extensive availability of textual data with linguistic diversity. Reading through all this information spread across languages is difficult, so an efficient way to summarize information distributed in multiple languages is needed. Multilingual text summarization is the problem of producing summaries in a language T when the input contains documents in a language S different from T along with documents in language T, or when the input to the summarizer consists of automatic translations into language T of documents in language S [54]. This is a challenging problem because summaries produced from automatic translations, using noisy input, suffer from problems additional to the lack of cohesion and coherence usually reported in text summarization research [43].

Extractive summarization requires scoring sentences based on their importance. Scoring is done using various (language independent) features like term distribution, frequency patterns, position of the sentence, length of the sentence, sentence similarity, etc. These features are effective when all the text is in one language. However, additional features, especially word level features, are required if the text contains words in different languages [6]. A major problem is to identify words which have similar meanings in different languages. The most likely solution is to use language specific tools to translate and transliterate the text. In this case, the accuracy of the translation and transliteration systems becomes a critical issue. Furthermore, the availability of these tools for languages with fewer resources is an additional problem.

Most of the previous approaches use clustering and translation of documents to form the summary. The basic idea of these techniques [5, 12, 6] is to collect similar information together using clustering techniques for every language. Then, similar clusters across different languages are found by translating the clusters in one language and identifying the corresponding cluster in another language. The final output is produced in

4 www.nist.gov/tac

the user's desired language by substituting all the sentences in different languages with similar sentences in the required language. In the past, TAC organized a pilot task 5 related to multi-lingual summarization. It provided news documents and their corresponding (human) translations in 8 different languages. The task required the systems to be able to summarize the documents in at least 3 different languages (independently of each other) with acceptable accuracy. The task was not itself multi-lingual summarization, but was framed around the basic idea that a good generic summarizer must be able to produce summaries in different languages with acceptable accuracy.

1.3 Evaluation of Summaries

Summary evaluation is an important part of the summarization field. Evaluation is difficult primarily because there is no ideal summary as such. Past studies [28] have shown that human summarizers tend to agree only about 60% of the time, and that in only 82% of the cases did humans agree with their own judgement. Apart from the human bias involved in the evaluation of summaries, such manual evaluation is also expensive and time consuming. There is always a possibility of the system generating a better summary that is different from the reference human summary used as an approximation to the ideal output.

Automatic evaluation methods are of two types: the first evaluates summaries using human models, the second evaluates without human models. Comparison with human (model) summaries to evaluate informativeness has been the more popular approach. For various summarization tasks in TAC, system summaries are evaluated using ROUGE [35] scores. ROUGE stands for Recall Oriented Understudy for Gisting Evaluation. ROUGE measures summary quality by counting overlapping units such as n-grams, word sequences and word pairs between system summaries and human model summaries. Overlap based evaluation methods usually suffer from the problems of human variability, analysis granularity and semantic equivalence [47]. The variable unit 6 sizes (to be compared) in ROUGE address the problem of analysis granularity. The problems of semantic equivalence and human variability are addressed by using multiple human summaries to evaluate system summaries.

Evaluating summaries without human models is relatively new in the field of summary evaluation. It is often thought unreliable to evaluate summaries without gold summaries. However, there are instances where generating human summaries can be a bigger challenge, especially in multi-lingual summarization. The challenge is that the annotator should be proficient in all the (source document) languages for which the summary is being generated, so both the expense and the knowledge requirement of creating manual gold summaries increase. Louis et al. [38] proposed the Jensen-Shannon divergence metric to evaluate summaries without human models. This measure was found to be highly effective in measuring the quality of summaries and showed high correlation with ROUGE scores [55].

5 http://www.nist.gov/tac/publications/2011/presentations/Summarization2011 MultiLing overview.presentation.pdf
6 A unit can be a word, collection of words, phrases, or sentence.

1.4 Problem Description

Extractive summarization is the task of building a concise excerpt of a given set of documents on the same topic. The summary should be able to convey the sense of the complete document(s) and avoid redundancy. Furthermore, the input documents can be in different languages while retaining their relevance to the common topic. Our task is to build a summarization method based on a rich, informative and meaningful text representation. The representation should be language independent, yet effective on different languages. We then extend the system to perform multi-lingual summarization, and show that the basic features of the representation can be used to leverage the quality of a different summarization process.

1.5 Overview of our approach

Creating a summary without any inference makes poor and worthless use of its source documents. A summary should retain the overall sense as well as convey the inference of its source documents. Earlier approaches to summarization lacked inferential properties, depending mostly on heuristic representations. This led to content rich sentences whose combination could not convey the same inference as the original source documents. In our approach we model the inferential properties of the text and build a robust summarization system using this representation. The detailed study of the representation and the summarization method comes under generic summarization. The next step uses this system as a part of multilingual summarization; in this part we compare our system to other systems that are based on different representations. In the final stage of the problem we worked upon a specific summarization task, summarizing conversations in the domain of on-line debates, using the basic feature of our text representation.

1.5.1 Generic Summarization

In our method we have used Hyperspace Analogue to Language (HAL) [58] to represent text (words and sentences). HAL is formed by capturing co-occurrence patterns across the text, limiting the size of the patterns by a window size. All the patterns are accumulated in a W × W matrix (W is the number of distinct words in the dataset). Each cell wt_ij of the matrix represents the contextual strength between words i and j, and each row is a vector in the conceptual space. The HAL representation allows the creation of new points (senses) in the same space by combining existing points. This property is used to form sentence vectors in the same space. The representation of sentence vectors can convey the context of any sentence and is highly unique, which can be used to disambiguate two similar sentences. Sentences are ranked based on the number and strength of the senses they convey. A summary formed in this manner carries a sense similar to that of the set of source documents. To address the redundancy issue, each time a sentence is selected we re-rank the remaining sentences based on the senses not present in the selected sentence. This increases summary coverage and removes redundancy in the

summary. Our results show that using inferential information leverages the quality of summaries, which are an improvement over previous state-of-the-art systems.
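To make the construction concrete, the sketch below builds a HAL-style co-occurrence matrix over a token stream. The linearly decaying weights follow the standard HAL scheme; the function and variable names are illustrative, not taken from our implementation.

```python
from collections import defaultdict

def build_hal(tokens, k=5):
    """Build a HAL-style co-occurrence matrix.

    For each word, every word in the preceding window of size k
    co-occurs with it; closer neighbours get a higher weight
    (k - distance + 1), as in the standard HAL scheme.
    Returns a dict-of-dicts: hal[w1][w2] = accumulated strength.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, k + 1):
            j = i - d
            if j < 0:
                break
            # weight decreases linearly with distance inside the window
            hal[word][tokens[j]] += k - d + 1
    return hal

tokens = "the summary conveys the sense of the source documents".split()
hal = build_hal(tokens, k=3)
print(dict(hal["sense"]))  # context strengths for the word "sense"
```

Each row of the resulting matrix is the word's vector in the HAL space; sentence vectors are then derived by combining these word vectors, as described above.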

1.5.2 Multilingual Summarization

A framework has been created in which generic summarizers can be used to accomplish the task of multilingual summarization. The framework is mainly used to study two aspects of multilingual summarization: the effect of added noisy information on the summaries, and the type of methods that are more suitable for the task. In the context of summarization, added noisy information refers to the process where potentially relevant information, containing syntactic errors caused by the translation step, is added to the text to be summarized. In our approach we have used the on-line machine translation system by Google 7. The documents in a given language (say T) are translated to all the other languages (say set S) for which summarization is required. So, each of the languages has translated documents from the other languages (referred to as "added noisy information"). Then, we use our generic summarizer along with three other existing state-of-the-art summarizers to generate monolingual summaries independently. The objective of using four different summarizers is to understand which of these techniques is suitable for multilingual summarization. We have also analyzed which approach is more robust against the noise introduced in the data by translation.

The combination of monolingual summaries into the final multilingual summary is based on the quality of each summary against its input. The quality assessment is done using Jensen-Shannon divergence. The final summary is a linear combination of monolingual summary parts, where the size of each part is proportional to its quality. Redundancy is an even bigger issue in multilingual multi-document summarization because the overlap of information is higher: the most relevant information is often preserved, in variable forms, across the articles related to a topic. To address redundancy here, we have used Jaccard similarity, which measures the word overlap between the summary and a new sentence to be added. Experimentally calculated thresholds are used, and sentences above the threshold are discarded.
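A minimal sketch of this Jaccard-based redundancy filter follows; the threshold value below is illustrative, whereas the thesis estimates it experimentally.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_non_redundant(ranked_sentences, threshold=0.4, limit=10):
    """Greedily keep sentences whose word overlap with every already
    selected sentence stays below the threshold; sentences at or above
    the threshold are discarded as redundant."""
    selected = []
    for sent in ranked_sentences:
        tokens = sent.lower().split()
        if all(jaccard(tokens, s.lower().split()) < threshold for s in selected):
            selected.append(sent)
        if len(selected) == limit:
            break
    return selected
```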

1.5.3 Summarization of Online Conversations in the domain of Debates

To measure the effectiveness of the HAL representation, we used it in the task of debate summarization. Debates are different from chats and casual conversations: they are conducted in a formal manner and usually concern one of two opposing topics. We have used the usual sentence ranking approach to rank dialogue acts (the smallest unit of a debate). Each unit is ranked by a weighted linear combination of its feature vector. The features represent the topic 8 dependency and sentiments of each unit; other superficial features, such as position and coverage, are also used to rank the sentences. Evaluation of the final summaries is performed using

7 www.translate.google.com
8 Refers to the two opposing topics of the debate

ROUGE measures. System summaries are compared against a probabilistic variant of HAL, and the comparison shows that for tasks in which the input text is highly opinion rich, we cannot do away with opinion relevance features.
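A minimal sketch of this weighted linear combination is shown below. The feature names mirror those of Chapter 5, but the weights are placeholders: the actual weights are estimated experimentally.

```python
# Hypothetical feature weights; the thesis estimates these
# experimentally, the values here are placeholders only.
WEIGHTS = {
    "topic_sentiment": 0.30,      # topic directed sentiment score
    "topic_cooccurrence": 0.20,   # topic co-occurrence measure
    "document_relevance": 0.20,
    "sentence_position": 0.15,
    "sentence_length": 0.15,
}

def score_dialogue_act(features):
    """Relevance of a dialogue act as a weighted linear combination
    of its feature values (a dict keyed like WEIGHTS)."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
```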

1.6 Contributions of this work

The following are the contributions made in the process of solving the problem defined earlier:

1. We built an effective generic summarizer which is comparable to the state-of-the-art approaches. The summarizer has the following key contributions:

• Extending the concept combination characteristic of conceptual spaces to define sentences. • Proposing heuristics for combining concepts which underlie this extension, differing from previous approaches. • A novel theoretical framework for summary formation, supported by experimentally estimated parameters.

2. We built a framework, and its underlying steps, to form summaries from multilingual text. The system has the following key contributions:

• Using the system, we studied the effect of language interaction on the summaries. Our results show that the quality of summaries improves as the number of interacting languages increases. • The system methodology differs from previous translation based clustering techniques. • The system has been successfully implemented and studied on three languages, viz. English, Hindi and Telugu.

3. We built a summarization system using our generic summarizer in the domain of debates. The system outperforms the previous state-of-the-art systems. The system has the following key contribution:

• Intermediate features of our summarizer were added to the proven sentiment features, leveraging the final summary quality.

1.7 Thesis Organization

Chapter 2 presents the literature survey on summarization. It first describes the various factors used to classify summarization tasks and presents the types in tabular form. It then describes previous seminal works in the field of generic summarization, followed by relevant methods for summarizing multilingual documents and on-line conversations in the domain of debates. Summary evaluation using ROUGE and Jensen-Shannon divergence is discussed in the final parts of the chapter.

Chapter 3 describes the usage of conceptual spaces for multi-document summarization. It explains the theory behind conceptual spaces, given by Gärdenfors, and its use to represent text. The properties of HAL representations, which help in building sentence representations, are described. Then a conjecture for comparing two summaries, based on the overall sense of the documents they convey, is described. Based on this conjecture we describe our algorithm to create summaries. This is followed by different sets of experiments to estimate system parameters and to compare our system with previous works.

Chapter 4 describes our framework for multilingual multi-document summarization. It explains the notion of added noisy information and the necessity of observing its effect on summaries. Then the system architecture is described, followed by a description of all the generic summarizers used. Following this, we describe the method of forming multilingual summaries using Jensen-Shannon divergence. This is followed by different experiments to understand whether adding (noisy) information from different languages helps, and which summarizer is most suitable for multilingual summarization.

Chapter 5 describes the summarization of on-line conversations in the domain of debates. It describes the approach of forming rank based summaries, where ranking depends on various features, and then describes the calculation of these feature values. In the experimental section of the chapter, the calculation of weights for each feature is described. It also compares our system to previous state-of-the-art systems, and the results show that our system is effective.

Chapter 6 concludes the thesis, explaining the work done and describing the results of the experiments. It discusses the relevance of inferential properties in a representation and their effect on multilingual summaries. It elaborates on the utility of multilingual summarization and on the addition of information from different languages to leverage summary quality. It also provides details of future work with respect to the thesis.

Chapter 2

Related Work

2.1 Types of Summarization

With the advancement of the summarization field, the summary formation process has been classified based on various factors. The following factors are considered important for describing different types of summarization.

• Input factors: text length, number of documents, genre, external query, text language, summary model, text behavior.

• Purpose factors: who the user is, the purpose of summarization.

• Output factors: running text or headed text etc.

Summaries can be classified based on the number of source texts (single text vs. multiple text summarization). If the input contains documents in different languages, it is classified as multilingual summarization, otherwise as monolingual summarization. Based on the availability of a trained summary model, summarization can be classified as supervised vs. unsupervised. New genres of text have appeared, ranging from very short (like Twitter) and short (comments) to longer text (blogs, articles, news, etc.); these are classified based on the language structure of the text sentences (formal vs. informal). Sometimes the text information is updated regularly, like news reporting a month long event; in this case summaries are classified as update vs. static. Depending on the need of the user, an external query can be given for summarization, resulting in the classification of summaries as query dependent vs. query independent. Summaries which contain the same sentences as the source documents are called extracts, whereas summaries containing system generated sentences are called abstracts. Thus, depending upon the sentences in the summary, the summarization task can be classified as abstractive vs. extractive summarization. Table 2.1 describes the different types of summarization resulting from varying input, purpose and output factors.

Sentence Selection
• Extractive: Summaries contain sentences, phrases or words from the original text. The sentences are not modified and are selected based upon their importance to the text.
• Abstractive: An internal semantic representation of the text is built and natural language generation techniques are used to create a summary that is closer to what a human might generate.

Number of Documents
• Single-Document: Summary formation from a single document.
• Multi-Document: Producing a single summary from related source documents. Handling redundant information is a challenge when dealing with multiple documents.

External Query
• Query-Dependent: The query constraint gives the information requirement for the summary. Query dependent methods weight the input text with respect to the query and the final summary contains highly weighted sentences.
• Query-Independent: Usually referred to as generic summarizers. They select sentences based upon their overall importance to the input text.

Input Language
• Mono-Lingual: Input documents are in a single language. Methods are highly efficient and can use deep natural language analysis to form final summaries.
• Multi-Lingual: Input documents are in multiple languages. This is a relatively new field in summarization and maintaining an acceptable level of quality over different languages is a challenge.
• Cross-Lingual: Input criteria are similar to mono-lingual summarization. However, these methods use linguistic information from other languages to leverage the summary quality.

Sentence Structure
• Formal Text: Input documents are news articles, blog articles and formal documents. Input sentences are well-formed and the documents (usually) are self-contained.
• Informal Text: Input documents are social media chats, on-line discussion forums and comment sections. Input sentences are malformed, with heavy to minimal use of slang, colloquial phrases and abbreviations. These methods rely heavily on efficient preprocessing of the input text.

Learning Based
• Supervised: Knowledge models from documents and their corresponding summaries are learned. These methodologies are relatively recent and are developing along with machine learning techniques.
• Unsupervised: Previous summarization results or feedback are not used to create summaries. All summaries are formed from scratch and, once formed, are not used for any other summary formation step.

Document Behavior
• Static: Source information remains unchanged, so a summary once formed remains the same.
• Update: Source information changes as time progresses. The methods must take this change into account and update the summary accordingly. Novelty detection is a challenge in update summarization. Highly useful in the news domain.

Table 2.1 Types of Summarization

2.2 Generic Summarization

2.2.1 Feature based methods

Extractive text summarization uses the sentences of the text to create summaries. Feature based methods use various features to rank the sentences in the given document(s). Over the years, position based and frequency based features have been the most commonly used. The earliest work on summarization, by Luhn [39], used the number of word occurrences and the relative position of keywords within the sentence. The sentence scores reflect the number of occurrences of keywords within a sentence and the linear distance between them due to the presence of non-significant words. Later work [10, 32, 36] added features such as sentence position, topic signature, cue words, date annotation, etc. These features were used to score sentences, and the top sentences were selected for the summary. In MEAD [53], Radev et al. used various sentence level features like sentence length, sentence position and query overlap (if a query is given) using cosine similarity. A set of keywords was extracted from the documents and their occurrence in a sentence was used as a feature. The top sentence of a document was highly valued, and a sentence's similarity to the top sentence (of the document) was added as a feature. Clusters were created and sentences were scored using sentence and inter-sentence features.
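As a concrete illustration of this family of methods, the toy scorer below combines keyword frequency and sentence position, two of the features named above. The weights and the keyword cutoff are illustrative and not taken from any of the cited systems.

```python
from collections import Counter

def score_sentences(sentences, top_k=10):
    """Toy feature-based scorer in the Luhn/MEAD tradition: a sentence
    scores higher when it contains frequent keywords and appears early.
    The 0.8/0.2 weights and top_k cutoff are placeholders."""
    words = [w for s in sentences for w in s.lower().split()]
    keywords = {w for w, _ in Counter(words).most_common(top_k)}
    scores = []
    for pos, sent in enumerate(sentences):
        tokens = sent.lower().split()
        kw_density = sum(t in keywords for t in tokens) / max(len(tokens), 1)
        position = 1.0 / (pos + 1)  # earlier sentences score higher
        scores.append(0.8 * kw_density + 0.2 * position)
    return scores
```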

2.2.2 Graph based methods

Graph based approaches [66, 65, 56] represent text as a graph. Salton et al. [56] apply knowledge of text structure to perform automatic text summarization by passage extraction. They model the intra-document linkage pattern of a text as a graph whose edges are formed using cosine similarity; a greedy graph traversal technique is applied in chronological order to form the summary. In TextRank [44], each document is represented as a graph of nodes that stand for sentences, interconnected by a similarity (overlap) relationship. The overlap of two sentences is simply determined as the number of common tokens between the two sentences, normalized by the lengths of these sentences. Modified graph based ranking algorithms, such as PageRank [50], HITS [29] and PPF [21], are then used to rank the nodes. Motivated by the fact that a document contains various topic themes with varying levels of importance, Cluster-HITS [61] creates topic clusters to identify sentences on the same topics. A bipartite graph is formed between the clusters and the sentences based on cluster-sentence similarity (word overlap). Sentence scores are calculated by applying HITS on the graph, and the top scored sentences are used to form the summary.
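A minimal sketch of the TextRank idea follows, assuming the networkx library for PageRank; the overlap normalization mirrors the description above, while the tokenization is simplified to whitespace splitting.

```python
import itertools
import networkx as nx  # assumed dependency; provides PageRank

def textrank(sentences):
    """TextRank-style ranking: nodes are sentences, edge weights are
    common-token counts normalized by sentence lengths, and PageRank
    scores the nodes. Returns a dict: sentence index -> score."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    token_sets = [set(s.lower().split()) for s in sentences]
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(token_sets[i] & token_sets[j])
        if overlap:
            norm = len(token_sets[i]) + len(token_sets[j])
            g.add_edge(i, j, weight=overlap / norm)
    return nx.pagerank(g, weight="weight")
```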

2.2.3 Lexical chain based methods

Lexical chain based approaches [3, 34, 67] create lexical chains to represent the text. Lexical chains can be formed using available knowledge sources (like WordNet [6]) and other lexical features. Barzilay et al. [3] compute lexical chains in a text by merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger and a shallow parser for the identification of nominal groups. Summarization proceeds in three steps: the original text is first segmented, lexical chains are constructed, and strong chains are identified and significant sentences extracted from the text. Construction of chains is a generative process where edges are created based on semantic sense (given by WordNet) and strength is calculated using inter-sentence distance and frequency of co-occurrence. The strength of a chain is decided by its length and its number of distinct members. Sentences which contain highly weighted chains are selected for the summary, and redundancy is reduced by including all lexical chain members in the summary.

Zhou et al. [67] used lexical chains in a multi-document summarization system, IS SUM. The approach was divided into 4 components: preprocessing, clustering, summarization and compression. The preprocessing step extracts relevant text from XML files; the text is marked with POS tags, words are stemmed and word frequencies are calculated. Clustering is done based on inter-document similarity, computed by combining cosine similarity and phrase similarity. Lexical chains were formed for each cluster and used to create a Document Index Graphic. Chains containing more key-phrases (nouns and verbs) were given higher scores. Once all the chains have been built, the strongest chains of each cluster are selected to create the summary. Compression is achieved by the use of Maximal Marginal Relevance (MMR). Li et al. [34] modified IS SUM, improving its lexical chain algorithm for efficiency, applying WordNet for similarity calculation and adapting it to query-focused multi-document summarization.

2.2.4 Other relevant methods

Redundancy removal has been a big issue in summarization. Carbonell et al. [4] proposed Maximal Marginal Relevance to balance information novelty and importance when creating non-redundant summaries. Another approach is based on latent semantic indexing, in which singular value decomposition (SVD) is used to decompose a term-by-document matrix; the resultant eigenvalues are used to rank the sentences for generic text summarization [17]. A holistic summarizer, HolSum [18], was proposed which starts from an initial summary 1 and then uses a standard hill climbing algorithm to select similar summaries such that the new summary is more similar to the original text. Recently, document summarization based on data reconstruction [20] has been proposed, in which the document is reconstructed by a linear combination of the selected sentences; an optimization function is used to get the sentences that are most informative with minimal redundant information.

1 Lead sentences are selected as the initial summary
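A minimal sketch of the LSA-style ranking idea behind [17] is shown below, assuming a precomputed term-by-sentence count matrix. Scoring sentences by their weighted presence in the top latent dimensions is one common variant of this approach, not necessarily the exact formulation of the cited work.

```python
import numpy as np

def lsa_rank(term_sentence_matrix, k=3):
    """Rank sentences via latent semantic analysis: decompose the
    term-by-sentence matrix with SVD and score each sentence by its
    weight in the top-k latent topics (singular values as weights).
    Returns sentence indices in descending order of importance."""
    u, s, vt = np.linalg.svd(term_sentence_matrix, full_matrices=False)
    k = min(k, len(s))
    # each column of vt corresponds to a sentence; weight latent
    # dimensions by their singular values
    scores = np.sqrt((s[:k, None] ** 2 * vt[:k] ** 2).sum(axis=0))
    return np.argsort(-scores)
```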

2.2.5 HAL based methods

Earlier uses of HAL have been primarily in query dependent summarization tasks. Motivated by the observation that metrics based on key-concept overlap give better results than metrics based on n-gram and sentence overlap, Jagadeesh et al. [26] combined relevance based language modeling, Latent Semantic Indexing and special words to create summaries. The relevance of a sentence to a query was calculated by adding the HAL scores between words in the sentence and words in the query. In later work [27], features based on sentence importance, independent of the query, were added. These features were calculated using external documents extracted from the web using the given query; the addition resulted in further improvement of system performance. Ma et al. [41] score sentences through the importance of their words and use a modified MMR technique to adjust the score of the candidate sentence. Word importance is decided by its query dependent score and its topic related score, calculated from HAL scores and likelihood with respect to query terms respectively. He et al. [19] use a similar approach to produce summaries of relevant documents acquired from user-feedback information and transductive-inference SVM machine learning. Morita et al. [46] use a HAL-like approach to generate query dependent summaries: a co-occurrence graph is built to obtain words that augment the original query terms and enrich the information need, and the summarization problem is then formulated as a Maximum Coverage Problem with Knapsack Constraints based on word pairs rather than single words.

All these approaches focus on query dependent summarization, where summaries are influenced by the query: HAL scores are used to compute the relevance of words to the query words, and sentences containing highly relevant words are selected to form summaries. Our work on generic summarization differs from these approaches entirely, because we do not have a query with which to formulate a summary. We have used the HAL representation for its inferential properties, deviating from previous uses which employ it to calculate query relevance. However, for later work on on-line conversations in the debate domain we made use of HAL in a similar manner.

2.3 Multilingual Summarization

Summarization for languages other than English has been done for Scandinavian languages [7] and in the SUMMARIST project [23], which includes Indonesian. Both systems implement various language independent features such as keywords (calculated automatically), term frequency, position and special text elements to score and rank sentences. SUMMARIST also employs an optimal position policy, where positional scores were generated using a set of documents and their pseudo-ideal summaries. Top ranked sentences are used to form the summary. The Keizai system [49] is a Cross Language Text Retrieval system with summarization as a feature that gives the user an overall comprehension of a document. The system produces summaries in Japanese and Korean using statistical and symbolic techniques; these summaries are translated to English and both versions are displayed to the user.

Saggion et al. [54] used an English-Arabic alignment table to translate documents and then used centroid-based sentence extraction techniques to form summaries; the final output contains sentences in English only. Columbia Newsblaster [12] performs multi-lingual summarization by translating the documents and then using clustering based methods to generate summaries. They focus upon the quality of summarization systems for a single language by shifting the majority of the multi-lingual knowledge burden to a specialized machine translation system. Similar work has been done by Chen and Lin [6], who perform multi-lingual news summarization in Chinese and English. Our method differs from previous work because the added noisy (translated) data affects the quality of the summary. Thus, even though the summaries are generated in a given language, they contain information from other languages. This means that the score of a sentence is calculated with respect to all the sentences from all the languages.

2.4 Summarization of Online Conversations in the domain of Debates

In the context of summarizing on-line conversations, which are rich in opinions, identification of opinion containing sentences is important. Sentence relevance is further decided by sentiment scores, topic relevance and other lexical and positional features. Earlier works mainly focused on reviews [51, 24, 48] and used lexical features (unigrams, bigrams and trigrams), part-of-speech tags and dependency relations.

Ku et al. [31] performed opinion summarization in the news and blog domains. They propose opinion extraction at word, sentence and document level. For each new word, the distribution of its (Chinese) characters as positive and negative polarity in a manually created seed vocabulary is used to determine the sentiment of the word. These scores are compounded to compute sentence scores and then document scores. The presence of negation operators decides the sentiment tendency at sentence level, which further propagates to document level. Wang et al. [62] performed opinion summarization on conversations. They used a linear combination of features from different aspects, including topic relevance, subjectivity and sentence importance, to score sentences. They also proposed a graph based method which incorporates topic and sentiment information, as well as additional information about sentence-to-sentence relations extracted from dialogue structures.

Summarization in the specific domain of on-line debates is a novel field. This domain differs from chatting and conversation because it is more formal and focuses on specific topics. An argument may contain various pieces of factual knowledge, but they are usually related to one or the other topic. Similarly, it differs from news and blogs because it is comparatively richer in sentiment. Therefore, summarization by opinion mining in debates is an interesting and challenging task.

2.5 Summary Evaluation

Evaluation of summaries is necessary and advantageous to automatic summarization, as with other language understanding technologies: it can foster the creation of reusable resources and infrastructure; it creates an environment for comparison and replication of results; and it introduces an element of competition to produce better results [42]. System evaluation is done in two ways, intrinsic and extrinsic. Intrinsic evaluation assesses the summarization system internally, whereas extrinsic evaluation assesses the utility of the summarization system on a real world task, such as reading comprehension (answering a set of questions after reading the summary) or relevance assessment (evaluating whether the relevance of a summary to a given topic is the same as that of the source document) [45]. Both intrinsic and extrinsic evaluation are necessary and serve different purposes. Intrinsic evaluation is done to improve the system's accuracy and polish its results; extrinsic evaluation is needed to understand the extent to which the system is able to accomplish a task involving summarization.

Various methods have been proposed in both directions, involving different degrees of human effort. There is a trade-off between the amount of human work and the effectiveness of the evaluation measure. Effort has been put into automating the human effort or replicating human evaluation measures, thereby making the evaluation process less expensive. We present two evaluation measures: ROUGE [35], which stands for Recall Oriented Understudy for Gisting Evaluation, and Jensen-Shannon divergence [38]. The first method uses human reference summaries to evaluate the system and is therefore expensive. However, it is effective and has been used extensively to evaluate results in the TAC conferences. The second measure involves no human effort and evaluates the summary based on its divergence from the input set of documents. This measure is useful for evaluating multi-lingual summarization, where manual evaluation requires more effort as well as skillful annotators.

2.5.1 ROUGE

ROUGE [35] stands for Recall Oriented Understudy for Gisting Evaluation. It includes measures that automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units, such as n-grams, word sequences and word pairs, between the computer-generated summary to be evaluated and the ideal human summaries. We describe the ROUGE-N, ROUGE-L and ROUGE-SU* measures used for evaluation purposes in our work.

2.5.1.1 ROUGE-N

This measure is an n-gram recall between a candidate summary and a set of reference summaries:

ROUGE-N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)} \qquad (2.1)

Where n stands for the length of the n-gram gram_n, and Count_{match}(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and the reference summaries.
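A minimal sketch of Equation 2.1, assuming whitespace tokenization; the clipping of matched counts follows the Count_{match} definition above.

```python
from collections import Counter

def rouge_n(candidate, references, n=2):
    """ROUGE-N as in Eq. 2.1: clipped n-gram matches against each
    reference, divided by the total n-gram count in the references."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand = ngrams(candidate)
    matches = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        total += sum(ref_counts.values())
        # each n-gram's match count is clipped by its candidate count
        matches += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return matches / total if total else 0.0
```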

2.5.1.2 ROUGE-L

This measure uses the longest common subsequence (LCS) between the candidate summary and a reference summary to estimate the similarity between the two. It effectively captures sentence level structure. Let X be the reference summary sentence of length m and Y be the candidate summary sentence of length n. Then Recall, Precision and F-measure are calculated in the following manner:

R = \frac{LCS(X, Y)}{m} \qquad (2.2)

P = \frac{LCS(X, Y)}{n} \qquad (2.3)

F = \frac{2RP}{R + P} \qquad (2.4)

Where LCS(X, Y) is the longest common subsequence between X and Y.
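Equations 2.2-2.4 can be computed with the standard dynamic-programming LCS; a minimal sketch, assuming whitespace tokenization:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists,
    via standard dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """Eqs. 2.2-2.4: LCS-based recall, precision and F-measure."""
    x, y = reference.lower().split(), candidate.lower().split()
    if not x or not y:
        return 0.0
    lcs = lcs_length(x, y)
    r, p = lcs / len(x), lcs / len(y)
    return 2 * r * p / (r + p) if r + p else 0.0
```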

2.5.1.3 ROUGE-SU*

This measure uses the skip-bigram 2 count between the candidate summary and a reference summary to estimate the similarity between the two. It is sensitive to word order without requiring consecutive matches. Unigram matches are also included in this measure, to give credit to a candidate sentence even if it has no word pair co-occurring with its reference. Recall, Precision and F-measure are calculated in the following manner:

R = \frac{SKIP2(X, Y) + Count_{match}(unigram)}{C(m, 2) + m} \qquad (2.5)

P = \frac{SKIP2(X, Y) + Count_{match}(unigram)}{C(n, 2) + n} \qquad (2.6)

F = \frac{2RP}{R + P} \qquad (2.7)

Where SKIP2(X, Y) is the number of skip-bigram matches between X and Y, and C(m, 2) and C(n, 2) are the total skip-bigrams in the two sentences. Spurious matches are reduced by limiting the maximum skip distance d_{skip} between two in-order words; the denominators of 2.5 and 2.6 are calculated accordingly. In TAC (DUC), d_{skip} is set to 4 and is deemed sufficient to capture summary similarity reliably. For our work, we have used F-scores for evaluation purposes, as they represent both the precision and recall aspects, for different matches: unigram (ROUGE-1), bigram (ROUGE-2), longest subsequence (ROUGE-L) and skip-bigram with unigram (ROUGE-SU*).

2 Skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps.
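A sketch of the skip-bigram counting underlying Equations 2.5-2.6, with d_skip = 4 as in TAC/DUC; the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, d_skip=4):
    """All in-order word pairs whose gap is at most d_skip."""
    return Counter(
        (a, b)
        for (i, a), (j, b) in combinations(enumerate(tokens), 2)
        if j - i <= d_skip
    )

def skip2_matches(x, y, d_skip=4):
    """SKIP2(X, Y): clipped skip-bigram matches used in Eqs. 2.5-2.6."""
    sx, sy = skip_bigrams(x, d_skip), skip_bigrams(y, d_skip)
    return sum(min(c, sy[g]) for g, c in sx.items())
```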

2.5.2 Jensen-Shannon divergence

Louis et al. [38] evaluated various measures for content selection evaluation in summarization that do not require the creation of human model summaries. Three different types of measures were studied, as follows:

1. Distributional Similarity based: Based on the assumption that good summaries are characterized by low divergence between probability distribution of words in the input and summary, and by high similarity with the input. Experiments were done using Kullback Leibler Divergence, Jensen Shannon Divergence and cosine similarity measures.

2. Summary Likelihood: Likelihood of a word appearing in the summary is approximated as being equal to its probability in the input. Summary’s unigram probability and probability under a multinomial model were calculated.

3. Use of Topic words in the summary: Summaries containing topic signatures during content se- lection have been usually considered better. Thus, coverage based and common topic signatures were calculated between summary and the input.

The results showed that Jensen-Shannon divergence, which measures the word-distribution dissimilarity between the summary and the input, performed best among all the measures. Saggion et al. [55] further studied JS divergence and found a positive medium-to-strong correlation between the system rankings produced by ROUGE and those produced by divergence measures that do not use model summaries. The JS divergence between two probability distributions P and Q is given by,

J(P||Q) = \frac{1}{2}\left[D(P||A) + D(Q||A)\right] \qquad (2.8)

where,

D(P||Q) = \sum_w p_P(w) \log_2 \frac{p_P(w)}{p_Q(w)}

A = \frac{P+Q}{2}

Here, P represents the summary and Q represents the input documents to which the summary is compared. Furthermore, p_P(w) and p_Q(w) represent the probability of word w in P and Q respectively.
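To make the measure concrete, here is a minimal Python sketch of Equation 2.8 over unigram distributions; unsmoothed maximum-likelihood estimates are an assumption on our part.

```python
import math
from collections import Counter

def js_divergence(summary_tokens, input_tokens):
    """Jensen-Shannon divergence (Eq. 2.8) between the word distributions of
    a summary (P) and its input (Q). Zero-probability terms contribute
    nothing to the corresponding KL part, so the value is always defined."""
    p, q = Counter(summary_tokens), Counter(input_tokens)
    n_p, n_q = sum(p.values()), sum(q.values())
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p[w] / n_p, q[w] / n_q
        aw = (pw + qw) / 2                        # A = (P + Q) / 2
        if pw:
            js += 0.5 * pw * math.log2(pw / aw)   # contributes to D(P||A)
        if qw:
            js += 0.5 * qw * math.log2(qw / aw)   # contributes to D(Q||A)
    return js
```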

2.6 Concluding Remarks

Throughout the literature survey we came across various methods to summarize text. The methods depend upon the type of summarization problem they try to solve; summaries are classified by input, purpose and output factors. For example, query-dependent summarization techniques use the given query to select relevant sentences. Other approaches exploit inherent characteristics of the input text: monolingual summarization has many approaches that make use of language-dependent morphological and dependency features (especially lexical-chain-based methods), and some recent methods [57] use conditional random fields (CRFs) built upon part-of-speech tags of the text to construct summarization models. In the context of multilingual summarization, methods use translation, transliteration or both before summarizing the documents. Most approaches follow a cluster-and-map strategy to build summaries; work has been done on Scandinavian, Chinese and Arabic languages. The field is relatively new and depends upon the quality of translation systems, so improvements in translation systems will directly benefit it. Another issue with multilingual summarization is evaluation: manual creation of multilingual summaries is a very expensive process and requires highly skilled annotators. Nevertheless, multilingual summarization is necessary to build summarization systems that can account for the information distributed among different languages.

Chapter 3

Multi-Document Summarization Using Conceptual Spaces

3.1 Motivation of our Approach

Usually, summarization approaches are based on heuristics frequently observed in the data. Heuristics are patterns such as: similar sentences contain similar words; top (position-wise) sentences are more important to the summary; highly frequent words are more important; etc. However, representing a sentence with features like position, frequency of constituent words and length cannot capture its meaning. Furthermore, sentence similarity based on word overlap conveys statistical rather than semantic similarity. These issues can be handled effectively if the representation contains contextual and inferential information. Earlier, features like latent dimensions [9] and lexical chains were used to uncover word-word and word-document similarity. However, latent dimensions cannot be explained in a symbolic sense, while lexical chains depend on the availability of reliable and efficient knowledge sources; the development of such resources is itself a major research problem in many languages.

Heuristics reflect a thought process, and the high occurrence of such patterns across multiple texts justifies the existence of a corresponding cognitive process. However, justifying its existence does not provide insight into the actual cognitive process. For example, if we come across a textual entity of the form "1 + 1 = 2", heuristics can identify similar patterns like "x + y = z". The pattern suggests that "+" represents an existing but undefined process. In reality, "+" here represents accumulation, which cannot be identified using heuristic features. Therefore we need to imbibe cognitive knowledge in our representation to bring out the underlying meaning of a pattern. Incorporating such knowledge could prove helpful in summarization, because we can then select sentences which are more meaningful for the summary. The selected sentences are inferentially rich, giving a better understanding of what the documents are all about.

3.2 Text Representation Overview

We use the HAL representation, which represents text in a highly meaningful manner. In this section we briefly describe the algorithm to form HAL vectors, using the sentence below as a running example.¹

“The aggressive identification and treatment of HIV-infected intravenous drug users with latent tuberculous infection is therefore of both clinical and public health importance,” wrote Dr. Peter A. Selwyn of Montefiore Medical Center in New York.

Step 1: A co-occurrence matrix is formed for the complete dataset (Section 3.3.2). In this matrix every row represents a single vector; the rows relevant to the example sentence are shown below.²

aggressive [10] people: 5.0, identification: 5.0, workers: 4.0, treatment: 4.0, care: 3.0, hiv: 3.0, health: 2.0, infected: 2.0, live: 1.0, intravenous: 1.0
identification [10] aggressive: 5.0, treatment: 5.0, people: 4.0, hiv: 4.0, workers: 3.0, infected: 3.0, care: 2.0, intravenous: 2.0, drug: 1.0, health: 1.0
treatment [62] tuberculosis: 26.0, drug: 12.0, cases: 12.0, americans: 12.0, mandatory: 10.0, undergo: 8.0, providing: 8.0, people: 7.0, tb: 7.0, patients: 7.0
hiv [76] infected: 28.0, tb: 28.0, virus: 20.0, percent: 17.0, people: 16.0, aids: 13.0, carried: 12.0, twenty: 9.0, drug: 9.0, causes: 9.0
infected [121] tb: 34.0, people: 34.0, million: 29.0, aids: 28.0, tuberculosis: 28.0, hiv: 28.0, bacteria: 28.0, virus: 19.0, whites: 15.0, americans: 15.0
intravenous [10] drug: 5.0, infected: 5.0, hiv: 4.0, users: 4.0, treatment: 3.0, latent: 3.0, identification: 2.0, tuberculous: 2.0, aggressive: 1.0, infection: 1.0
drugs [24] six: 9.0, months: 8.0, developing: 7.0, countries: 7.0, compliance: 7.0, major: 7.0, combat: 5.0, cost: 5.0, person: 5.0, tuberculosis: 5.0
users [18] drug: 10.0, study: 6.0, latent: 5.0, methadone: 5.0, tuberculous: 4.0, conducted: 4.0, intravenous: 4.0, program: 4.0, infection: 3.0, infected: 3.0
latent [18] percent: 7.0, drug: 6.0, users: 5.0, tuberculous: 5.0, tb: 5.0, americans: 5.0, infections: 4.0, infection: 4.0, clinical: 3.0, intravenous: 3.0
tuberculous [10] latent: 5.0, infection: 5.0, users: 4.0, clinical: 4.0, drug: 3.0, public: 3.0, intravenous: 2.0, health: 2.0, infected: 1.0, importance: 1.0
infection [104] tuberculosis: 12.0, tb: 12.0, whites: 12.0, aids: 10.0, stead: 10.0, drug: 8.0, blacks: 8.0, hiv: 8.0, nursing: 8.0, diseases: 8.0
clinical [10] infection: 5.0, public: 5.0, health: 4.0, tuberculous: 4.0, latent: 3.0, importance: 3.0, users: 2.0, wrote: 2.0, drug: 1.0, dr: 1.0
public [25] health: 15.0, officials: 8.0, clinical: 5.0, states: 5.0, tuberculosis: 5.0, trend: 5.0, united: 4.0, infection: 4.0, epidemic: 4.0, importance: 4.0
health [162] officials: 34.0, tuberculosis: 32.0, department: 30.0, aids: 17.0, public: 15.0, epidemic: 13.0, dr: 11.0, city: 11.0, federal: 10.0, year: 9.0, workers: 9.0
importance [10] health: 5.0, wrote: 5.0, public: 4.0, dr: 4.0, clinical: 3.0, peter: 3.0, infection: 2.0, selwyn: 2.0, tuberculous: 1.0, montefiore: 1.0
wrote [10] importance: 5.0, dr: 5.0, health: 4.0, peter: 4.0, public: 3.0, selwyn: 3.0, clinical: 2.0, montefiore: 2.0, infection: 1.0, medical: 1.0
dr [110] tuberculosis: 15.0, health: 11.0, george: 10.0, director: 10.0, commissioner: 9.0, myers: 9.0, whites: 8.0, jr: 6.0, blacks: 5.0, william: 5.0
peter [10] dr: 5.0, selwyn: 5.0, wrote: 4.0, montefiore: 4.0, importance: 3.0, medical: 3.0, health: 2.0, center: 2.0, public: 1.0, york: 1.0
selwyn [10] peter: 5.0, montefiore: 5.0, dr: 4.0, medical: 4.0, wrote: 3.0, center: 3.0, importance: 2.0, york: 2.0, people: 1.0, health: 1.0
montefiore [10] selwyn: 5.0, medical: 5.0, peter: 4.0, center: 4.0, dr: 3.0, york: 3.0, people: 2.0, wrote: 2.0, importance: 1.0, carry: 1.0
medical [29] montefiore: 5.0, center: 5.0, legislative: 5.0, submitted: 5.0, beds: 5.0, wards: 5.0, welfare: 4.0, selwyn: 4.0, proposal: 4.0, patients: 4.0
center [19] tuberculosis: 5.0, medical: 5.0, york: 5.0, disease: 5.0, informed: 5.0, people: 4.0, lung: 4.0, montefiore: 4.0, diena: 4.0, increase: 3.0
york [46] city: 16.0, tuberculosis: 15.0, cases: 12.0, aids: 8.0, tb: 8.0, united: 6.0, medicine: 5.0, blacks: 5.0, people: 5.0, population: 5.0

Step 2: Next, we perform vector addition over all the words of the sentence (Section 3.3.3), which yields the sentence vector shown below:

1 DUC-2001 dataset d15c, DOC NO: AP890302-0063
2 The rows shown are those relevant to the example sentence.

< (tuberculosis: 110.000), (tb: 82.000), (people: 68.000), (aids: 65.000), (hiv: 64.000), (drug: 54.000), (health: 45.000), (officials: 42.000), (americans: 41.000), (infection: 41.000), (infected: 38.000), (department: 38.000), (million: 38.000), (percent: 38.000), (public: 37.000), (bacteria: 37.000), (treatment: 35.000), (dr: 34.000), (latent: 34.000), (users: 33.000), (cases: 31.000), (tuberculous: 30.000), (study: 29.000), (clinical: 29.000), (whites: 27.000) >

Step 3: The resultant sentence vectors are normalized across the dimensions (words) over all the sentences, and summaries are formed from the resulting normalized sentence vectors. Some words are shown in boldface so that readers can relate the representation to the sentence; the decimals show the weight given to each word.
< (peter: 0.192), (tuberculous: 0.156), (clinical: 0.156), (wrote: 0.144), (importance: 0.144), (intravenous: 0.118), (selwyn: 0.098), (users: 0.084), (latent: 0.074), (alcohol: 0.064), (identification: 0.062), (philip: 0.059), (san: 0.052), (worried: 0.051), (aggressive: 0.049), (illness: 0.045), (day: 0.045), (panel: 0.045), (multiple: 0.043), (diseases: 0.043), (francisco: 0.042), (eliminate: 0.042), (public: 0.041), (rehabilitation: 0.041), (montefiore: 0.038) >

Observe that some words appear in the representation even though they are not in the sentence. All these words are important to the context of the sentence and help us infer useful information about it. High weight is given to the words “peter”, “selwyn” and “montefiore”, representing the speaker of this sentence. This is followed by “tuberculous”, which outlines the type of infection, and the word “alcohol”, which is equally harmful as drugs (suggested by another line from the same text). A sense of importance is conveyed by the words “clinical” and “importance”, while “aggressive” highlights the intensity of the measures required in this case. Note that “philip”, “san” and “francisco” also have high weights; this is because d15c (the set of documents) contains a sentence with a similar sense, spoken by Dr. Philip C. Hopewell of San Francisco. The subsequent Section 3.3 describes the theory, properties and creation of conceptual spaces in detail; Section 3.4 describes the summary formation process from these vectors.

3.3 Conceptual spaces as a representative model

3.3.1 Gärdenfors' Conceptual Spaces

Conceptual space is one of the three levels of the cognitive model proposed by Gärdenfors [15]. According to this model, cognitive representation can occur at three levels: symbolic, connectionist³ and conceptual. Symbolic representation tends to view every process as symbol manipulation that can be modeled by Turing machines. Connectionist representation focuses on associations between elements. Conceptual representation derives sense from the geometrical structure of its elements. The overall relation among the three representations can be understood as follows.

3Connectionism is a special case of associationism which is modeled using artificial neural networks

Given any representation, the symbolic level draws out the characteristics and functioning of all the symbolic entities; each symbol is then connected to the others at the connectionist level; and this geometrical composition is used by the conceptual representation to infer sense, adding meaning to the complete representation.

In more abstract terms, a conceptual space CS consists of a class of quality dimensions D_1, D_2, ..., D_n. A quality dimension refers to a characteristic property that is important for describing information uniquely. A point in CS is represented by a vector v = < d_1, d_2, ..., d_n >, with one index for each dimension [15]. Within this space, a concept is defined as a convex region, meaning that all the points present in this region represent the same semantic sense with minor variations. HAL has previously been used successfully to create conceptual spaces [58], which motivated us to use HAL space to build the conceptual space from the documents. HAL is a representational model of semantic memory based on the idea that, when humans encounter a new concept, they derive its meaning from accumulated experience; that is, the meaning of a concept can be acquired through its usage with other concepts within the same context [40]. Throughout the following text we use the words "dimension" and "concept" interchangeably, because the HAL space is defined by the words of the documents as its dimensions and each such word represents a concept in itself (defined in this space). So "dimension" refers to the role of a word acting as a building block, whereas "concept" refers to the role that defines its own meaning in the document.

3.3.2 Forming Conceptual Spaces using HAL

Given a lexicon of n words, HAL is an n × n co-occurrence matrix in which each element contains the cumulative co-occurrence score between two words. The cumulative co-occurrence score is obtained by accumulating the scores between the two words over the whole document while moving a window of size K. The co-occurrence score between two words at a distance k is the product of (K − k + 1) and the frequency of their occurrence at distance k. Thus, the cumulative co-occurrence score between two words over the complete set of documents is given by,

Score(w_i|w_j) = \sum_{k=1}^{K} n_k \cdot (K - k + 1) \qquad (3.1)

where n_k is the frequency of occurrence of w_i and w_j at a distance k. HAL is direction sensitive, as the co-occurrence information for words preceding each word and for words following each word is recorded separately by row and column vectors [59]; thus, the dimensionality of each word is 2n. Similar to [58], we do not consider the direction sensitivity of word pairs: the row and column vectors are merged into one, reducing the dimensionality of each vector to n. Within HAL space, a concept is defined as a weighted vector,

c_i = < wt_{i1}, wt_{i2}, ..., wt_{in} >

Figure 3.1 Concept combination in a 3-dimensional conceptual space, where the combined concept is more refined.

where wt_{ij} is the weight of the concept c_i along dimension d_j. The weight shows the strength of the contextual similarity that exists between c_i and the concepts representing the dimensions in the documents. Consider the following example of the concept vector tuberculosis.

tuberculosis: < (cases: 108.0), (aids: 72.0), (people: 59.0), (bacteria: 42.0), (active: 41.0), (disease: 40.0), (health: 32.0), (risk: 29.0), (infected: 28.0), (percent: 27.0), (year: 27.0), (treatment: 26.0), (number: 25.0), (epidemic: 24.0), (united: 23.0), (tb: 23.0), (virus: 23.0), (case: 23.0), (vermund: 22.0), (reported: 21.0), (states: 19.0), (patients: 18.0), (years: 17.0), (morbidity: 17.0), (control: 17.0) >

It can be observed that tuberculosis is a bacterial health disease, indicated by the relatively high weights of “bacteria”, “disease” and “health”. Moreover, “cases”, “aids” and “people” are given higher weights because the documents⁴ talk about an increase in tuberculosis cases due to AIDS. This shows that HAL preserves the contextual meaning of a word along with its conceptual meaning and efficiently captures the inferential characteristics of the documents. As a result, HAL vectors form an effective implementation of a conceptual space.
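The construction just described can be sketched in a few lines of Python. The folded (direction-insensitive) form is used, tokens are assumed to be the stopword-filtered words of the documents, and all names are illustrative rather than the thesis implementation.

```python
from collections import defaultdict

def build_hal(tokens, K=6):
    """Folded HAL matrix per Equation 3.1: a pair of words at distance k
    (k <= K) contributes K - k + 1; preceding and following contexts are
    merged, so each word vector has n dimensions."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for k in range(1, K + 1):
            if i + k < len(tokens):
                c = tokens[i + k]
                hal[w][c] += K - k + 1   # w ... c at distance k
                hal[c][w] += K - k + 1   # fold: record both directions
    return hal
```

Run over the d15c documents, such a matrix would yield rows comparable to the tuberculosis vector above, up to the weighting details of the original implementation.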

3.3.3 Sentences in Conceptual Space

An important characteristic of conceptual space is the ability to define new concepts by combining existing ones. Figure 3.1 shows the effect of concept combination in conceptual space: the combined concept envelopes more space, meaning the new concept is more meaningful than the concepts composing it. This allows the formation of various meaningful concepts within the domain of our space. An implementation of this characteristic was proposed as a 4-step heuristic approach [58], described below.

In this approach, two concepts c_1 and c_2 are combined by first re-weighting so that higher weights are assigned to the dimensions of the dominant concept⁵ (assume c_1 here). Then, common dimensions are strengthened by a factor greater than 1; strengthening ensures that a common dimension has a better chance of becoming a quality dimension of the resultant concept. Finally, the two concepts are composed

4 DUC2001: d15c
5 A dominant concept is the more significant amongst the two concepts.

together to form the resulting concept c_1 ⊕ c_2:

wt_{(c_1 ⊕ c_2)i} = wt_{1i} + wt_{2i}

Finally, the vector c_1 ⊕ c_2 is normalized so that combined concepts can be compared at the same level. Our approach has similar motivations, but we differ in our heuristics. The following are our underlying heuristic factors.

1. The heuristic of the dominant concept is not used in concept combination, because all concepts are considered equivalent to each other. This keeps the summary unbiased towards any concept based on an initial judgment.

2. Strengthening of overlapping dimensions is not done, because concept combination takes place among words occurring relatively close together. In a given context, close terms are usually used together; as a result they share many common dimensions, which automatically get strengthened after combination.

3. Instead of normalizing each concept across its dimensions, we normalize each dimension over all the vectors. This serves two purposes:

(a) The weight for every dimension is scaled to the range [0, 1], making the weights of different dimensions comparable to each other.
(b) The concepts (in this case sentences) are now represented as a left-stochastic matrix, in which all columns sum to 1. Thus, our documents are now represented by a fixed point in the HAL space where all the dimensions have value 1.

Based upon the above heuristics, we create sentence vectors using the following steps:

Step 1: Given a sentence s_i = w_1, w_2, w_3, ..., w_{l_i} with l_i words⁶, the composition of all the words yields the representation of the sentence in the conceptual space:

s_i = w_1 ⊕ w_2 ⊕ w_3 ⊕ ... ⊕ w_{l_i} \qquad (3.2)

Step 2: Sentence vectors are normalized along the dimensions as follows:

wt_{ij} = \frac{wt_{ij}}{\sum_{k=1}^{m} wt_{kj}}, \quad \forall j \in \{1, ..., n\} \qquad (3.3)

where i denotes the i-th sentence and m is the number of sentences in the documents. The resultant sentence vector encapsulates the inherent meaning and context of the words composing it.

6After removing stopwords.
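Steps 1 and 2 can be sketched in Python, assuming a HAL matrix in the dict-of-dicts form of the earlier sketch and NumPy for the column normalization; the names are illustrative.

```python
import numpy as np

def sentence_vectors(sentences, hal, vocab):
    """Compose each sentence by adding the HAL vectors of its words (Eq. 3.2),
    then normalize every dimension over all sentences (Eq. 3.3), making the
    sentence-by-word matrix left stochastic (each column sums to 1)."""
    index = {w: j for j, w in enumerate(vocab)}
    S = np.zeros((len(sentences), len(vocab)))
    for i, sent in enumerate(sentences):          # stopword-free token lists
        for w in sent:
            for c, wt in hal.get(w, {}).items():  # combination = addition
                if c in index:
                    S[i, index[c]] += wt
    col_sums = S.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                 # guard unused dimensions
    return S / col_sums
```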

This can be examined through the following example sentence and its vector in the conceptual space.⁷
“The doctors warned that besides being at risk of getting tuberculosis themselves, AIDS-infected addicts who carry the TB bacteria also may pass the germs to people they live with, to health care workers and other people.”

< (germs: 0.106), (pass: 0.101), (live: 0.099), (special: 0.097), (care: 0.094), (aggressive: 0.077), (intervention: 0.065), (montefiore: 0.064), (identification: 0.062), (minority: 0.060), (administer: 0.060), (contracting: 0.055), (myers: 0.054), (strategies: 0.054), (showing: 0.053), (scared: 0.052), (symptoms: 0.051), (selwyn: 0.049), (largely: 0.049), (review: 0.049), (added: 0.048), (elderly: 0.048), (capacity: 0.048), (warned: 0.047), (died: 0.047) >

In the above example, the words “aggressive”, “special”, “contracting” and “intervention”⁸ do not occur in the sentence, yet they are weighted highly in the vector. We observe this because, in the context of the documents, the sentence suggests that AIDS-infected people having TB should be given special care; so special measures, like intervention, should be taken by the officials to stop the disease from spreading further. This shows that a sentence vector obtained by combination has inferential characteristics. These observations mirror those for a single-word concept; as a result, a sentence vector can be interpreted as a concept in the HAL space. Concepts obtained by combination are more refined and are capable of disambiguating between multiple contexts. These two properties of combined concepts have been shown in [58] by vertical and horizontal tests respectively. Here, we show the effect of combination on a sentence concept with respect to these properties. Consider the following two sentences and their vectors.

1. Tuberculosis is caused by a bacterium that commonly affects the lungs but can attack almost any organ.9 < (affects: 0.500), (attack: 0.281), (organ: 0.167), (organs: 0.129), (commonly: 0.110), (bacterium: 0.059), (lungs: 0.039), (caused: 0.036), (attacks: 0.029), (majority: 0.024), (preventable: 0.023), (vast: 0.023), (long: 0.012), (last: 0.011), (transmitted: 0.011), (communicable: 0.009), (ill: 0.009), (crowded: 0.009), (decades: 0.008), (air: 0.007), (workers: 0.007), (poor: 0.006), (highly: 0.003), (infection: 0.003), (sick: 0.003) >

7 DUC-2001 dataset d15c, DOC NO: AP890302-0063
8 There are others, but we emphasize these because of their higher weights.
9 DUC-2001 dataset d15c, DOC NO: AP900521-0063

2. The disease, which attacks the lungs, has long been associated with poor, crowded living conditions.¹⁰
< (poverty: 0.17), (history: 0.17), (crowded: 0.15), (ravaged: 0.143), (living: 0.138), (vengeance: 0.13), (affects: 0.125), (famous: 0.097), (opportunistic: 0.094), (attack: 0.094), (conditions: 0.091), (housing: 0.091), (attacks: 0.088), (cancer: 0.088), (organs: 0.081), (blamed: 0.074), (shortcomings: 0.067), (illness: 0.057), (organ: 0.056), (back: 0.053), (long: 0.049), (combination: 0.049), (socioeconomic: 0.048), (research: 0.048), (europe: 0.044) >

Sentence 1 talks about tuberculosis, its cause and the affected body parts. This is evident from the sentence vector, where “affects”, “organs”, “bacterium”, “lungs” and “caused” are highly weighted. Further observe that the concepts “transmitted”, “communicable” and “infection” are also weighted highly, which tells us about the nature of the disease. This shows that concept combination has enriched the sentence with new information, and we obtain a refined representation of the sentence. Similarly, sentence 2 also talks about tuberculosis (though the word itself does not occur in the sentence), the affected parts, and factors which are socioeconomic in nature. This is evident from the high weights of “illness”, “organs”, “poverty”, “crowded”, “living”, “housing”, “conditions” and “socioeconomic”. We notice that both sentences talk about a common disease and its affected parts; moreover, the first sentence talks about the cause and the other about socioeconomic factors. The distinction can be observed in their respective vectors, where one weighs “bacterium” high whereas the other weighs “poverty” high; it can also be seen in the weight of “poor” in sentence 1, which is relatively low. From the above we can conclude that a sentence vector obtained by concept combination has the following properties:

1. It is a concept in the constructed conceptual space encapsulating all the inferential characteristics of the concepts composing the sentence.

2. It is highly enriched, which provides more depth to the meaning of the concept.

3. It has a sense of uniqueness and can disambiguate itself from a similar sentence in the given context.

Next, we describe our underlying principle for forming summaries in the conceptual space, based on which we propose two metrics and a redundancy-removal technique to realize the summaries.

3.4 Conceptual Multi-Document Summarization (CMDS)

This section describes the construction of the CMDS system. A schematic overview of the system is shown in Figure 3.2.

10DUC-2001 dataset d15c, DOC NO: AP900215-0031

Figure 3.2 Schematic overview of the complete system

Figure 3.3 Representation of documents and summary in a 3-dimensional conceptual space.

3.4.1 Principle

For a summary S containing l sentences, we define W_{Sj} for the j-th dimension of S as,

W_{Sj} = \sum_{i=1}^{m} \alpha_i \, wt_{ij}

\alpha_i = \begin{cases} 1 & \text{if sentence } i \text{ is present in } S \\ 0 & \text{otherwise} \end{cases}

Then we use the characteristics of the sentence representation to propose the following conjecture for forming a summary.

Conjecture 1. A summary S, however concise it may be, can provide the maximum overview of the documents if it contains those sentences which maximize W_{Sj} for the maximum number of dimensions, given that all the concepts are treated uniformly.

Let S and S′ be collections of l sentences. Let N_S be the number of dimensions for which W_{Sj} > W_{S′j}, and N_{S′} the number of dimensions for which W_{Sj} < W_{S′j}. Suppose N_{S′} > N_S. Since all the concepts are treated equally (given), S′ provides a better overview of the text than S, because S′ contains text that gives more information by maximizing more concepts than S. Figure 3.3 shows a pictorial representation of documents and their summary in a 3-dimensional conceptual space. From this it is apparent that as the area covered by the summary increases, its resemblance to the documents increases, i.e., the meaning of the summary becomes more similar to that of the documents. The summary area can be increased in two ways: first, by taking

more concepts into the summary, and second, by choosing those concepts which have high weights for these dimensions. However, the summary size is restricted, so we adopt the second way of choosing concepts (sentences) with higher dimensional weights. Based on this principle, sentences are scored using the following metrics.

3.4.2 Metrics

1. Rank: In each dimension, sentences are ranked in decreasing order of their weights. Let r_{ij} denote the rank of sentence i along dimension d_j and sc_i the score of sentence i. The score across all dimensions is then computed as:

sc_i = \sum_{j=1}^{n} \frac{1}{\sqrt[x]{r_{ij}}}, \quad x \in [1, 8] \qquad (3.4)

For every dimension, the inverse of the x-th root of the rank is added to the score.

2. Weight: The weight of a sentence along a given dimension directly represents its strength for that dimension. The score is computed by merging the weights of a sentence over all its dimensions:

sc_i = \sum_{j=1}^{n} \sqrt[y]{wt_{ij}}, \quad y \in [1, 5] \qquad (3.5)

For every dimension, the y-th root of the weight is added to the score.
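Both metrics can be sketched over the normalized sentence-by-dimension matrix S of Section 3.3.3. The vectorized ranking below is our implementation choice, and the parameters merely illustrate the admissible ranges of x and y.

```python
import numpy as np

def score_sentences(S, x=None, y=None):
    """Rank metric (Eq. 3.4) when x is given, weight metric (Eq. 3.5) when
    y is given; exactly one of the two is active at a time."""
    if x is not None:
        order = np.argsort(-S, axis=0)          # per-dimension ordering
        ranks = np.empty_like(order)
        rows = np.arange(S.shape[0])
        for j in range(S.shape[1]):
            ranks[order[:, j], j] = rows + 1    # rank 1 = highest weight
        return (1.0 / ranks ** (1.0 / x)).sum(axis=1)
    return (S ** (1.0 / y)).sum(axis=1)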

3.4.3 Redundancy Removal

Redundancy in a summary should be minimal. In order to create non-redundant summaries, the concepts covered in the summary are removed from the conceptual space. This reduces the dimensionality of the space, and as a result concept vectors are represented as

c_i = < wt_{i1}, ..., wt_{ij}, ..., wt_{in} >

such that

\nexists \, d_j for which d_j \in S \wedge d_j \in CS

Hence, further scoring of sentences is done over the remaining dimensions. This reduces the search space and selects sentences encapsulating the remaining concepts, covering all the topics and making the summary non-redundant. Selected sentences will not be ranked again, as the new search space does not contain any of their constituent concepts. Algorithm 1 describes the summary formation procedure, and Algorithm 2 the score-update function. Note that, at a given time, only one of the two metrics is used; the other is assigned the value 0.

Algorithm 1 Summary Formation
1: Input:
   • The sentence set: Set_s = [s_1, s_2, ..., s_m]
   • The word set: Set_w = [w_1, w_2, ..., w_n]
   • Sentence vectors: V = [v_1, v_2, ..., v_m]
   • Summary size limit: L
   • Root of rank: x
   • Root of weight: y
2: Output:
   • Set of summary sentences: S ⊆ Set_s
3: Procedure:
4: initialize sc;
5: initialize wordFlag ← {False}^n;
6: while size(S) < L do
7:   sc ← {0}^m;
8:   for i ← 1, n do
9:     if wordFlag[i] ≠ True then
10:      sc ← UpdateScores(V_{*i}, sc, x, y);
11:    end if
12:  end for
13:  i ← indexOfMaxScore(sc);
14:  S ← S + s_i;
15:  for all w ∈ s_i do
16:    wordFlag(indexOf(w)) ← True;
17:  end for
18: end while
19: return S
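For reference, a compact Python rendering of Algorithm 1 under two assumptions of ours: the weight metric (Eq. 3.5) is the active one, and summary size is counted in words. Masking the columns of covered concepts plays the role of wordFlag.

```python
import numpy as np

def form_summary(sentences, S, word_index, L, y=2):
    """Greedy summary formation (Algorithm 1): score sentences over the
    dimensions not yet covered, take the best, then mark the dimensions of
    its words as covered so later iterations score only remaining concepts."""
    covered = np.zeros(S.shape[1], dtype=bool)        # wordFlag
    chosen, size = [], 0
    while size < L and len(chosen) < len(sentences):
        masked = S.copy()
        masked[:, covered] = 0.0                      # drop covered concepts
        if chosen:
            masked[chosen, :] = 0.0                   # never re-select
        scores = (masked ** (1.0 / y)).sum(axis=1)    # weight metric
        i = int(np.argmax(scores))
        chosen.append(i)
        size += len(sentences[i])
        for w in sentences[i]:
            if w in word_index:
                covered[word_index[w]] = True
    return [sentences[i] for i in chosen]
```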

3.5 Experimental Setup

In this study, we used the standard summarization datasets DUC 2001 and DUC 2002 for evaluation. These datasets were chosen because standard human summaries are available for them; importantly, those summaries were built to evaluate generic summarization tasks¹¹, so the datasets can be used to evaluate any generic text summarizer. DUC 2001 and DUC 2002 contain 30 and 60 document sets respectively, with 10 news articles in each set. Sentences in DUC 2001 were separated manually; for DUC 2002, they were separated by NIST. Stopwords were removed before summarization, based on the list provided by MIT.¹² For evaluation purposes, DUC 2001 provides 4 human summaries and DUC 2002 provides 2, at sizes of 50, 100, 200 and 400 words. Note that DUC 2002 does not contain human summaries of size 400 words.

11 http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html and http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html
12 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

Algorithm 2 Update Scores
1: Input:
   • Weight vector (= V_{*i}): Wt = [wt_1, wt_2, ..., wt_m]
   • Sentence scores: sc = [sc_1, sc_2, ..., sc_m]
   • Root of rank: x
   • Root of weight: y
2: Output:
   • Updated sentence scores: sc = [sc_1, sc_2, ..., sc_m]
3: Procedure:
4: initialize r ← ranks of the sentences along this dimension;
5: if x > 0 then
6:   for i ← 1, m do
7:     sc_i ← sc_i + 1 / \sqrt[x]{r_i};
8:   end for
9: else if y > 0 then
10:  for i ← 1, m do
11:    sc_i ← sc_i + \sqrt[y]{wt_i};
12:  end for
13: end if

We have considered the results on DUC 2001 more reliable because they are evaluated against more human summaries. All evaluation scores are computed using ROUGE, which has been widely used by DUC to evaluate system summaries. We chose the automatic evaluation measures ROUGE-1, ROUGE-2 and ROUGE-SU4, which compute unigram recall, bigram recall and skip-bigram overlap¹³ respectively. We conducted the following experiments:

1. Intrinsic experiments

• Variation of the window size (K) between 1 and 9.

• Variation of the x-th root between 1 and 8 for the rank metric.

• Variation of the y-th root between 1 and 5 for the weight metric.

2. Extrinsic evaluation

• Comparison of CMDS summaries with previous state-of-the-art systems (briefly described later).

13 Skip-bigrams are pairs of words that allow arbitrary gaps but preserve sentence order. ROUGE-SU4 allows a skip distance of 4 words.

3.6 Results and Discussion

3.6.1 Intrinsic Experiments

3.6.1.1 Effect of variable window size:

In this section we discuss the quality of summaries when the window size K is varied. Recall that window size is an intrinsic parameter of HAL which governs the number of co-occurrence relations captured; however, a longer window may form false associations between words. Figure 3.4 shows the ROUGE scores when K is varied between 1 and 9. A significant rise in summary quality can be observed from K = 1 to K = 3. The rate of change then decreases and gradually stabilizes after K = 5. We have kept K = 6 for creating summaries, as it gives slightly improved results.


Figure 3.4 Summary quality v/s window size (K)

3.6.1.2 Effect of variable metrics:

Sentence scoring is an important part of summarization; in CMDS it is influenced by two metrics, rank and weight. Here we discuss the behavior of the summaries when the metric parameters are varied, and finally arrive at preferred values. Figure 3.5 shows the ROUGE scores when x is varied between 1 and 8. We observe that for longer summaries both the DUC 2001 and DUC 2002 datasets show similar behavior. In the first dataset, summary quality improves gradually as x is increased from 1 to 5, and is constant or decreasing for x between 5 and 8. Similarly, for the second dataset quality is lower on either side of x = 4. These observations indicate that optimal summaries are obtained when x lies between 4 and 5. For shorter summaries more variation is observed on DUC 2002, but the trend remains similar to that of longer summaries, with optimal summaries obtained when x lies between 5 and 6. From these observations we conclude that x = 5 achieves optimal quality for both longer and shorter summaries.


Figure 3.5 Summary quality v/s xth root of rank metric

Figure 3.6 shows the ROUGE scores when y is varied between 1 and 5. For most of the summaries, optimal quality is achieved at y = 2 and remains roughly constant for higher values (3, 4 and 5). For higher roots of the weight, sentence scores are closer to each other; this helps because we are adding scores over a large number of dimensions (equal to the total number of distinct words in the documents). Thus, for optimal summary formation using CMDS we use x = 5 for the rank metric and y = 2 for the weight metric.

3.6.2 Extrinsic Evaluation

For extrinsic evaluation we compared our system to previous state-of-the-art systems, briefly described below:

1. Random (baseline): Random sentences are selected for the summary.

2. LSA [17]: SVD is applied on the term-by-sentence matrix and the highest-ranking sentences are selected for the summary.

33 DUC 2001 0.5 0.12 0.18 0.45 0.1 0.16 0.4 0.14 0.35 0.08 0.12 (a) (b) 0.3 (c) 0.06 0.1 ROUGE−I Scores 0.25 ROUGE−II Scores 0.08 0.04 ROUGE−SU4 Scores 0.2 0.06 0.02 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 y (Root of Weight) y (Root of Weight) y (Root of Weight)

DUC 2002 0.15 0.09 0.4 0.08 0.35 0.07

0.3 0.06 0.1 (d) (e) (f) 0.05 0.25 ROUGE−I Scores ROUGE−II Scores

0.04 ROUGE−SU4 Scores 0.2 0.03 0.05 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 y (Root of Weight) y (Root of Weight) y (Root of Weight)

Figure 3.6 Summary quality v/s yth root of weight metric

3. TR-* [44]: A graph-based¹⁴ text summarization method called TextRank. This method treats sentences as graph nodes, then uses modified HITS [29], Positional Power Function [21] and PageRank [50] algorithms to rank the sentences. Top-ranked sentences are taken into the summary.

4. ClusterHITS [61]: A graph-based text summarization method in which topic clusters are considered hubs and sentences authorities; the HITS algorithm is then used to rank the sentences, and the top-ranked sentences are taken into the summary.

5. DSDR-nonlin [20]: A data-reconstruction-based text summarization approach that generates the summary giving the best reconstruction of the original documents, using nonnegative linear reconstruction, which allows only additive (not subtractive) linear combinations.

Table 3.1 shows ROUGE-2 scores for all the systems on the DUC 2001 and DUC 2002 data. Our system (CMDS) outperforms all the other systems, and the improvement grows with summary size. Figure 3.7 shows all the ROUGE scores for all the systems graphically; it can be observed that our system performs better for longer summaries. We attribute this to two reasons: first, sentence scoring favors sentences that give maximum coverage of the conceptual space, which is more probable for longer sentences; second, redundancy removal further helps by capturing as many concepts as the summary size allows. These evaluations show that CMDS is more effective at generating longer summaries while being comparable for shorter ones.

14 The undirected version of the graph is used because direction cannot be decided between sentences of different documents.

Table 3.1 Average F-measure (ROUGE-2) scores for various state-of-the-art systems

                 DUC 2001 (summary size in words)       DUC 2002 (summary size in words)
System           50       100      200      400         50       100      200
Random           0.01639  0.03292  0.0452   0.08138     0.02227  0.03835  0.06043
LSA              0.02641  0.03928  0.06158  0.08844     0.03072  0.04427  0.06703
TR(HITS)         0.03659  0.05986  0.07598  0.10414     0.05124  0.05941  0.08394
TR(PPF)          0.0361   0.04597  0.069    0.09753     0.04955  0.05438  0.07091
TR(PageRank)     0.03237  0.05442  0.07692  0.10772     0.0475   0.06397  0.08709
ClusterHITS      0.03907  0.05234  0.07457  0.09648     0.04949  0.05879  0.08026
DSDR-nonlin      0.02638  0.04721  0.06862  0.1021      0.02933  0.04674  0.07541
CMDS             0.03971  0.06155  0.08154  0.11467     0.05209  0.0682   0.09575

3.7 Summary and Conclusion

In this work we have used an inferential space to find the most informative summary for a set of documents. The space has the ability to evolve by combination, such that more refined and contextually clear concepts can be obtained. Based on these characteristics, we formulated a conjecture suggesting that an ideal summary should encapsulate as many of the concepts present in the documents as possible. Following this, we introduced two scoring metrics to score the sentences. These scoring metrics produced quality summaries, as verified by the extrinsic experiments, while the intrinsic experiments give an insightful picture of the effect of the scoring parameters on summary quality. This work contributes in the following ways: first, an extension of the concept-combination characteristic of conceptual space to define sentences; second, a proposition of heuristics for combining concepts which underlie this extension and differ from previous approaches; third, a novel theoretical framework for summary formation with experimentally estimated parameters. We conclude that the conceptual space is an efficient cognitive model for representing text. Its characteristics allow us to solve various problems, summarization being one of them; this is demonstrated by Figure 3.8, which shows the final summary for the d15c document set in the DUC 2001 dataset.


Figure 3.7 Graphical representation of ROUGE scores for all the systems

1. The doctors warned that besides being at risk of getting tuberculosis themselves, AIDS-infected addicts who carry the TB bacteria also may pass the germs to people they live with, to health care workers and other people.
2. The agency estimated that between 15 million and 20 million adults will be infected with HIV by the year 2000, and it predicted that the number of cases and deaths from tuberculosis will rise sharply as a result, especially in sub-Saharan Africa, Latin America and Southeast Asia.
3. The health department said it is providing tuberculosis testing and treatment for the Human Resources Administration's program for the homeless, and will train staff members on tuberculosis prevention and control.
4. The U.N. agency, in its first comprehensive look at global tuberculosis in a decade, said the disease kills nearly 3 million people a year, most of them between the ages of 15 and 59, "the segment of the population that is economically most productive".
5. While most people with the AIDS virus eventually go on to get acquired immune deficiency syndrome, people who carry the tuberculosis bacteria ordinarily have only about a 10 percent life-long risk of getting TB.
6. Snider said 10 million to 15 million Americans have been infected with the tuberculosis germ, but only a small percentage of them develop the disease because their immune system was strong enough to prevent the disease from developing.
7. The department also has an established residence for homeless tuberculosis patients, and is working with substance-abuse treatment services to extend tuberculosis prevention in its programs.
8. The Board of Health approved a resolution last year requiring all children entering city schools to be tested.
9. Seven of the eight TB cases occurred in people who were already infected with tuberculosis bacteria before the study began.
10. NEW YORK – The incidence of active tuberculosis cases in the city rose 38% in 1990, to 3,520 cases, according to the health commissioner.

Figure 3.8 Final CMDS Summary for DUC2001: d15c

Chapter 4

Multilingual Multidocument Text Summarization

Knowledge cannot be bound by languages; however, its expression and expansion depend on its linguistic source. People relate to a text more easily if it is written in their native language. This means that information present in the most common language will be accepted by a larger part of society, while information in lesser-known languages is left out by the greater part of society. Hence, a tool that can cover and extract information from multiple linguistic sources can give diverse and complete information. Summarization aims to give a complete overview of a set of (topically similar) documents. Language-bound methods can handle text in a single language; they may be highly accurate for their domain language but cannot work on documents in different languages. Thus, summarization methods that can generate summaries from text in multiple languages are both necessary and increasingly needed.

Most of the previous approaches use clustering and translation of documents to form the summary. The basic idea of these techniques [5, 12, 6] is to collect similar information together by clustering within every language, then find similar clusters across languages by translating the clusters of one language and identifying the corresponding clusters in another. The final output is produced in the user's desired language by substituting sentences in other languages with a similar sentence in the required language. Chen et al. [5] have shown that translation after clustering performs better than translation before clustering.

However, earlier approaches do not address an important aspect: what is the effect of "added noisy information" on summary quality? The noise originates from the interaction of two languages, either by translation or transliteration. The added information originates from the fact that different documents can contain different information; if the information differs across languages, then a multi-lingual summary should encapsulate all of it. Figure 4.1 depicts the concept of noisy information: the common (overlapping) part usually contains the general overview of the topic covered in both documents along with some key points, while the dotted part represents the new information which should be incorporated in the summary.

Figure 4.1 Added Noisy Information

4.1 Multilingual Summarization using Jensen-Shannon Divergence

Final Summary = \frac{(1-\alpha)}{d} S(en) + \frac{(1-\beta)}{d} S(hi) + \frac{(1-\gamma)}{d} S(te)

Figure 4.2 Architecture of the system

Figure 4.2 shows the system architecture. As shown, the sets of English, Hindi and Telugu documents are translated using Google translation. Then, the original documents and the translated documents are summarized using generic summarizers for each language. For each summary, the Jensen-Shannon divergence from its input is calculated, giving the divergence scores α, β and γ for the English, Hindi and Telugu summaries respectively. The final multi-lingual summary is then composed from the English, Hindi and Telugu summaries with weights proportional to (1 − α), (1 − β) and (1 − γ).

4.1.1 Translation

The Google translation tool offered all 6 translation utilities: [en → hi], [hi → en], [en → te], [te → en], [hi → te] and [te → hi]¹. According to an analysis of Google Translate by Aiken et al. [22], translations among Western languages are generally best, while those among Asian languages are often poor. Furthermore, the average² BLEU [52] score for the English-Hindi pair was 9.5, which is significantly low compared to the English-French pair with a BLEU score of 91. However, they did not report BLEU scores involving Telugu. From manual analysis³ we found that translations involving Telugu are low in quality; translation involving Telugu was also in its initial development stages (beta phase). One major drawback of the translation system is that even though translations of individual words are correct, their placement in the sentence is not always right. Another problem is that some non-named entities were transliterated instead of translated, which adds noise for the summarization system. In this work we do not try to address each translation-based issue individually; instead, we analyze which summarization approach is more robust and can handle all the issues collectively.

4.1.2 Generic Summarizers

The following state-of-the-art systems were built to form the summaries:

1. LSA [17]: SVD is applied on the term-by-sentence matrix and the highest-ranking sentences are selected for the summary based on their respective eigenvalues.

2. CMDS: The approach uses HAL [40], a matrix representing the co-occurrence pattern of words in a text. The cumulative co-occurrence score is obtained by accumulating the scores between two words over the whole document while moving a window of size K.⁴ The co-occurrence score between two words (w_i, w_j) at a distance k is the product of (K − k + 1) and the frequency of their occurrence at distance k:

Score(w_i|w_j) = \sum_{k=1}^{K} n_k \cdot (K - k + 1) \qquad (4.1)

Concept combination [59] is used to create the final sentence vectors, which are then used to form the final summaries using the rank and weight metrics.

3. TR-PR [44]: An effective graph-based text summarization method called TextRank. In this approach sentences represent vertices (V) and weighted edges are established using a similarity metric⁵

1 en - English, hi - Hindi, te - Telugu.
2 Calculated by averaging the BLEU scores of the two translation directions, e.g. English to Hindi and Hindi to English.
3 An annotator was asked to manually identify the correctness of translations involving the Telugu language.
4 The value of K is set to 5 for this work.

between the two sentences (w_{ij}). Then, the following PageRank algorithm (modified to PR^W) is used to rank the sentences:

PR^W(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{kj}} \, PR^W(V_j) \qquad (4.2)

Here, d is a parameter that is set between 0 and 1. Top ranked sentences are taken into the summary.
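A self-contained sketch of this weighted PageRank follows; the damping value and fixed iteration count are conventional choices of ours rather than parameters reported in [44].

```python
def weighted_pagerank(sim, d=0.85, iters=50):
    """Iterate Eq. 4.2 over a symmetric sentence-similarity matrix sim,
    where sim[j][i] is the edge weight w_ji; returns one score per sentence."""
    n = len(sim)
    pr = [1.0] * n
    out_sum = [sum(row) or 1.0 for row in sim]    # sum_k w_kj for node j
    for _ in range(iters):
        pr = [(1 - d) + d * sum(sim[j][i] * pr[j] / out_sum[j]
                                for j in range(n))
              for i in range(n)]
    return pr
```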

4. ClusterHITS [61]: This summarization method uses a cluster-based HITS model where topic clusters are considered hubs and sentences authorities; the HITS [29] algorithm is then used to rank the sentences, and the top-ranked sentences (in decreasing order of authority score) are taken into the summary.

4.1.3 Jensen-Shannon (JS) Divergence

The Jensen-Shannon divergence between two probability distributions shows their dissimilarity. We use it to measure the divergence of a summary from its input: the divergence score indicates the quality of a summary, and lower values indicate better summaries, based on the assumption that good summaries are characterized by low divergence between the probability distributions of words in the input and the summary, and by high similarity with the input. The JS divergence between two probability distributions P and Q is given by,

J(P||Q) = \frac{1}{2}\left[D(P||A) + D(Q||A)\right] \qquad (4.3)

where,

D(P||Q) = \sum_w p_P(w) \log_2 \frac{p_P(w)}{p_Q(w)}

A = \frac{P+Q}{2}

Here, P represents the summary and Q the input documents to which the summary is compared; p_P(w) and p_Q(w) denote the probability of word w in P and Q respectively. Low divergence values therefore indicate better summaries, and the JS divergence is always defined and bounded between 0 and 1.

4.1.4 Final Summary

The final multi-lingual summary is composed of sentences from all three summaries⁶. The content taken from each summary is governed by its JS divergence score: summaries

5 Similarity between two sentences is proportional to the number of common words.
6 The sentences in these summaries are already ordered in decreasing order of their scores.

having low scores should contribute more to the final summary. Based on this principle, we generated the final summaries in the following manner.

Let α, β and γ denote the JS divergence for the English, Hindi and Telugu documents respectively, and let S(l) denote the summary in language l. Then the final summary is given by,

Final\ Summary = \frac{(1-\alpha)}{d} S(en) + \frac{(1-\beta)}{d} S(hi) + \frac{(1-\gamma)}{d} S(te) \qquad (4.4)

where, d = (1 − α) + (1 − β) + (1 − γ)

Note that (1 − divergence) is an effective weight because a higher value indicates a better monolingual summary for that language. Normalization by d is necessary so that all the weights are comparable and sum to 1.

4.1.4.1 Redundancy Removal

Reducing redundancy is an important part of forming summaries, and it becomes even more important in multi-lingual summarization, because the monolingual summaries have been generated individually for each language over overlapping information. As explained earlier, each sentence has its translation in the other languages (here, two). These translations are used to reduce redundancy in the summaries. The summary S is initialized with the summary in the user's language (say l), scaled by the corresponding language weight ((1 − α)/d, (1 − β)/d or (1 − γ)/d). We then translate the summaries in the other languages into l using a Translate(l, s) function, where the parameter l gives the target language and s is the sentence to be translated, and calculate the Jaccard similarity between each translated sentence and the summary.

JaccardSimilarity(s, S) = \frac{|\{w \mid w \in s \wedge w \in S\}|}{|\{w \mid w \in s \vee w \in S\}|} \qquad (4.5)

A sentence is added to the summary if its similarity value is below a threshold δ. The values of δ were chosen for each system so as to give minimum divergence on the non-translated input. Algorithm 3 shows the pseudo-code for forming the final summary; all the sentences in the final summary are in the user's preferred language.
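The redundancy check itself reduces to a few lines of Python; the threshold value below is a placeholder, since δ was tuned per system as described above.

```python
def jaccard_similarity(sentence, summary):
    """Jaccard similarity (Eq. 4.5) between word sets."""
    s, big_s = set(sentence), set(summary)
    return len(s & big_s) / len(s | big_s) if s | big_s else 0.0

def add_if_novel(summary_words, candidate, delta=0.5):
    """Append a (translated) candidate sentence only if its overlap with the
    summary so far stays below delta; delta = 0.5 is illustrative."""
    if jaccard_similarity(candidate, summary_words) < delta:
        summary_words.extend(candidate)
        return True
    return False
```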

Algorithm 3 Summary Formation
Input:
• User language: l // assuming English
• English summary: S_en
• Hindi summary: S_hi
• Telugu summary: S_te
• Summary weights: α, β, γ
Output:
• Summary: S ⊂ (S_en ∪ S_hi ∪ S_te)
Procedure:
initialize S ← ((1 − α)/d) S_en
initialize TS_hi ← NULL
initialize TS_te ← NULL
for all s ∈ S_hi do
    TS_hi ← Translation(en, s)
end for
i ← 1
while size(S) < ((1 − α)/d)|S_en| + ((1 − β)/d)|S_hi| do
    if JaccardSimilarity(TS_hi[i], S) < δ then
        S ← S + TS_hi[i]
    end if
    i ← i + 1
end while
for all s ∈ S_te do
    TS_te ← Translation(en, s)
end for
i ← 1
while size(S) < ((1 − α)/d)|S_en| + ((1 − β)/d)|S_hi| + ((1 − γ)/d)|S_te| do
    if JaccardSimilarity(TS_te[i], S) < δ then
        S ← S + TS_te[i]
    end if
    i ← i + 1
end while
return S

4.2 Dataset and Evaluation Metric

The dataset consists of 10 news topics, each with 5 English, 5 Hindi and 5 Telugu documents. After translation, 10 documents are added to every language for each topic; finally, we have a dataset of 10 topics with 45 documents per topic, and the experiments were conducted on these 450 documents. In general, summaries are evaluated by comparing them with human-generated summaries; however, generating human summaries for multilingual summarization is a challenging task, since the annotator must be proficient in all the languages for which summaries are being generated. So we had to use evaluation methods that do not involve human summaries. Louis et al. [38] showed that Jensen-Shannon divergence can be used to evaluate summaries without human models, and that the rankings produced by the Jensen-Shannon measure correlate with those produced by ROUGE-2 and ROUGE-SU2 [37]. The method was further studied by Saggion et al. [55], who found JS divergence to be an effective metric for evaluating multi-lingual summaries.

4.3 Experiments

The experiments were divided into three categories:

1. Monolingual summarization: generating summaries in a single language.
2. Bilingual summarization: generating summaries using document sets from two languages.
3. Trilingual summarization: generating summaries using the documents of all three languages.

Each category was further subdivided into classes based on the evaluation criteria, as follows.

Case 1 (C1): Summaries are generated from the original set of documents and evaluated against the original set of documents.
Case 2 (C2): Summaries are generated from the original and added sets of documents and evaluated against the original set of documents.
Case 3 (C3): Summaries are generated from the original and added sets of documents and evaluated against the original and added sets of documents.

There is one more conceivable case, in which summaries generated from the added documents alone are evaluated against the original documents. However, an accurate summary cannot be obtained without taking the original documents into account, and the added documents are highly noisy due to inaccurate translation, so this case was ignored. Comparing the first and second cases tells us whether the added information is useful for summary generation, and the results of this comparison validate the correctness of our approach. The third case allows us to understand the effect of noise on the various summarization systems and to decide which systems are more robust to noisy information.

4.4 Results and discussion

Table 4.1 JS Divergence of monolingual summaries

Case | CMDS  | Cluster-HITS | LSA   | TR-PR
C1   | 0.28  | 0.275        | 0.28  | 0.264
C2   | 0.293 | 0.305        | 0.286 | 0.274
C3   | 0.334 | 0.340        | 0.33  | 0.325

Table 4.1 shows the divergence scores of monolingual summaries, averaged over all three languages; for readability, the best scores are shown in bold in all tables. Comparing cases C1 and C2 shows that, for monolingual summarization, adding information decreases summary quality. Comparing C2 and C3 shows that divergence grows as more languages are added. We believe this decline occurs because the added translated documents contribute more noise than information, which negatively affects various document characteristics and, in turn, summary quality. The C3 divergence scores show that the TextRank algorithm is the most robust against noise and performs slightly better on noisy documents. Overall, these results indicate that the additional information is not helpful for monolingual summarization; instead, it acts as a noise factor reducing summary quality for all the systems.

Table 4.2 JS Divergence of bilingual summaries

Case | CMDS  | Cluster-HITS | LSA   | TR-PR
C1   | 0.36  | 0.361        | 0.367 | 0.354
C2   | 0.32  | 0.344        | 0.329 | 0.331
C3   | 0.379 | 0.39         | 0.384 | 0.372

The divergence scores in Table 4.2 show that adding information improves the quality of bilingual summaries for all systems (compare C1 and C2). We believe the improvement is the result of the added information: although noisy, it fills information gaps present in the individual-language documents, allowing more holistic summaries that cover more information, and it may also reinforce important information present in both document sets. From a system-comparison perspective, the average scores (over C1, C2 and C3) are best for CMDS, and the relative difference between C1 and C2 (0.04) is highest for CMDS. The average C3 divergence scores again show that the TextRank algorithm is the most robust against noise.

Table 4.3 JS Divergence of trilingual summaries

Case   CMDS    Cluster-HITS   LSA     TR-PR
C1     0.397   0.393          0.404   0.392
C2     0.341   0.358          0.353   0.34
C3     0.412   0.421          0.419   0.398

Table 4.3 presents the divergence scores for trilingual summaries. The scores show that, for trilingual summarization, adding information improves summary quality for all the systems. We believe the reasons for this behavior are the same as for bilingual summarization. Also note that the difference in divergence between C1 and C2 is larger for trilingual summaries than for bilingual summaries across all systems. This consolidates the observation that added information improves summary quality for multi-lingual summarization, and shows that the added information works better as the number of languages increases. From a system-comparison perspective, the scores of TextRank are slightly better than those of CMDS; the two systems perform similarly on the original set of documents, but C3 shows that TextRank is the most robust against noise. From the above observations, we find that bilingual and trilingual summarization respond positively to the added noisy information, whereas the opposite is observed for monolingual summarization. We also find that LSA performs reasonably well for monolingual summarization but not for multi-lingual summarization; it is possible that the ranking of latent topics differs across languages, causing a mismatch when computing multi-lingual summaries. TextRank, on the other hand, is reasonably robust against noise in the multi-lingual setting, because it uses a similarity measure that does not depend on sentence structure. Similarly, CMDS uses co-occurrence patterns of words, which are not highly dependent on sentence structure, and a sufficiently large K can overcome that problem.

4.5 Conclusion and Future Work

Summarizing documents in different languages is not possible with existing mono-lingual techniques, and existing multilingual techniques assume the presence of parallel sentences across the documents in different languages. In this work, we have proposed an architecture for multi-lingual summarization that does not make the parallel-sentence assumption. We focus on incorporating the different information found in different languages into the final system summaries. The system can be scaled to any number of languages, provided translations are available for them. It uses Jensen-Shannon divergence both for scoring and for combining the summaries. The divergence between a summary and its corresponding input is a strong indicator of the summary's inferential quality: good summaries usually have low divergence from their input, and divergence-based measures have proven competitive with recall-oriented measures (like ROUGE) for evaluation. The approach is effective, as validated by our experimental results (evaluated using Jensen-Shannon divergence). We found that increasing the number of languages benefits summary quality: information in different languages patches up the incomplete information in the others. Our approach shows that divergence-based measures can help solve the problem of multilingual summarization. Co-occurrence and similarity based representations are robust to noise, and further improvements in translation systems will improve the quality of multi-lingual summaries.
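To make the scoring concrete, the following is a minimal sketch of the Jensen-Shannon divergence between a summary and its input, both modeled as unigram distributions. The function and variable names are illustrative, not our implementation.

import math
from collections import Counter

def word_distribution(text):
    # Lowercased unigram probabilities of a text.
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    # JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

summary_dist = word_distribution("summary text here")
source_dist = word_distribution("full input document text here")
print(js_divergence(summary_dist, source_dist))  # lower = closer to the input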

One of the major challenges in multilingual summarization is the evaluation of system summaries. There is no standard tool to evaluate multilingual summaries; JS divergence is the only available method, and manual evaluation is an expensive and tedious alternative. Hence, there is an urgent need for multi-lingual evaluation tools for summarization. In conclusion, the existing evaluation does not let us say definitively which method suits multi-lingual summaries the most; however, we are confident that introducing more languages improves the quality of summaries.

Chapter 5

Summarization of Online Conversations in the Domain of Debates

In this chapter we apply summarization to another domain, namely online conversations. We show that HAL features, when combined with other features, are useful for a specific summarization task, and that HAL features work effectively even alongside features that are not present in the HAL space. The domain of social networking offers many summarization challenges: most of the text in social media is sentiment-rich and useful for sentiment mining. We use sentiment-rich features along with HAL features and find that they perform better than previous methods that rely only on sentiment-based features. Within the domain of online conversations we focus on debates. An online debate forum is a platform where people take a stance and argue in support of or in opposition to debate topics. An important property of such forums is that they are dynamic and grow rapidly; in such situations, effective opinion summarization approaches are needed so that readers need not go through the entire debate. This domain differs from chats and casual conversation in being more formal and focused on specific topics: an argument may contain varied factual knowledge, but it is usually related to one topic or another. Debates also differ from news and blogs in being comparatively rich in sentiment.

5.1 Approach Used

Summaries are generated by extracting the most relevant Dialogue Acts (DAs)1 from the original documents. The relevance of a DA is calculated by modeling various aspects of it and comparing them against those of other DAs; we refer to these aspects as features in the rest of the text. The features considered cover morphological structure, topic relevance, sentiment relevance and document relevance. Table 5.1 lists the set of sentence features used to rank the DAs; we describe each feature and its calculation in the following subsections.

1A Dialogue Act is the smallest unit of a debate.

Feature Category                     Feature Names
Topic Relevance                      Topic Directed Sentiment Score, Topic Co-occurrence
Document Relevance                   tf-idf Sentiment Score
Sentiment Relevance                  Number of Sentiment Words, Sentiment Strength
Positional and Coverage Relevance    Sentence position, Sentence length

Table 5.1 Sentence features used to rank Dialogue Acts

5.1.1 Calculating Topic Relevance

Debate posts express users' opinions towards debate topics. Therefore, sentences which provide information or express opinion about the debate topics are the most important in the context of debate summaries. We use topic directed sentiment scores and a topic co-occurrence measure to capture the topic relevance of the DAs.

5.1.1.1 Topic Directed Sentiment Score

DAs carrying topic-related sentiments are very important in the context of online debates. They represent the sentiments a DA directs toward the debate topics and are thus a key feature in the task of debate summarization. In the proposed approach, the sentiment score directed towards debate topics is calculated using dependency parses of the DAs and the sentiment lexicon SentiWordNet [2]. Pronoun references are resolved using the Stanford co-reference resolution system [33]. Then, using the Stanford dependency parser [8], each DA is represented as a tree where each node represents a DA word storing its sentiment score and the edges represent dependency relations. Each DA word is looked up in SentiWordNet and the sentiment score calculated with Algorithm 4 is stored in the word's tree node.

Algorithm 4 Word Sentiment Score
1: S ← senses of word W
2: wordScore ← 0
3: for all s ∈ S do
4:    s_score ← s_posScore − s_negScore
5:    wordScore ← wordScore + s_score
6: end for
7: wordScore ← wordScore / |S|

SentiWordNet is a lexical resource used for opinion mining. It stores positive and negative sentiment scores for every sense of every word present in WordNet [13]. For words missing from SentiWordNet, the average of the sentiment scores of their synset member words is stored in the word's tree node; if no such scores are available, a zero sentiment score is stored. If a word is modified by a negation word such as 'never', 'not' or 'nonetheless', its sentiment score is negated.
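As an illustration, the following is a minimal sketch of Algorithm 4 on top of NLTK's SentiWordNet interface (assuming the 'sentiwordnet' and 'wordnet' corpora have been downloaded); our system's binding to the lexicon may differ.

from nltk.corpus import sentiwordnet as swn

def word_sentiment_score(word):
    # Average (positive - negative) score over all senses of the word.
    senses = list(swn.senti_synsets(word))
    if not senses:
        return 0.0  # word missing from SentiWordNet
    total = sum(s.pos_score() - s.neg_score() for s in senses)
    return total / len(senses)

print(word_sentiment_score("great"))   # positive
print(word_sentiment_score("cruel"))   # negative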

In noun phrases like 'great warrior' or 'cruel person', the first word, being an adjective of the latter, influences its sentiment score. Thus, based on the semantic significance of the dependency relation each edge holds, the sentiment score of a parent node is updated with those of its child nodes using Algorithm 5. In DAs like "Batman killed a bad guy.", the sentiment score of the word 'Batman' is affected by the action 'kill'. Thus, for verb-predicate relations such as 'nsubj', 'dobj', 'cobj' and 'iobj', predicate sentiment scores are updated with the verb scores using Algorithm 5.

Algorithm 5 Update Word Sentiment Score
1: node ← word's tree node
2: children ← word's child nodes
3: for all c ∈ children do
4:    updateScore(c)
5:    node_score ← sign(node_score) * (|node_score| + c_score)
6: end for

The tree structure and the recursive nature of Algorithm 5 ensure that the sentiment scores of child nodes are updated before those of their parents. Table 5.2 lists the semantically significant dependency relations used to update parent node scores.

Modification Type   Dependency Relations
Noun Modifying      nn, amod, appos, abbrev, infmod, poss, rcmod, rel, prep
Verb Modifying      advmod, acomp, advcl, ccomp, prt, purpcl, xcomp, parataxis, prep

Table 5.2 List of Dependency Relations
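The following toy sketch illustrates the recursion of Algorithm 5 on a hand-built dependency tree; the Node class, tree structure and scores are illustrative, not our system's data structures.

class Node:
    def __init__(self, word, score, children=None):
        self.word, self.score = word, score
        self.children = children or []

def update_score(node):
    # Depth-first: children are updated before their scores are folded into
    # the parent, preserving the parent's sign (line 5 of Algorithm 5).
    for child in node.children:
        update_score(child)
        sign = 1.0 if node.score >= 0 else -1.0
        node.score = sign * (abs(node.score) + child.score)

# Toy tree for "Batman killed a bad guy" (structure illustrative only).
guy = Node("guy", -0.1, [Node("bad", -0.5)])
root = Node("killed", -0.4, [Node("Batman", 0.0), guy])
update_score(root)
print(root.score)  # ≈ -0.8 after folding in both subtrees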

Extended Targets (ET): Extended targets are entities closely related to the debate topics. For example, 'Joker' is related to 'Batman', 'Clark Kent' to 'Superman', and 'Darth Vader' and 'Yoda' to 'Star Wars'. To extract the extended targets, we collect named entities (NEs) from the Wikipedia page of each debate topic using the Stanford Named Entity Recognizer [14] and sort them by frequency. Among the top-k (k = 20) NEs, some can belong to both debate topics; for example, 'DC Comics' is common to 'Superman' and 'Batman'. We remove such NEs from the individual lists and treat the remaining NEs as the extended targets (extendedTargets) of the debate topics. Given the list of extended targets for each debate topic and a sentiment score for each DA word, topic directed sentiment scores are calculated for each debate topic using Equation 5.1.

Score_{DA}^{Topic} = \sum_{w \in DA,\ w \in ET(Topic)} Score(w)    (5.1)

We refer to these scores as AScore and BScore, representing the scores directed towards topics A and B respectively in a debate between those two topics.

The absolute values of both topic directed sentiment scores are added, giving the DA's topic directed sentiment score. These scores are normalized by the sum of the topic directed sentiment scores of all the DAs.
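As a sketch, Equation 5.1 amounts to the following; the word scores and extended-target list here are made-up values.

def topic_directed_score(da_words, extended_targets, score):
    # score: dict mapping word -> sentiment score (Algorithm 4 output).
    return sum(score.get(w, 0.0) for w in da_words if w in extended_targets)

scores = {"batman": 0.3, "joker": -0.6, "brave": 0.5}
da = ["batman", "is", "brave", "unlike", "joker"]
et_batman = {"batman", "joker", "gotham"}   # hypothetical ET list for topic A
a_score = topic_directed_score(da, et_batman, scores)
print(a_score)  # 0.3 + (-0.6)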

5.1.1.2 Topic Co-occurrence Measure

The topic co-occurrence measure captures DAs containing high-sentiment words which frequently co-occur with the debate topics. The extended targets described above represent the debate topic entities. The topic co-occurrence measure is computed using HAL via Equation 5.2, combining the co-occurrence strength of the DA words with their sentiment strengths; the sentiment score is calculated using Algorithm 4.

Co\text{-}occur_{DA} = \sum_{w \in DA} \Big( \sum_{t \in ET} HAL(w \mid t) \Big) \cdot sentiScore(w)    (5.2)

The topic co-occurrence measure is normalized by the sum of the co-occurrence scores of all the DAs. We add the topic directed sentiment score and the topic co-occurrence measure to obtain the topic relevance feature score of a DA.
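The following sketch shows Equation 5.2 in code; the HAL table here is a hypothetical stand-in for the learned HAL matrix, and the names are illustrative.

def topic_cooccurrence(da_words, extended_targets, hal, senti_score):
    # Weight each word's sentiment strength by its summed HAL co-occurrence
    # with the extended targets (Equation 5.2).
    total = 0.0
    for w in da_words:
        hal_sum = sum(hal.get((w, t), 0.0) for t in extended_targets)
        total += hal_sum * senti_score.get(w, 0.0)
    return total

hal = {("brave", "batman"): 0.8, ("brave", "gotham"): 0.2}  # HAL(w|t) entries
senti = {"brave": 0.5}
print(topic_cooccurrence(["brave", "is"], {"batman", "gotham"}, hal, senti))
# (0.8 + 0.2) * 0.5 = 0.5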

5.1.2 Calculating Document Relevance

The product of the tf-idf and sentiment scores of its words is used to compute the document relevance of a DA, as in Equation 5.3.

tf\text{-}idf_{DA} = \sum_{w \in DA} tf\text{-}idf(w) \cdot sentiScore(w)    (5.3)

The tf-idf score reflects how important a word is to a document in a collection or corpus, while the sentiment score, carrying the word's sentiment strength, reflects its subjective importance in the context of opinionated DAs. Thus, this feature captures DAs containing highly frequent, sentiment-rich words. The document relevance score over all the debate's DAs is used to normalize the individual scores.
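A sketch of Equation 5.3, using scikit-learn's TfidfVectorizer for the tf-idf part (our system's tf-idf computation may differ); the toy documents and sentiment scores are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["batman is a brave hero", "superman is a strong hero"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # rows: documents, columns: vocabulary terms
vocab = vec.vocabulary_

def document_relevance(da_words, doc_index, senti_score):
    # Sum of tf-idf(w) * sentiScore(w) over the DA words (Equation 5.3).
    total = 0.0
    for w in da_words:
        if w in vocab:
            total += tfidf[doc_index, vocab[w]] * senti_score.get(w, 0.0)
    return total

print(document_relevance(["brave", "hero"], 0, {"brave": 0.5, "hero": 0.4}))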

5.1.3 Calculating Sentiment Relevance

This dimension captures the presence of sentiment-carrying words and their strength in the DAs.

1. sentiCount is the count of sentiment-carrying words in a DA, normalized by the total number of sentiment words present in the debate.

2. The sentiment score of each DA word is calculated using Algorithm 4, and Equation 5.4 computes the DA's sentiment strength. The sentiment score of each DA is normalized by the overall debate's sentiment score.

sentiScore_{DA} = \sum_{w \in DA} sentiScore(w)    (5.4)

The sentiment score and the number of sentiment words in a DA are added to produce the sentiment relevance feature score of the DA, as sketched below.
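A compact sketch of this feature (sentiCount plus Equation 5.4, each normalized); the normalizers are passed in precomputed and all names are illustrative.

def sentiment_relevance(da_words, senti_score, debate_senti_words, debate_score_sum):
    # sentiCount: number of sentiment-carrying words, normalized over the debate.
    senti_count = sum(1 for w in da_words if senti_score.get(w, 0.0) != 0.0)
    # Equation 5.4: summed word sentiment scores, normalized over the debate.
    senti_sum = sum(senti_score.get(w, 0.0) for w in da_words)
    return senti_count / debate_senti_words + senti_sum / debate_score_sum

senti = {"brave": 0.5, "cruel": -0.6}
print(sentiment_relevance(["he", "is", "brave"], senti, 50, 12.0))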

5.1.4 Positional and Coverage Relevance

5.1.4.1 Sentence Position

Sentence position plays an important role in predicting whether a DA appears in the summary. In debates, the initial and ending DAs of a post are more important than the middle ones. We therefore use Equation 5.5 to compute a position-based score which is higher for initial and ending sentences than for middle ones; the score is normalized by dividing it by the number of DAs in the post2.

posScore_{DA} = \frac{\left| N/2 - DA_{position} \right|}{N}, \quad N = \text{total DAs in the post}    (5.5)
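Equation 5.5 is easy to verify with a quick sketch; the post length used here is arbitrary.

def position_score(da_position, n_das_in_post):
    # Higher near the start and end of a post, zero in the middle (Eq. 5.5).
    return abs(n_das_in_post / 2 - da_position) / n_das_in_post

n = 10
print([round(position_score(i, n), 2) for i in range(n)])
# [0.5, 0.4, 0.3, 0.2, 0.1, 0.0, 0.1, 0.2, 0.3, 0.4]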

5.1.4.2 Sentence Length

As longer sentences tend to contain more information, we use sentence length as a document-context feature. It also penalizes short sentences (fewer than 5 words), which are less likely to contribute to the summary because they are incomplete or carry little information. Sentence length is the number of words in the DA, normalized by the number of words in the whole debate. We sum the sentence position and sentence length scores to compute the positional and coverage relevance feature score of a DA. Note that all values are normalized over all DAs in the debate so that the different feature scores are comparable.

5.1.5 Calculating Relevance of a Dialogue Act

After generating all the aforementioned features, we calculate the score of a DA as their linear combination; Equation 5.6 assigns a score to each DA s.

score(s) = \lambda_{topicRel} \cdot topicRel(s, topics) + \lambda_{docRel} \cdot docRel(s, D) + \lambda_{sentiRel} \cdot sentiRel(s) + \lambda_{pcRel} \cdot pcRel(s, D)    (5.6)

where λ_{topicRel}, λ_{docRel}, λ_{sentiRel} and λ_{pcRel} are the weights assigned to each feature. A grid search is used to compute the best weight values, and the top-ranked DAs are chosen until the summary length constraint is satisfied. Grid search exhaustively searches through a manually specified subset of the parameter values; since our search space is small ([0, 1] in intervals of 0.1 for each weight parameter), the time taken by the grid search to estimate the weights was very small. A small sketch of this scoring and search is given below. Next, we describe the experimental setup used to test our system.

2A post represents a user argument and consists of multiple DAs.
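The following sketch shows the linear combination of Equation 5.6 and the exhaustive grid search over the weights; the evaluation callback standing in for the ROUGE-based selection is hypothetical.

import itertools

def da_score(features, weights):
    # features/weights: (topicRel, docRel, sentiRel, pcRel), as in Eq. 5.6.
    return sum(f * w for f, w in zip(features, weights))

def grid_search(evaluate):
    # Exhaustive search over [0, 1] in 0.1 steps for each of the four weights.
    steps = [i / 10 for i in range(11)]
    best, best_weights = float("-inf"), None
    for weights in itertools.product(steps, repeat=4):
        quality = evaluate(weights)  # e.g. mean ROUGE of resulting summaries
        if quality > best:
            best, best_weights = quality, weights
    return best_weights

# Usage (hypothetical callback):
# best = grid_search(lambda w: rouge_of_summary_built_with(w))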

5.2 Experimental Setup

In this study, we extracted 10 online debate discussions from www.convinceme.net. These discussions are freely available on the aforementioned site, and Table 5.3 shows the statistics of the dataset used. Each discussion focuses on a different topic, allowing us to produce results over various domains.

Number of users   Number of posts   Number of DAs
1168              1945              23681

Table 5.3 Statistics of the dataset

For evaluation, extractive gold-standard summaries were created by 2 language editors, who were asked to create 500, 1000, 1500 and 2000 word summaries. Inter-editor agreement was calculated to be 71.7%3. The editors were asked to select sentences in the following order of preference:

1. Sentiment-rich sentences containing highly topic-relevant information.

2. Sentiment-rich sentences with relevant information (low noise).

3. Sentences with less subjective content but rich in information.

4. Highly subjective sentences with no relevant information, as well as purely factual statements, should be selected with care, since they add noise without taking any particular stand.

All evaluation scores are computed using ROUGE [35], which stands for Recall-Oriented Understudy for Gisting Evaluation. It has been widely used by DUC to evaluate system summaries. ROUGE measures summary quality by counting overlapping units, such as n-grams, word sequences and word pairs, between system summaries and human summaries. Three automatic evaluation methods, ROUGE-1, ROUGE-2 and ROUGE-L, were chosen to calculate scores; they compute unigram recall, bigram recall and longest common subsequence respectively (a sketch of the ROUGE-1 idea follows the experiment list below). We conducted the following experiments:

1. Comparison of DEBSumm summaries with proven baseline and state-of-the-art summarization systems, explained in Section 5.3.

2. Effect of variable summary size on DEBSumm and state-of-the-art systems.
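For intuition only, the unigram-recall idea behind ROUGE-1 can be sketched as follows; the experiments themselves use the ROUGE toolkit [35], not this simplification.

from collections import Counter

def rouge1_recall(system, reference):
    # Fraction of reference unigrams matched by the system summary.
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("batman is the better hero", "batman is a braver hero"))
# 3 of 5 reference unigrams are matched -> 0.6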

5.3 Results and Discussion

Grid search was used to compute the best parameter values for Equation 5.6. The following values gave the best results, as indicated by ROUGE: λ_{topicRel} = 0.3, λ_{docRel} = 0.1, λ_{sentiRel} = 0.5, λ_{pcRel} = 0.1.4

3The number of common sentences was averaged over the complete set of debates.
4All further experiments were conducted using these values.

The scores show that better summaries are obtained when sentiment-rich sentences are selected. Furthermore, sentiments directed towards the topic words, or co-occurring with the topics, play an important role in scoring the DAs. These trends follow from the calculated weights of the feature sets, where sentiment relevance features receive the highest weight, followed by topic relevance features. Other measures like sentence position and length fine-tune the summaries, as they help differentiate between otherwise similar sentences. The low weight of the document relevance score is understandable, since it is largely redundant for identifying sentiment-rich document words. We compared our system (DEBSumm) to the following systems:

1. Max-length [16]: The longest sentence from each user is selected; if the summary falls short of the required length, the second-longest sentences are selected, and this step is iterated until the summary reaches the required length. This is a proven strong baseline for conversation summarization.

2. Lead [63]: The top sentences from each user are selected, where each sentence must be longer than 4 words; if the summary falls short of the required length, the next sentence is selected, and this step is iterated until the summary reaches the required length.

3. pHAL [26]: The sentence score is calculated by combining the pHAL scores of the sentence's words, where the pHAL score of each word and the sentence score are computed as follows:

pHAL(w) = n(w) \cdot \sum_{w' \in ET} \frac{HAL(w' \mid w)}{K}

Score(S) = \sum_{w_i \in S} P(w_i) \cdot pHAL(w_i)

For summary creation, the top-scored sentences are selected from the sorted list of sentences.

4. tf-idf [1]: Sentences are scored by combining the tf-idf measures of their words5. For summary creation, the top-scored sentences are selected from the sorted list of sentences.

5. OpinionSumm [62]:6 A sentence scoring approach where sentences are scored based on their document similarity, topic relevance, sentiment relevance and length. We used the same parameter values experimentally determined in their work. This is a state-of-the-art opinion summarization system.

In the field of generic summarization, systems 2 and 4 are proven strong baselines and system 3 is a state-of-the-art system. Table 5.4 shows the ROUGE scores (average F-measure) of the different systems for a summary size of 1000 words. Note that each of systems 1, 2, 3 and 4 corresponds to one of the lower-weighted components of the function used to compute our (DEBSumm) scores, while OpinionSumm represents the higher-weighted sentiment component of DEBSumm. The results show that

5Each user discussion is considered a single document when calculating tf-idf values.
6Note that OpinionSumm is the name used to refer to this system throughout this thesis only.

Table 5.4 ROUGE Scores (Average F-measure) of System Summaries (1000 words)

System        ROUGE-1   ROUGE-2   ROUGE-L
Max-Length    0.49892   0.18453   0.48343
Lead          0.49068   0.14759   0.47839
pHAL          0.48985   0.16468   0.46955
tf-idf        0.49922   0.17585   0.48035
OpinionSumm   0.51631   0.20364   0.49849
DEBSumm       0.56833   0.27044   0.55326

DEBSumm comprehensively outperforms the state-of-the-art systems, with an improvement of 5.2% (ROUGE-1), 7.3% (ROUGE-2) and 5.5% (ROUGE-L) over OpinionSumm. These results show that sentiment, both topic-directed and independent of it, is a very important factor in computing effective summaries.

Figure 5.1 ROUGE-2 (Average F-measure) scores vs. summary size (in words)

Evaluating systems over variable summary sizes allows us to judge them over a wide range of summary lengths: shorter summaries require higher precision, while longer summaries require higher recall. As the summary size increases, the number of sentences that add novel relevant information decreases, so the rate of change in scores is not significant. However, in Figure 5.1 we find a slight decrease in the scores of OpinionSumm and DEBSumm from 500 to 1000 words. We believe this behavior is caused by the inclusion of new noisy data rather than relevant data, which suggests that, at larger sizes, more weight should be given to structural and document features over features representing sentiments. Overall, Figure 5.1 shows that DEBSumm consistently outperforms the other systems over different summary sizes.

5.4 Conclusion and Future Work

Sentiment-based features are the most important features in the summarization of online conversations; hence, focus is often laid on sentiment mining techniques, and advances in sentiment mining can indeed improve the quality of mining sentiment-laden text. However, certain features are prevalent in all kinds of text, and the inferential feature is one of them. An inferential feature captures the meaning of a given text, so any meaningful text must have one. In this work we show that using a powerful inferential feature like HAL along with sentiment-based features is not only helpful but improves the final result. Though the HAL feature is a single feature in the complete feature set, it was assigned a significant weight. This shows that although such features may play a lesser role in the overall task of opinion summarization, they nevertheless play an important part, enough to be used for the current task. We conclude that a contextual feature in the form of HAL may not be the most important feature for every flavor of summarization, but it is a necessary component that should not be disregarded.

As for future work, this work can propagate in two directions. From the perspective of summarization, the usefulness of the HAL feature can be investigated further by applying it to different textual domains. Specific to this work, we average the sentiment scores over all senses of a word, because of the poor state of current word sense disambiguation, and this will not work in all cases: some words carry different sentiments in different domains (for example, the word 'refined' is good for oil products but bad in the domain of agricultural products). Therefore, word sense disambiguation and domain-specific sentiment analysis can be incorporated into our system. We can also include debate structure features; these could leverage DAs occurring alongside a high-scoring DA, identify related DAs spanning different users, and help identify relevant DAs more effectively.

Chapter 6

Conclusions

In this thesis we worked on enhancing the quality of summaries using a powerful computational model, HAL, based on conceptual spaces, a cognitive model by Gardenfors. This model can represent text such that the representation retains the inferential properties of the text: a representation of a textual entity T is said to be inferential if it conveys the meaning of T consistent with T's context. One of the major challenges was to adapt HAL to query-independent summarization, deviating from its previous usage in query-dependent summarization. This adaptation was possible because HAL supports concept combination, through which new concepts can be defined from existing ones. By computationally combining word vectors we formed sentence vectors, and the resulting vectors were rich and unique concepts in the original HAL space. After creating this efficient representation, we used it for the task of creating summaries. Prior to actual summary formation, we laid down a conjecture defining a suitable summary, which emphasizes maximizing the retention of senses (as concepts) in the summary. Summaries were formed using two metrics based on the rank and the weight of each sentence vector for each sense; sentences were added to the summary such that they ranked or weighed high in as many senses as possible. To handle redundancy, we remove the senses (concepts) already covered in the summary and choose new sentences based on the remaining senses. The experimental results showed that retaining inferential properties in the textual representation can improve summary quality. On manual analysis, we found that our summaries made more sense than those of other systems and could convey the information in the source documents, albeit without specific details. For future work, novel metrics for ranking sentences can be proposed to improve the sentence ranking algorithm; another direction is to weight the dimensions by their own inferential quality, which would help boost the sense-carrying words of the document.

After this, the next step was to build a multilingual summarizer on top of the generic summarizer. For this task, we combined monolingual summaries based on their individual quality, and we used three different text summarization methods to form the final multilingual summaries. From our study of

two aspects of multilingual summarization, first, the effect of added noisy information, and second, the identification of summarization techniques suitable for multilingual summarization, we came to the following conclusions:

1. Although the new (added) information is noisy, it improves the quality of the summary. Moreover, summaries benefit from using documents in more languages: the core information is retained by the documents in all the languages, guaranteeing its inclusion in the final summary, and the interaction between multiple languages helps fill the gaps in information.

2. Through manual analysis we found that sentences from local (less popular) languages covering an international topic are much more succinct than those from an international (more popular) language. For example, in news related to the bilateral relations between North Korea and the United States, Telugu articles were shorter yet covered all the salient points present in the English articles on the same topic. However, we have not given preference to languages based on their usage, and leave that as future work with regards to the first aspect of this study.

3. Noise percolates into the input documents through the translated text1, which lacks cohesion and coherence.

4. Methods which capture contextual information proved better for the task of multilingual summarization. Our summarizer (CMDS), based on the HAL representation, performed slightly worse than the graph-based TextRank algorithm because we also use structural aspects of the document in our representation; we attribute the lower performance of CMDS to the distortion of sentence structure in the translated text. We also found that using latent dimensions is not useful, because the ranking of latent topics differs across languages, causing a mismatch when computing multi-lingual summaries. However, some methods (especially lexical-chain based ones) are yet to be applied to multilingual summarization; we leave that as future work with regards to the second aspect of this study.

Evaluation of multilingual summaries proved challenging because the creation of human model summaries is more expensive and demands higher skills: the annotator must be proficient in all the languages of the source documents being summarized. Evaluation tools which do not use human models are therefore required, yet few such methods are available. We used the Jensen-Shannon divergence measure to evaluate our systems; this measure has been shown to have medium to high correlation with model-based evaluation methods such as ROUGE and Pyramid scores. Nevertheless, the development of evaluation tools for multi-lingual summarization is another important aspect which must be addressed with due importance.

In the final part of this thesis, we used the HAL representation to form summaries in another domain, namely online conversations. HAL scores were used to calculate the topic relevance scores of the sentences. Topics were defined as the two opposing stances of the debate, which were further boosted

1Translation was done using the online translation tool by Google, http://www.translate.google.com

by an extended list of words extracted from Wikipedia. In this work, HAL is used similarly to previous summarization methods, with topic terms in place of query terms. Another important feature was the sentiment score of each sentence. The sentiment feature was important because debates consist of personal opinions supported by facts; to capture the opinion aspect, the inclusion of sentiment information is necessary. Other superficial features based on length and coverage were used to fine-tune the sentence scores.

Evaluation was done using ROUGE scores. The results show that topic relevance alone cannot be effective for tasks where sentiments are involved, but it still improves the quality of the final summaries and is as important as the sentiment features. From a multilingual perspective, this particular task is more difficult because of the involvement of sentiments: we can identify topic relevance, but calculating sentiment relevance requires a sentiment lexicon for the text's language. This issue, however, is out of the scope of text summarization and belongs to another information retrieval field, Sentiment Analysis.

Overall, the task of summarization is a key cog in the solution to information overload, and imparting meaning to it boosts its efficiency. Our work is a significant step towards creating rich, meaningful summaries. Multi-lingual summarization is one such form on which we have worked in this thesis; the results show promise, and the work showed that this form of summarization involves many subproblems, evaluation being a major one. The evaluation of multilingual summaries requires immediate attention for this field to grow. Applying summarization to another domain, namely online conversations, consolidates HAL as a powerful mode of representation which can be explored for other flavors of summarization.

Related Publications

• Ranade, Sarvesh, Jayant Gupta, Vasudeva Varma, and Radhika Mamidi. "Online debate summarization using topic directed sentiment analysis." In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, p. 7. ACM, 2013.

• Vasudeva Varma, Sudheer Kovelamudi, Jayant Gupta, Nikhil Priyatam, Arpit Sood, Harshit Jain, Aditya Mogadala and Srikanth Reddy Vaddepally. IIIT Hyderabad in Summarization and Knowledge Base Population at TAC 2011. In Proceedings of Text Analysis Conference (TAC 11), National Institute of Standards and Technology, Gaithersburg, Maryland, USA, 2011.

Bibliography

[1] C. Aone, M. E. Okurowski, and J. Gorlinsky. Trainable, scalable summarization using robust nlp and machine learning. In Proceedings of the 17th international conference on Computational linguistics - Volume 1, pages 62–66. Association for Computational Linguistics, 1998.
[2] S. Baccianella, A. Esuli, and F. Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA), 2010.
[3] R. Barzilay, M. Elhadad, et al. Using lexical chains for text summarization. In Proceedings of the ACL workshop on intelligent scalable text summarization, volume 17, pages 10–17, 1997.
[4] J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336. ACM, 1998.
[5] H.-H. Chen, J.-J. Kuo, and T.-C. Su. Clustering and visualization in a multi-lingual multi-document summarization system. In Advances in Information Retrieval, pages 266–280. Springer, 2003.
[6] H.-H. Chen and C.-J. Lin. A multilingual news summarizer. In Proceedings of the 18th conference on Computational linguistics - Volume 1, pages 159–165. Association for Computational Linguistics, 2000.
[7] H. Dalianis, M. Hassel, J. Wedekind, D. Haltrup, K. de Smedt, and T. C. Lech. Automatic text summarization for the Scandinavian languages. Nordisk Sprogteknologi, pages 2000–2004, 2002.
[8] M. De Marneffe, B. MacCartney, and C. Manning. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454, 2006.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
[10] H. Edmundson. New methods in automatic extracting. Journal of the ACM (JACM), 16(2):264–285, 1969.
[11] G. Erkan and D. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. (JAIR), 22:457–479, 2004.
[12] D. K. Evans, J. L. Klavans, and K. R. McKeown. Columbia newsblaster: multilingual news summarization on the web. In Demonstration Papers at HLT-NAACL 2004, pages 1–4. Association for Computational Linguistics, 2004.

[13] C. Fellbaum. Wordnet. Theory and Applications of Ontology: Computer Applications, pages 231–243, 2010.
[14] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics, 2005.
[15] P. Gärdenfors. Conceptual spaces: The geometry of thought. MIT press, 2004.
[16] D. Gillick, K. Riedhammer, B. Favre, and D. Hakkani-Tur. A global optimization framework for meeting summarization. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4769–4772. IEEE, 2009.
[17] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–25. ACM, 2001.
[18] M. Hassel and H. Dalianis. Portable text summarization. In Applied Natural Language Processing: Identification, Investigation and Resolution, number 1. IGI Global, 2011.
[19] T. He, F. Li, and L. Ma. Document relevance identifying and its effect in query-focused text summarization. In Granular Computing (GrC), 2010 IEEE International Conference on, pages 206–211. IEEE, 2010.
[20] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He. Document summarization based on data reconstruction. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[21] P. Herings, G. Van der Laan, and D. Talman. Measuring the power of nodes in digraphs. October 5, 2001.
[22] H. Holzer, W. Bee, D. Nogueira, K. Semolini, C. Martin, M. Aiken, S. Balan, J. Zetzsche, S. F. Avval, M. Carl, et al. An analysis of Google Translate accuracy.
[23] E. Hovy and C.-Y. Lin. Automated text summarization and the summarist system. In Proceedings of a workshop held at Baltimore, Maryland, October 13–15, 1998, pages 197–214. Association for Computational Linguistics, 1998.
[24] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM, 2004.
[25] M. Hu, A. Sun, and E. Lim. Comments-oriented document summarization: understanding documents with readers feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 291–298. Citeseer, 2008.
[26] J. Jagadeesh, P. Pingali, and V. Varma. A relevance-based language modeling approach to duc 2005. In Proceedings of Document Understanding Conferences (along with HLT-EMNLP 2005), Vancouver, Canada, 2005.
[27] J. Jagarlamudi, P. Pingali, and V. Varma. Query independent sentence scoring approach to duc 2006. In Proceedings of Document Understanding Conference (DUC-2006), 2006.

[28] H. Jing, R. Barzilay, K. McKeown, and M. Elhadad. Summarization evaluation methods: Experiments and analysis. In AAAI Symposium on Intelligent Summarization, pages 51–59, 1998.
[29] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.
[30] K. Knight and D. Marcu. Statistics-based summarization - step one: Sentence compression. In AAAI/IAAI, pages 703–710, 2000.
[31] L.-W. Ku, Y.-T. Liang, and H.-H. Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of AAAI-2006 spring symposium on computational approaches to analyzing weblogs, volume 2001, 2006.
[32] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 68–73. ACM, 1995.
[33] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford's multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Association for Computational Linguistics, 2011.
[34] J. Li, L. Sun, C. Kit, and J. Webster. A query-focused multi-document summarizer based on lexical chains. In Proceedings of the Document Understanding Conference, Rochester. NIST, 2007.
[35] C. Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, 2004.
[36] C. Lin and E. Hovy. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 457–464. Association for Computational Linguistics, 2002.
[37] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In S. S. Marie-Francine Moens, editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[38] A. Louis and A. Nenkova. Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, pages 306–314. Association for Computational Linguistics, 2009.
[39] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165, 1958.
[40] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28(2):203–208, 1996.
[41] L. Ma, T. He, F. Li, Z. Gui, and J. Chen. Query-focused multi-document summarization using keyword extraction. In Computer Science and Software Engineering, 2008 International Conference on, volume 1, pages 20–23. IEEE, 2008.

[42] I. Mani. Summarization evaluation: An overview. 2001.
[43] I. Mani and M. Maybury. Advances in automatic text summarization. MIT press, 1999.
[44] R. Mihalcea. Language independent extractive summarization. In Proceedings of the ACL 2005 on Interactive poster and demonstration sessions, pages 49–52. Association for Computational Linguistics, 2005.
[45] R. Mitkov et al. The Oxford handbook of computational linguistics. Oxford University Press, Oxford, 2003.
[46] H. Morita, T. Sakai, and M. Okumura. Query snowball: a co-occurrence-based approach to multi-document summarization for question answering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 223–229. Association for Computational Linguistics, 2011.
[47] A. Nenkova, R. Passonneau, and K. McKeown. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2):4, 2007.
[48] V. Ng, S. Dasgupta, and S. Arifin. Examining the role of linguistic knowledge sources in the automatic identification and classification of reviews. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 611–618. Association for Computational Linguistics, 2006.
[49] W. Ogden, J. Cowie, M. Davis, E. Ludovik, S. Nirenburg, H. Molina-Salgado, and N. Sharples. Keizai: An interactive cross-language text retrieval system. In Proceedings of the MT SUMMIT VII workshop on machine translation for cross language information retrieval, volume 416, 1999.
[50] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: bringing order to the web. 1999.
[51] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics, 2004.
[52] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[53] D. Radev, H. Jing, M. Styś, and D. Tam. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919–938, 2004.
[54] H. Saggion. Multilingual multidocument summarization tools and evaluation. In Proceedings of LREC, volume 2006, 2006.
[55] H. Saggion, J.-M. Torres-Moreno, I. d. Cunha, and E. SanJuan. Multilingual summarization evaluation without human models. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1059–1067. Association for Computational Linguistics, 2010.
[56] G. Salton, A. Singhal, M. Mitra, and C. Buckley. Automatic text structuring and summarization. Information Processing & Management, 33(2):193–207, 1997.

[57] D. Shen, J. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI, volume 7, pages 2862–2867, 2007.
[58] D. Song and P. Bruza. Discovering information flow using high dimensional conceptual space. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 327–333. ACM, 2001.
[59] D. Song, P. Bruza, and R. Cole. Concept learning and information inferencing on a high dimensional semantic space. ACM/SIGIR, 2004.
[60] C. Speier, J. S. Valacich, and I. Vessey. The influence of task interruption on individual decision making: An information overload perspective. Decision Sciences, 30(2):337–360, 1999.
[61] X. Wan and J. Yang. Multi-document summarization using cluster-based link analysis. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 299–306. ACM, 2008.
[62] D. Wang and Y. Liu. A pilot study of opinion summarization in conversations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), 2011.
[63] M. Wasson. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 1364–1368. Association for Computational Linguistics, 1998.
[64] M. J. Witbrock and V. O. Mittal. Ultra-summarization (poster abstract): a statistical approach to generating highly condensed non-extractive summaries. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 315–316. ACM, 1999.
[65] J. Zhang, X. Cheng, and H. Xu. Gspsummary: a graph-based sub-topic partition algorithm for summarization. Information Retrieval Technology, pages 321–334, 2008.
[66] L. Zhao, L. Wu, and X. Huang. Using query expansion in graph-based approach for query-focused multi-document summarization. Information Processing & Management, 45(1):35–41, 2009.
[67] Q. Zhou, L. Sun, and J. Nie. Is sum: A multi-document summarizer based on document index graphic and lexical chains. DUC2005, 2005.
