Enhancing Summaries with Conceptual Spaces

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (by Research) in Computer Science and Engineering

by

Jayant Gupta
200802018
[email protected]

Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
October 2013

Copyright © Jayant Gupta, 2013
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Enhancing Summaries with Conceptual Spaces” by Jayant Gupta, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                Adviser: Prof. Vasudeva Varma

To Curiosity

Acknowledgments

First and foremost, I wish to thank Prof. Vasudeva Varma for being my advisor and guide in my research work. His presence gave me support, his advice gave me direction and his belief gave me the motivation to pursue research with utmost dedication. I thank Sudheer Kovelamudi for advising me in the initial stages of my research work. I thank Aditya Mogdala, Kushal Dave, Sambhav Jain and Nikhil Priyatam for giving me valuable feedback and guidance during the most critical times of my research. I thank Riayz Ahmad Bhat, who helped me develop my writing skills. I thank Sarvesh Ranade, with whom I worked and had a great learning experience during the final stages of my research. I thank my batchmates for the great times spent together during the course of my stay at IIIT. I thank Akshay Mani Agarwal for being a brother when I needed one. I thank the Sports Fraternity of IIIT, especially Kamalakar sir, for helping me nurture my passion for sports. I thank all the members of the SIEL lab, especially Ajay Dubey and Harshit Jain, for making the period of research joyous and satisfying. Finally, I thank my parents for their support and their faith in me during my research. They gave me freedom and stood by my decisions. In the end, their patience with me helped me do justice to my research work and give it quality.

Abstract

Library science is the predecessor of present-day information retrieval (IR) technology. For decades, libraries were the source of information and knowledge for one and all. A library meant a mammoth structure, home to thousands of books and journals. People travelled from faraway lands to the great libraries, where they learned and later contributed, and in turn became sources of information for others. In present-day society we find a paradigm shift. Today, thousands of books can be stored on a handheld device or a personal computer, like our own personal library. Furthermore, the Internet has made knowledge easily accessible to every person. With time, Internet services have matured and people have become more comfortable sharing and contributing content through this medium. This has resulted in multiple sources of information on any single topic, with no restriction on the language of each source. So, we now have many sources in multiple languages.

This has led to a whole new set of problems that need to be solved by the IR community. The main focus of these problems is the management of vast amounts of information: figuring out what methods can be used to understand and impart structure to the information. Furthermore, individual needs play an important role in deciding information management strategies. The focus has thus shifted from getting information to getting the right information. This work is a step in that direction. We address the problem of Text Summarization and its multi-lingual solution. Although text summarization is a relatively old problem, the Internet age has given it a new direction and importance. Therefore, summarization methods need to be improved and novel solutions are needed.

In our methodology, we initially focus on moving from a heuristic based representation of text to a meaningful one. We have used Hyperspace Analogue to Language (HAL) to represent the text; it is a computational model based upon a cognitive model called Conceptual Spaces. The properties of conceptual spaces allow us to represent words and sentences in the same space, called the HAL space. We then model the problem of summarization as selecting the set of sentences which can represent the source text in the most meaningful manner. To handle redundancy in summaries we propose a novel mechanism which is effective in the HAL space. Our method is language independent, making it scalable over different languages. We provide useful insights into the formation of a conceptual space using textual examples, and into the behavior of our metrics using intrinsic experiments. Intrinsic experiments and extrinsic evaluation were conducted on the DUC 2001 and DUC 2002 datasets. The results of extrinsic evaluation show that the quality of summaries is preserved over summary size, and that the system outperforms previous state-of-the-art systems for longer summaries while being comparable for shorter summaries.

Multilingual summarization is a relatively new field in text summarization. We focused on studying two of its aspects: first, "added noisy information" (related to the number of languages of the source documents) and second, the suitability of monolingual summarizers in a multilingual domain. For our work, we use automatic translation systems along with four generic summarizer systems (including CMDS). These summarizers are used to generate monolingual summaries (separately) in different languages. The quality of a summary (for each language) is obtained by the Jensen-Shannon divergence between the summary distribution and the input distribution. To form a multilingual summary, weights proportional to the quality are used to combine the monolingual summaries. This work covers three languages, namely English, Hindi and Telugu. The experimental results are encouraging and show that as the number of interacting languages increases, the quality of multilingual summaries improves. We also find that, compared to structural methods, contextual methods are more suitable for the task of multilingual summarization.

Finally, to show that HAL features are effective for summarization tasks other than generic summarization, we use them as one of the key features to form summaries of on-line conversations in the domain of debates. The experimental results (ROUGE scores) show that our summaries are better than those of the previous state-of-the-art system. One major difference between our approach and the previous approach was the use of HAL features to create summaries. This shows that adding HAL features to sentiment related features helps to summarize sentiment rich text.

To conclude, we explain the need for a meaningful representation of text to improve summary quality. Our work establishes HAL as a quality representation of text, useful for the task of summarization. We also give a summary formation conjecture, and the summaries thus formed are highly effective, improving as the summary size increases. We also show that multilingual summarization is not only needed but is useful in solving the problem of information overload. Our work brings out various challenges involved in the task of multilingual summarization, especially the evaluation of multilingual summaries. This work adds the component of multilingual summarization to the solution of information overload.

Contents


1 Introduction
  1.1 Generic Text Summarization
  1.2 Multilingual Summarization
  1.3 Evaluation of Summaries
  1.4 Problem Description
  1.5 Overview of our approach
    1.5.1 Generic Summarization
    1.5.2 Multilingual Summarization
    1.5.3 Summarization of Online Conversations in the domain of Debates
  1.6 Contributions of this work
  1.7 Thesis Organization

2 Related Work
  2.1 Types of Summarization
  2.2 Generic Summarization
    2.2.1 Feature based methods
    2.2.2 Graph based methods
    2.2.3 Lexical chain based methods
    2.2.4 Other relevant methods
    2.2.5 HAL based methods
  2.3 Multilingual Summarization
  2.4 Summarization of Online Conversations in the domain of Debates
  2.5 Summary Evaluation
    2.5.1 ROUGE
      2.5.1.1 ROUGE-N
      2.5.1.2 ROUGE-L
      2.5.1.3 ROUGE-SU*
    2.5.2 Jensen-Shannon divergence
  2.6 Concluding Remarks

3 Multi-Document Summarization Using Conceptual Spaces
  3.1 Motivation of our Approach
  3.2 Text Representation Overview
  3.3 Conceptual spaces as a representative model
    3.3.1 Gärdenfors' Conceptual Spaces
    3.3.2 Forming Conceptual Spaces using HAL
    3.3.3 Sentences in Conceptual Space
  3.4 Conceptual Multi-Document Summarization (CMDS)
    3.4.1 Principle
    3.4.2 Metrics
    3.4.3 Redundancy Removal
  3.5 Experimental Setup
  3.6 Results and Discussion
    3.6.1 Intrinsic Experiments
      3.6.1.1 Effect of variable window size
      3.6.1.2 Effect of variable metrics
    3.6.2 Extrinsic Evaluation
  3.7 Summary and Conclusion

4 Multilingual Multidocument Text Summarization
  4.1 MultiLingual Summarization using Jensen-Shannon Divergence
    4.1.1 Translation
    4.1.2 Generic Summarizers
    4.1.3 Jensen-Shannon (JS) Divergence
    4.1.4 Final Summary
      4.1.4.1 Redundancy Removal
  4.2 Dataset and Evaluation Metric
  4.3 Experiments
  4.4 Results and discussion
  4.5 Conclusion and Future Work

5 Summarization of Online Conversations in the domain of Debates
  5.1 Approach Used
    5.1.1 Calculating Topic Relevance
      5.1.1.1 Topic Directed Sentiment Score
      5.1.1.2 Topic Co-occurrence Measure
    5.1.2 Calculating Document Relevance
    5.1.3 Calculating Sentiment Relevance
    5.1.4 Positional and Coverage Relevance
      5.1.4.1 Sentence Position
      5.1.4.2 Sentence Length
    5.1.5 Calculating Relevance of a Dialogue Act
  5.2 Experimental Setup
  5.3 Results and Discussion
  5.4 Conclusion and Future Work

6 Conclusions

Bibliography

List of Figures


3.1 Concept combination in a 3-dimensional conceptual space where the combined concept is more refined
3.2 Schematic overview of the complete system
3.3 Representation of documents and summary in a 3-dimensional conceptual space
3.4 Summary quality v/s window size (K)
3.5 Summary quality v/s xth root of rank metric
3.6 Summary quality v/s yth root of weight metric
3.7 Graphical representation of ROUGE scores for all the systems
3.8 Final CMDS Summary for DUC2001: d15c

4.1 Added Noisy Information
4.2 Architecture of the system

5.1 ROUGE-2 (Average F-measure) scores v/s Summary Size (in words)

List of Tables


2.1 Types of Summarization

3.1 Average F-measure (ROUGE-2) scores for various state-of-the-art systems

4.1 JS Divergence of monolingual summaries
4.2 JS Divergence of bilingual summaries
4.3 JS Divergence of trilingual summaries

5.1 Argument Structure Examples
5.2 List of Dependency Relations
5.3 Statistics of the dataset
5.4 ROUGE Scores (Average F-measure) of System Summaries (1000 words)

Chapter 1

Introduction

The Internet provides a pool of knowledge where information on any topic is present in abundance. With the evolution of the Internet, the accessibility of information has increased and people are becoming comfortable with digital information. They have started contributing by means of social blogs, articles and on-line social media. Moreover, the Internet has (virtually) dissolved the boundaries of nations and languages. People with varied linguistic preferences are accessing the web and contributing in their preferred language. This accessibility is leveraged by advancements in information retrieval techniques.

Easy access to such large amounts of information leads to the problem of information overload. According to Spier et al. [60], information overload occurs when the amount of input to a system exceeds its processing capacity. Decision makers have fairly limited cognitive processing capacity; consequently, when information overload occurs, a reduction in decision quality is likely. Therefore, quality text tools which can help in the management of information have become an important need of modern information retrieval systems. In real applications, Google News 1 shows small snippets that help readers decide whether a news item is important enough to read. Modern search engines like Google and Bing 2 also show info-boxes, which are a small window onto the complete search results. For many users the info-box may suffice, fulfilling their information need with less information. This also means less unwanted information for the user. Thus, any tool which can give a shorter yet accurate description helps to manage information.

According to Wikipedia 3, India ranks 3rd in the number of Internet users, after the United States (2nd) and China (1st). Ironically, India ranks 164th in Internet penetration (12.6%), way behind the United States (28th) and China (102nd). Presently there are 22 languages recognized by the Constitution of India. According to the 2001 Census of India, 10.35% of the total Indian population were English speakers. The 2005 India Human Development Survey (from surveyed households) reported that among men 72% do not speak English, 28% speak at

1 http://www.news.google.com
2 http://www.bing.com
3 http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users

least some English, and 5% are fluent. Among women, the corresponding proportions were 83%, 17% and 3%. Low penetration combined with a large number of users suggests vast future opportunities for web based services in India. These statistics show that an increase in Internet penetration is required. The physical requirements (infrastructure, fiber cables, wireless connectivity and devices) for this penetration are outside the scope of the Information Retrieval community. However, our major concern is the services which can be provided once connectivity is improved in remote areas. Web based services which cater to the local linguistic requirements of these people will be in demand, and tools which are scalable over multiple languages will be required for information management purposes. The use of multiple languages adds the multi-lingual aspect to the problem of information overload. This work focuses on the field of Text Summarization and its multi-lingual solution. Summarization is one of the key fields in the information retrieval domain and is used to manage large sets of information.

1.1 Generic Text Summarization

Generic text summarization refers to the computational generation of summaries which cover the most important points of the source document(s). An efficient summary gives a succinct, non-redundant overview of the documents without expanding on specific details. Automatic text summarization is a complex and challenging area and significant research has been done in it. Previous work can be categorized into different types depending on the way the summaries are generated. Some of these include extractive vs. abstractive, single document vs. multi-document, language specific vs. multi-lingual, query dependent vs. query independent, supervised vs. unsupervised, etc. [43]. Principally, unsupervised summarization approaches are broadly classified into graph based, feature based and lexical chain based approaches [25]. Graph based approaches [25, 61, 11] depend upon the rationale that similar sentences should contain identical words. Feature based approaches [10, 32] depend upon all the characteristics which can be used to distinguish two textual entities. Lexical chain based approaches [3, 67, 34] create lexical chains using available knowledge sources (like WordNet [13]).

Over the years, the problem has been modeled in various forms, resulting in different methods to solve it. Initial approaches were based upon sentence extraction; later approaches incorporated various language specific features, which made the summaries more robust. Advancements in natural language generation also allowed automatic sentence creation, which led to abstractive summarization techniques. Witbrock et al. [64] use extraction to obtain important summary words and then use a bi-gram language model to form sentences. Other approaches shorten sentences using sentence reduction rules. Knight et al. [30] use expectation maximization to compress the syntactic parse tree of a sentence; the tree is used to produce a shorter but grammatically consistent version of summary sentences.

During this whole period, the means of accessing information changed with the introduction of the world wide web in the early 90's. This led to a renewed emphasis on the problem of text summarization to tackle

the problems of information overload. With the web came different variants (social media based, etc.) of use-cases where a summarization system could be employed. Accordingly, summaries were created and summarization methods evolved. A series of highly successful summarization meetings have been held in the past. Amongst them, TAC 4 (Text Analysis Conference) has been the main evaluation forum for research in text summarization. It was previously known as the Document Understanding Conference (DUC) and began in the year 2000. Various summarization tasks, ranging from non-extractive summarization, spoken language (including dialogue) summarization, language modeling for text and speech summarization, multi-document and multilingual summarization, integration of question answering and text summarization, and web-based summarization, to evaluation of summarization systems, were worked upon during the course of the DUC/TAC workshops. This resulted in a wide range of high quality generation and evaluation methods. The datasets used to evaluate the systems are often used as benchmarks to evaluate any given summarization system.

1.2 Multilingual Summarization

The Internet is accessible to all, irrespective of language. This has resulted in an extensive availability of textual data with linguistic diversity. Reading through all this information spread across languages is difficult, so an efficient way to summarize information distributed in multiple languages is needed. Multilingual text summarization is the problem of producing summaries in a language T when the input contains documents in a language S different from T along with documents in language T, or when the input to the summarizer consists of automatic translations into language T of documents in language S [54]. This is a challenging problem because summaries produced from automatic translations, using noisy input, suffer from problems additional to the lack of cohesion and coherence usually reported in text summarization research [43].

Extractive summarization requires scoring sentences based on their importance. Scoring is done using various (language independent) features like term distribution, frequency patterns, position of the sentence, length of the sentence, sentence similarity, etc. These features are effective when all the text is in one language. However, additional features, especially word level features, are required if the text contains words in different languages [6]. A major problem is to identify words which have similar meanings in different languages. The most likely solution is to use language specific tools to translate and transliterate the text. In this case, the accuracy of the translation and transliteration systems becomes a critical issue. Furthermore, the availability of these tools for languages with fewer resources is an additional problem.

Most of the previous approaches use clustering and translation of documents to form the summary. The basic idea of these techniques [5, 12, 6] is to collect similar information together using clustering techniques for every language. Then, similar clusters across different languages are found by translating the clusters in one language and identifying the corresponding cluster in another language. The final output is produced in

4 www.nist.gov/tac

the user's desired language by substituting all the sentences in different languages with similar sentences in the required language. In the past, TAC organized a pilot task 5 related to multi-lingual summarization. It provided news documents and their corresponding (human) translations in 8 different languages. The task required the systems to be able to summarize the documents in at least 3 different languages (independently of each other) with acceptable accuracy. The task was not itself multi-lingual summarization, but was framed around the basic idea that a good generic summarizer must be able to produce summaries in different languages with acceptable accuracy.

1.3 Evaluation of Summaries

Summary evaluation is an important part of the summarization field. Evaluation is difficult primarily because there is no ideal summary as such. Past studies [28] have shown that human summarizers tend to agree only about 60% of the time, and that in only 82% of the cases did humans agree with their own judgement. Apart from the human bias involved in the evaluation of summaries, such manual evaluation is also expensive and time consuming. There is always a possibility of the system generating a better summary that is different from the reference human summary used as an approximation to the ideal output.

Automatic evaluation methods are of two types: the first evaluates summaries using human models, the second evaluates without human models. Comparison with human (model) summaries to evaluate informativeness has been the more popular approach. For various summarization tasks in TAC, system summaries are evaluated using ROUGE [35] scores. ROUGE stands for Recall Oriented Understudy for Gisting Evaluation. ROUGE measures summary quality by counting overlapping units such as n-grams, word sequences and word pairs between system summaries and human model summaries. Overlap based evaluation methods usually suffer from the problems of human variability, analysis granularity and semantic equivalence [47]. The variable unit 6 sizes (to be compared) in ROUGE address the problem of analysis granularity. The problems of semantic equivalence and human variability are addressed by using multiple human summaries to evaluate system summaries.

Evaluating summaries without human models is relatively new in the field of summary evaluation. It is often thought unreliable to evaluate summaries without gold summaries. However, there are instances where generating human summaries can be a bigger challenge, especially in multi-lingual summarization. The challenge is that the annotator should be proficient in all the (source document) languages for which the summary is being generated, so both the expense and the knowledge requirement of creating manual gold summaries increase. Louis et al. [38] proposed the Jensen-Shannon divergence metric to evaluate summaries without human models. This measure was found to be highly effective in measuring the quality of summaries and showed high correlation with ROUGE scores [55].

5 http://www.nist.gov/tac/publications/2011/presentations/Summarization2011 MultiLing overview.presentation.pdf
6 A unit can be a word, collection of words, phrases, or sentence.

1.4 Problem Description

Extractive summarization is the task of building a concise excerpt of a given set of documents on the same topic. The summary should be able to convey the sense of the complete document(s) and avoid redundancy. Furthermore, the input documents can be in different languages while retaining their relevance to the common topic. Our task is to build a summarization method based on a rich, informative and meaningful text representation. The representation should be language independent, yet effective on different languages. We then extend the system to perform multi-lingual summarization, and show that the basic features of the representation can be used to leverage the quality of a different summarization process.

1.5 Overview of our approach

Creating a summary without any inference makes poor and worthless use of its source documents. A summary should retain the overall sense as well as convey the inference of its source documents. Earlier approaches to summarization lacked inferential properties, depending mostly on heuristic representations. This led to content rich sentences whose combination could not convey the same inference as the original source documents. In our approach we model the inferential properties of the text and build a robust summarization system using this representation. The detailed study of the representation and the summarization method comes under generic summarization. The next step uses this system as a part of multilingual summarization; in this part we compare our system to other systems that are based on different representations. In the final stage of the problem we worked upon a specific summarization task, summarizing conversations in the domain of on-line debates, using the basic feature of our text representation.

1.5.1 Generic Summarization

In our method we have used Hyperspace Analogue to Language (HAL) [58] to represent text (words and sentences). HAL is formed by capturing co-occurrence patterns across the text, limiting the size of the patterns by a window size. All the patterns are accumulated in a W × W matrix (W is the number of distinct words in the dataset). Each cell wt_ij of the matrix represents the contextual strength between words i and j, and each row is a vector in the conceptual space. The HAL representation allows the creation of new points (senses) in the same space by combining existing points. This property is used to form sentence vectors in the same space. The representation of sentence vectors can convey the context of any sentence and is highly unique, which can be used to disambiguate two similar sentences. Sentences are ranked based on the number and strength of the senses they convey. A summary formed in this manner carries a sense similar to that of the set of source documents. To address the redundancy issue, each time a sentence is selected we re-rank the remaining sentences based on the senses not present in the selected sentence. This increases summary coverage and removes redundancy in the

summary. Our results show that using inferential information leverages the quality of summaries, which are an improvement over previous state-of-the-art systems.
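To make the construction concrete, the sketch below builds a HAL-style co-occurrence matrix over a token stream. The linearly decaying weights follow the standard HAL scheme; the function and variable names are illustrative, not taken from our implementation.

```python
from collections import defaultdict

def build_hal(tokens, k=5):
    """Build a HAL-style co-occurrence matrix.

    For each word, every word in the preceding window of size k
    co-occurs with it; closer neighbours get a higher weight
    (k - distance + 1), as in the standard HAL scheme.
    Returns a dict-of-dicts: hal[w1][w2] = accumulated strength.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, k + 1):
            j = i - d
            if j < 0:
                break
            # weight decreases linearly with distance inside the window
            hal[word][tokens[j]] += k - d + 1
    return hal

tokens = "the summary conveys the sense of the source documents".split()
hal = build_hal(tokens, k=3)
print(dict(hal["sense"]))  # context strengths for the word "sense"
```

Each row of the resulting matrix is the word's vector in the HAL space; sentence vectors are then derived by combining these word vectors, as described above.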

1.5.2 Multilingual Summarization

A framework has been created in which generic summarizers can be used to accomplish the task of multilingual summarization. The framework is mainly used to study two aspects of multilingual summarization: the effect of added noisy information on the summaries, and the type of methods that are more suitable for the task. In the context of summarization, added noisy information refers to the process where potentially relevant information, containing syntactic errors caused by the translation step, is added to the text to be summarized. In our approach we have used the on-line machine translation system by Google 7. The documents in a given language (say T) are translated to all the other languages (say set S) for which summarization is required. So, each of the languages has translated documents from the other languages (referred to as "added noisy information"). Then, we use our generic summarizer along with three other existing state-of-the-art summarizers to generate monolingual summaries independently. The objective of using four different summarizers is to understand which of these techniques is suitable for multilingual summarization. We have also analyzed which approach is more robust against the noise introduced in the data by translation.

The combination of monolingual summaries into the final multilingual summary is based on the quality of each summary against its input. The quality assessment is done using Jensen-Shannon divergence. The final summary is a linear combination of monolingual summary parts, where the size of each part is proportional to its quality. Redundancy is an even bigger issue in multilingual multi-document summarization because the overlap of information is higher: the most relevant information is often preserved, in variable forms, across the articles related to a topic. To address redundancy here, we have used Jaccard similarity, which measures the word overlap between the summary and a new sentence to be added. Experimentally calculated thresholds are used, and sentences above the threshold are discarded.
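A minimal sketch of this Jaccard-based redundancy filter follows; the threshold value below is illustrative, whereas the thesis estimates it experimentally.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def select_non_redundant(ranked_sentences, threshold=0.4, limit=10):
    """Greedily keep sentences whose word overlap with every already
    selected sentence stays below the threshold; sentences at or above
    the threshold are discarded as redundant."""
    selected = []
    for sent in ranked_sentences:
        tokens = sent.lower().split()
        if all(jaccard(tokens, s.lower().split()) < threshold for s in selected):
            selected.append(sent)
        if len(selected) == limit:
            break
    return selected
```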

1.5.3 Summarization of Online Conversations in the domain of Debates

To measure the effectiveness of the HAL representation, we used it in the task of debate summarization. Debates are different from chats and casual conversations: they are conducted in a formal manner and usually concern one of two opposing topics. We have used the usual sentence ranking approach to rank dialogue acts (the smallest unit of a debate). Each unit is ranked by a weighted linear combination of its feature vector. The features represent the topic 8 dependency and sentiments of each unit; other superficial features, such as position and coverage, are also used to rank the sentences. Evaluation of the final summaries is performed using

7 www.translate.google.com
8 Refers to the two opposing topics of the debate

ROUGE measures. System summaries are compared against a probabilistic variant of HAL, and the comparison shows that for tasks in which the input text is highly opinion rich, we cannot do away with opinion relevance features.
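A minimal sketch of this weighted linear combination is shown below. The feature names mirror those of Chapter 5, but the weights are placeholders: the actual weights are estimated experimentally.

```python
# Hypothetical feature weights; the thesis estimates these
# experimentally, the values here are placeholders only.
WEIGHTS = {
    "topic_sentiment": 0.30,      # topic directed sentiment score
    "topic_cooccurrence": 0.20,   # topic co-occurrence measure
    "document_relevance": 0.20,
    "sentence_position": 0.15,
    "sentence_length": 0.15,
}

def score_dialogue_act(features):
    """Relevance of a dialogue act as a weighted linear combination
    of its feature values (a dict keyed like WEIGHTS)."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
```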

1.6 Contributions of this work

The following are the contributions made in the process of solving the problem defined earlier:

1. We built an effective generic summarizer which is comparable to the state-of-the-art approaches. The summarizer has the following key contributions:

• Extending the concept combination characteristic of conceptual spaces to define sentences. • Proposing heuristics for combining concepts which underlie this extension, differing from previous approaches. • A novel theoretical framework for summary formation, supported by experimentally estimated parameters.

2. We built a framework, and its underlying steps, to form summaries from multilingual text. The system has the following key contributions:

• Using the system, we studied the effect of language interaction on the summaries. Our results show that the quality of summaries improves as the number of interacting languages increases. • The system methodology differs from previous translation based clustering techniques. • The system has been successfully implemented and studied on three languages, viz. English, Hindi and Telugu.

3. We built a summarization system using our generic summarizer in the domain of debates. The system outperforms the previous state-of-the-art systems. The system has the following key contribution:

• Intermediate features of our summarizer were added to the proven sentiment features, leveraging the final summary quality.

1.7 Thesis Organization

Chapter 2 presents the literature survey on summarization. It first describes the various factors used to classify summarization tasks and presents the types in tabular form. It then describes previous seminal works in the field of generic summarization, followed by relevant methods for summarizing multilingual documents and on-line conversations in the domain of debates. Summary evaluation using ROUGE and Jensen-Shannon divergence is discussed in the final parts of the chapter.

Chapter 3 describes the usage of conceptual spaces for multi-document summarization. It explains the theory behind conceptual spaces, given by Gärdenfors, and its use to represent text. The properties of HAL representations, which help in building sentence representations, are described. Then a conjecture for comparing two summaries, based on the overall sense of the documents they convey, is described. Based on this conjecture we describe our algorithm to create summaries. This is followed by different sets of experiments to estimate system parameters and to compare our system with previous works.

Chapter 4 describes our framework for multilingual multi-document summarization. It explains the notion of added noisy information and the necessity of observing its effect on summaries. Then the system architecture is described, followed by a description of all the generic summarizers used. Following this, we describe the method of forming multilingual summaries using Jensen-Shannon divergence. This is followed by different experiments to understand whether adding (noisy) information from different languages helps, and which summarizer is most suitable for multilingual summarization.

Chapter 5 describes the summarization of on-line conversations in the domain of debates. It describes the approach of forming rank based summaries, where ranking depends on various features, and then describes the calculation of these feature values. In the experimental section of the chapter, the calculation of weights for each feature is described. It also compares our system to previous state-of-the-art systems, and the results show that our system is effective.

Chapter 6 concludes the thesis, explaining the work done and describing the results of the experiments. It discusses the relevance of inferential properties in a representation and their effect on multilingual summaries. It elaborates on the utility of multilingual summarization and on the addition of information from different languages to leverage summary quality. It also provides details of future work with respect to the thesis.

Chapter 2

Related Work

2.1 Types of Summarization

With the advancement of the summarization field, the summary formation process has been classified based on various factors. The following factors are considered important for describing different types of summarization.

• Input factors: text length, number of documents, genre, external query, text language, summary model, text behavior.

• Purpose factors: who the user is, the purpose of summarization.

• Output factors: running text or headed text etc.

Summaries can be classified based on the number of source texts (single text vs. multiple text summarization). If the input contains documents in different languages, it is classified as multilingual summarization, otherwise as monolingual summarization. Based on the availability of a trained summary model, summarization can be classified as supervised vs. unsupervised. New genres of text have appeared, ranging from very short (like Twitter) and short (comments) to longer text (blogs, articles, news, etc.); these are classified based on the language structure of the text sentences (formal vs. informal). Sometimes the text information is updated regularly, like news reporting a month long event; in this case summaries are classified as update vs. static. Depending on the need of the user, an external query can be given for summarization, resulting in the classification of summaries as query dependent vs. query independent. Summaries which contain the same sentences as the source documents are called extracts, whereas summaries containing system generated sentences are called abstracts. Thus, depending upon the sentences in the summary, the summarization task can be classified as abstractive vs. extractive summarization. Table 2.1 describes the different types of summarization resulting from varying input, purpose and output factors.

Sentence Selection
• Extractive: Summaries contain sentences, phrases or words from the original text. The sentences are not modified and are selected based upon their importance to the text.
• Abstractive: An internal semantic representation of the text is built and natural language generation techniques are used to create a summary that is closer to what a human might generate.

Number of Documents
• Single-Document: Summary formation from a single document.
• Multi-Document: Producing a single summary from related source documents. Handling redundant information is a challenge when dealing with multiple documents.

External Query
• Query-Dependent: The query constraint gives the information requirement for the summary. Query dependent methods weight the input text with respect to the query and the final summary contains highly weighted sentences.
• Query-Independent: Usually referred to as generic summarizers. They select sentences based upon their overall importance to the input text.

Input Language
• Mono-Lingual: Input documents are in a single language. Methods are highly efficient and can use deep natural language analysis to form final summaries.
• Multi-Lingual: Input documents are in multiple languages. This is a relatively new field in summarization and maintaining an acceptable level of quality over different languages is a challenge.
• Cross-Lingual: Input criteria are similar to mono-lingual summarization. However, these methods use linguistic information from other languages to leverage the summary quality.

Sentence Structure
• Formal Text: Input documents are news articles, blog articles and formal documents. Input sentences are well-formed and the documents (usually) are self-contained.
• Informal Text: Input documents are social media chats, on-line discussion forums and comment sections. Input sentences are malformed, with heavy to minimal use of slang, colloquial phrases and abbreviations. These methods rely heavily on efficient preprocessing of the input text.

Learning Based
• Supervised: Knowledge models from documents and their corresponding summaries are learned. These methodologies are relatively recent and are developing along with machine learning techniques.
• Unsupervised: Previous summarization results or feedback are not used to create summaries. All summaries are formed from scratch and, once formed, are not used for any other summary formation step.

Document Behavior
• Static: Source information remains unchanged, so a summary once formed remains the same.
• Update: Source information changes as time progresses. The methods must take this change into account and update the summary accordingly. Novelty detection is a challenge in update summarization. Highly useful in the news domain.

Table 2.1 Types of Summarization

2.2 Generic Summarization

2.2.1 Feature based methods

Extractive text summarization uses the sentences of the text to create summaries. Feature based methods use various features to rank the sentences in the given document(s). Over the years, position based and frequency based features have been the most commonly used. The earliest work on summarization, by Luhn [39], used the number of word occurrences and the relative position of keywords within the sentence. The sentence scores reflect the number of occurrences of keywords within a sentence and the linear distance between them due to the presence of non-significant words. Later work [10, 32, 36] added features such as sentence position, topic signature, cue words, date annotation, etc. These features were used to score sentences, and the top sentences were selected for the summary. In MEAD [53], Radev et al. used various sentence level features like sentence length, sentence position and query overlap (if a query is given) using cosine similarity. A set of keywords was extracted from the documents and their occurrence in a sentence was used as a feature. The top sentence of a document was highly valued, and a sentence's similarity to the top sentence (of the document) was added as a feature. Clusters were created and sentences were scored using sentence and inter-sentence features.
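As a concrete illustration of this family of methods, the toy scorer below combines keyword frequency and sentence position, two of the features named above. The weights and the keyword cutoff are illustrative and not taken from any of the cited systems.

```python
from collections import Counter

def score_sentences(sentences, top_k=10):
    """Toy feature-based scorer in the Luhn/MEAD tradition: a sentence
    scores higher when it contains frequent keywords and appears early.
    The 0.8/0.2 weights and top_k cutoff are placeholders."""
    words = [w for s in sentences for w in s.lower().split()]
    keywords = {w for w, _ in Counter(words).most_common(top_k)}
    scores = []
    for pos, sent in enumerate(sentences):
        tokens = sent.lower().split()
        kw_density = sum(t in keywords for t in tokens) / max(len(tokens), 1)
        position = 1.0 / (pos + 1)  # earlier sentences score higher
        scores.append(0.8 * kw_density + 0.2 * position)
    return scores
```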

2.2.2 Graph based methods

Graph based approaches [66, 65, 56] represent text as a graph. Salton et al. [56] apply knowledge of text structure to perform automatic text summarization by passage extraction. They model the intra-document linkage pattern of a text as a graph whose edges are formed using cosine similarity; a greedy graph traversal technique is applied in chronological order to form the summary. In TextRank [44], each document is represented as a graph of nodes that stand for sentences, interconnected by a similarity (overlap) relationship. The overlap of two sentences is simply determined as the number of common tokens between the two sentences, normalized by the lengths of these sentences. Modified graph based ranking algorithms, such as PageRank [50], HITS [29] and PPF [21], are then used to rank the nodes. Motivated by the fact that a document contains various topic themes with varying levels of importance, Cluster-HITS [61] creates topic clusters to identify sentences on the same topics. A bipartite graph is formed between the clusters and the sentences based on cluster-sentence similarity (word overlap). Sentence scores are calculated by applying HITS on the graph, and the top scored sentences are used to form the summary.
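A minimal sketch of the TextRank idea follows, assuming the networkx library for PageRank; the overlap normalization mirrors the description above, while the tokenization is simplified to whitespace splitting.

```python
import itertools
import networkx as nx  # assumed dependency; provides PageRank

def textrank(sentences):
    """TextRank-style ranking: nodes are sentences, edge weights are
    common-token counts normalized by sentence lengths, and PageRank
    scores the nodes. Returns a dict: sentence index -> score."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    token_sets = [set(s.lower().split()) for s in sentences]
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(token_sets[i] & token_sets[j])
        if overlap:
            norm = len(token_sets[i]) + len(token_sets[j])
            g.add_edge(i, j, weight=overlap / norm)
    return nx.pagerank(g, weight="weight")
```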

2.2.3 Lexical chain based methods

Lexical chain based approaches [3, 34, 67] create lexical chains to represent the text. Lexical chains can be formed using available knowledge sources (like WordNet [6]) and other lexical features. Barzilay et al. [3] compute lexical chains in a text by merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger and a shallow parser for the identification of nominal groups. Summarization proceeds in three steps: the original text is first segmented, lexical chains are constructed, and strong chains are identified and significant sentences extracted from the text. Construction of chains is a generative process where edges are created based on semantic sense (given by WordNet) and strength is calculated using inter-sentence distance and frequency of co-occurrence. The strength of a chain is decided by its length and its number of distinct members. Sentences which contain highly weighted chains are selected for the summary, and redundancy is reduced by including all lexical chain members in the summary.

Zhou et al. [67] used lexical chains in a multi-document summarization system, IS SUM. The approach was divided into 4 components: preprocessing, clustering, summarization and compression. The preprocessing step extracts relevant text from XML files; the text is marked with POS tags, words are stemmed and word frequencies are calculated. Clustering is done based on inter-document similarity, computed by combining cosine similarity and phrase similarity. Lexical chains were formed for each cluster and used to create a Document Index Graphic. Chains containing more key-phrases (nouns and verbs) were given higher scores. Once all the chains have been built, the strongest chains of each cluster are selected to create the summary. Compression is achieved by the use of Maximal Marginal Relevance (MMR). Li et al. [34] modified IS SUM, improving its lexical chain algorithm for efficiency, applying WordNet for similarity calculation and adapting it to query-focused multi-document summarization.

2.2.4 Other relevant methods

Redundancy removal has been a big issue in summarization. Carbonell et al. [4] proposed Maximal Marginal Relevance to balance information novelty and importance when creating non-redundant summaries. Another approach is based on latent semantic indexing, in which singular value decomposition (SVD) is used to decompose a term-by-document matrix; the resultant eigenvalues are used to rank the sentences for generic text summarization [17]. A holistic summarizer, HolSum [18], was proposed which starts from an initial summary 1 and then uses a standard hill climbing algorithm to select similar summaries such that the new summary is more similar to the original text. Recently, document summarization based on data reconstruction [20] has been proposed, in which the document is reconstructed by a linear combination of the selected sentences; an optimization function is used to get the sentences that are most informative with minimal redundant information.

1 Lead sentences are selected as the initial summary
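A minimal sketch of the LSA-style ranking idea behind [17] is shown below, assuming a precomputed term-by-sentence count matrix. Scoring sentences by their weighted presence in the top latent dimensions is one common variant of this approach, not necessarily the exact formulation of the cited work.

```python
import numpy as np

def lsa_rank(term_sentence_matrix, k=3):
    """Rank sentences via latent semantic analysis: decompose the
    term-by-sentence matrix with SVD and score each sentence by its
    weight in the top-k latent topics (singular values as weights).
    Returns sentence indices in descending order of importance."""
    u, s, vt = np.linalg.svd(term_sentence_matrix, full_matrices=False)
    k = min(k, len(s))
    # each column of vt corresponds to a sentence; weight latent
    # dimensions by their singular values
    scores = np.sqrt((s[:k, None] ** 2 * vt[:k] ** 2).sum(axis=0))
    return np.argsort(-scores)
```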

2.2.5 HAL based methods

Earlier uses of HAL have been primarily in query dependent summarization tasks. Motivated by the observation that metrics based on key-concept overlap give better results than metrics based on n-gram and sentence overlap, Jagadeesh et al. [26] combined relevance based language modeling, Latent Semantic Indexing and special words to create summaries. The relevance of a sentence to a query was calculated by adding the HAL scores between words in the sentence and words in the query. In later work [27], features based on sentence importance, independent of the query, were added. These features were calculated using external documents extracted from the web using the given query; the addition resulted in further improvement of system performance. Ma et al. [41] score sentences through the importance of their words and use a modified MMR technique to adjust the score of the candidate sentence. Word importance is decided by its query dependent score and its topic related score, calculated from HAL scores and likelihood with respect to query terms respectively. He et al. [19] use a similar approach to produce summaries of relevant documents acquired from user-feedback information and transductive-inference SVM machine learning. Morita et al. [46] use a HAL-like approach to generate query dependent summaries: a co-occurrence graph is built to obtain words that augment the original query terms and enrich the information need, and the summarization problem is then formulated as a Maximum Coverage Problem with Knapsack Constraints based on word pairs rather than single words.

All these approaches focus on query dependent summarization, where summaries are influenced by the query: HAL scores are used to compute the relevance of words to the query words, and sentences containing highly relevant words are selected to form summaries. Our work on generic summarization differs from these approaches entirely, because we do not have a query with which to formulate a summary. We have used the HAL representation for its inferential properties, deviating from previous uses which employ it to calculate query relevance. However, for later work on on-line conversations in the debate domain we made use of HAL in a similar manner.

2.3 Multilingual Summarization

Summarization for languages other than English has been done for Scandinavian languages [7] and in the SUMMARIST project [23], which includes Indonesian. Both systems implement various language independent features such as keywords (calculated automatically), term frequency, position and special text elements to score and rank sentences. SUMMARIST also employs an optimal position policy, where positional scores were generated using a set of documents and their pseudo-ideal summaries. Top ranked sentences are used to form the summary. The Keizai system [49] is a Cross Language Text Retrieval system with summarization as a feature that gives the user an overall comprehension of a document. The system produces summaries in Japanese and Korean using statistical and symbolic techniques; these summaries are translated to English and both versions are displayed to the user.

Saggion et al. [54] used an English-Arabic alignment table to translate documents and then used centroid-based sentence extraction techniques to form summaries; the final output contains sentences in English only. Columbia Newsblaster [12] performs multi-lingual summarization by translating the documents and then using clustering based methods to generate summaries. They focus upon the quality of summarization systems for a single language by shifting the majority of the multi-lingual knowledge burden to a specialized machine translation system. Similar work has been done by Chen and Lin [6], who perform multi-lingual news summarization in Chinese and English. Our method differs from previous work because the added noisy (translated) data affects the quality of the summary. Thus, even though the summaries are generated in a given language, they contain information from other languages. This means that the score of a sentence is calculated with respect to all the sentences from all the languages.

2.4 Summarization of Online Conversations in the domain of Debates

In the context of summarizing on-line conversations, which are rich in opinions, identification of opinion containing sentences is important. Sentence relevance is further decided by sentiment scores, topic relevance and other lexical and positional features. Earlier works mainly focused on reviews [51, 24, 48] and used lexical features (unigrams, bigrams and trigrams), part-of-speech tags and dependency relations.

Ku et al. [31] performed opinion summarization in the news and blog domains. They propose opinion extraction at word, sentence and document level. For each new word, the distribution of its (Chinese) characters as positive and negative polarity in a manually created seed vocabulary is used to determine the sentiment of the word. These scores are compounded to compute sentence scores and then document scores. The presence of negation operators decides the sentiment tendency at sentence level, which further propagates to document level. Wang et al. [62] performed opinion summarization on conversations. They used a linear combination of features from different aspects, including topic relevance, subjectivity and sentence importance, to score sentences. They also proposed a graph based method which incorporates topic and sentiment information, as well as additional information about sentence-to-sentence relations extracted from dialogue structures.

Summarization in the specific domain of on-line debates is a novel field. This domain differs from chatting and conversation because it is more formal and focuses on specific topics. An argument may contain various pieces of factual knowledge, but they are usually related to one or the other topic. Similarly, it differs from news and blogs because it is comparatively richer in sentiment. Therefore, summarization by opinion mining in debates is an interesting and challenging task.

2.5 Summary Evaluation

Evaluation of summaries is necessary and advantageous to automatic summarization, as with other language understanding technologies: it can foster the creation of reusable resources and infrastructure; it creates an environment for comparison and replication of results; and it introduces an element of competition to produce better results [42]. System evaluation is done in two ways, intrinsic and extrinsic. Intrinsic evaluation assesses the summarization system internally, whereas extrinsic evaluation assesses the utility of the summarization system on a real world task, such as reading comprehension (answering a set of questions after reading the summary) or relevance assessment (evaluating whether the relevance of a summary to a given topic is the same as that of the source document) [45]. Both intrinsic and extrinsic evaluation are necessary and serve different purposes. Intrinsic evaluation is done to improve the system's accuracy and polish its results; extrinsic evaluation is needed to understand the extent to which the system is able to accomplish a task involving summarization.

Various methods have been proposed in both directions, involving different degrees of human effort. There is a trade-off between the amount of human work and the effectiveness of the evaluation measure. Effort has been put into automating the human effort or replicating human evaluation measures, thereby making the evaluation process less expensive. We present two evaluation measures: ROUGE [35], which stands for Recall Oriented Understudy for Gisting Evaluation, and Jensen-Shannon divergence [38]. The first method uses human reference summaries to evaluate the system and is therefore expensive. However, it is effective and has been used extensively to evaluate results in the TAC conferences. The second measure involves no human effort and evaluates the summary based on its divergence from the input set of documents. This measure is useful for evaluating multi-lingual summarization, where manual evaluation requires more effort as well as skillful annotators.

2.5.1 ROUGE

ROUGE [35] stands for Recall Oriented Understudy for Gisting Evaluation. It includes measures that automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units, such as n-grams, word sequences and word pairs, between the computer-generated summary to be evaluated and the ideal human summaries. We describe the ROUGE-N, ROUGE-L and ROUGE-SU* measures used for evaluation purposes in our work.

2.5.1.1 ROUGE-N

This measure is an n-gram recall between a candidate summary and a set of reference summaries:

ROUGE-N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)} \qquad (2.1)

Where n stands for the length of the n-gram gram_n, and Count_{match}(gram_n) is the maximum number of n-grams co-occurring in a candidate summary and the reference summaries.
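A minimal sketch of Equation 2.1, assuming whitespace tokenization; the clipping of matched counts follows the Count_{match} definition above.

```python
from collections import Counter

def rouge_n(candidate, references, n=2):
    """ROUGE-N as in Eq. 2.1: clipped n-gram matches against each
    reference, divided by the total n-gram count in the references."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand = ngrams(candidate)
    matches = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        total += sum(ref_counts.values())
        # each n-gram's match count is clipped by its candidate count
        matches += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return matches / total if total else 0.0
```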

2.5.1.2 ROUGE-L

This measure uses the longest common subsequence (LCS) between the candidate summary and a reference summary to estimate the similarity between the two. It effectively captures sentence level structure. Let X be the reference summary sentence of length m and Y be the candidate summary sentence of length n. Then Recall, Precision and F-measure are calculated in the following manner:

R = \frac{LCS(X, Y)}{m} \qquad (2.2)

P = \frac{LCS(X, Y)}{n} \qquad (2.3)

F = \frac{2RP}{R + P} \qquad (2.4)

Where LCS(X, Y) is the longest common subsequence between X and Y.
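Equations 2.2-2.4 can be computed with the standard dynamic-programming LCS; a minimal sketch, assuming whitespace tokenization:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists,
    via standard dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """Eqs. 2.2-2.4: LCS-based recall, precision and F-measure."""
    x, y = reference.lower().split(), candidate.lower().split()
    if not x or not y:
        return 0.0
    lcs = lcs_length(x, y)
    r, p = lcs / len(x), lcs / len(y)
    return 2 * r * p / (r + p) if r + p else 0.0
```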

2.5.1.3 ROUGE-SU*

This measure uses the skip-bigram 2 count between the candidate summary and a reference summary to estimate the similarity between the two. It is sensitive to word order without requiring consecutive matches. Unigram matches are also included in this measure, to give credit to a candidate sentence even if it has no word pair co-occurring with its reference. Recall, Precision and F-measure are calculated in the following manner:

R = \frac{SKIP2(X, Y) + Count_{match}(unigram)}{C(m, 2) + m} \qquad (2.5)

P = \frac{SKIP2(X, Y) + Count_{match}(unigram)}{C(n, 2) + n} \qquad (2.6)

F = \frac{2RP}{R + P} \qquad (2.7)

Where SKIP2(X, Y) is the number of skip-bigram matches between X and Y, and C(m, 2) and C(n, 2) are the total skip-bigrams in the two sentences. Spurious matches are reduced by limiting the maximum skip distance d_{skip} between two in-order words; the denominators of 2.5 and 2.6 are calculated accordingly. In TAC (DUC), d_{skip} is set to 4 and is deemed sufficient to capture summary similarity reliably. For our work, we have used F-scores for evaluation purposes, as they represent both the precision and recall aspects, for different matches: unigram (ROUGE-1), bigram (ROUGE-2), longest subsequence (ROUGE-L) and skip-bigram with unigram (ROUGE-SU*).

2 Skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps.
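A sketch of the skip-bigram counting underlying Equations 2.5-2.6, with d_skip = 4 as in TAC/DUC; the helper names are illustrative.

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, d_skip=4):
    """All in-order word pairs whose gap is at most d_skip."""
    return Counter(
        (a, b)
        for (i, a), (j, b) in combinations(enumerate(tokens), 2)
        if j - i <= d_skip
    )

def skip2_matches(x, y, d_skip=4):
    """SKIP2(X, Y): clipped skip-bigram matches used in Eqs. 2.5-2.6."""
    sx, sy = skip_bigrams(x, d_skip), skip_bigrams(y, d_skip)
    return sum(min(c, sy[g]) for g, c in sx.items())
```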

2.5.2 Jensen-Shannon divergence

Louis et al. [38] evaluated various measures for content selection evaluation in summarization that do not require the creation of human model summaries. Three different types of measures were studied, as follows:

1. Distributional Similarity based: Based on the assumption that good summaries are characterized by low divergence between probability distribution of words in the input and summary, and by high similarity with the input. Experiments were done using Kullback Leibler Divergence, Jensen Shannon Divergence and cosine similarity measures.

2. Summary Likelihood: Likelihood of a word appearing in the summary is approximated as being equal to its probability in the input. Summary’s unigram probability and probability under a multinomial model were calculated.

3. Use of Topic words in the summary: Summaries containing topic signatures during content se- lection have been usually considered better. Thus, coverage based and common topic signatures were calculated between summary and the input.

The results showed that Jensen-Shannon divergence, which measures the word-distribution dissimilarity between the summary and the input, performed best among all the measures. Saggion et al. [55] further studied JS divergence and found a positive medium-to-strong correlation between the system rankings produced by ROUGE and those produced by divergence measures that do not use model summaries. The JS divergence between two probability distributions P and Q is given by,

J(P||Q) = \frac{1}{2}\left[D(P||A) + D(Q||A)\right] \qquad (2.8)

where,

D(P||Q) = \sum_w p_P(w) \log_2 \frac{p_P(w)}{p_Q(w)}

A = \frac{P+Q}{2}

Here, P represents the summary and Q represents the input documents to which the summary is compared. Furthermore, p_P(w) and p_Q(w) represent the probability of word w in P and Q respectively.
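To make the measure concrete, here is a minimal Python sketch of Equation 2.8 over unigram distributions; unsmoothed maximum-likelihood estimates are an assumption on our part.

```python
import math
from collections import Counter

def js_divergence(summary_tokens, input_tokens):
    """Jensen-Shannon divergence (Eq. 2.8) between the word distributions of
    a summary (P) and its input (Q). Zero-probability terms contribute
    nothing to the corresponding KL part, so the value is always defined."""
    p, q = Counter(summary_tokens), Counter(input_tokens)
    n_p, n_q = sum(p.values()), sum(q.values())
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p[w] / n_p, q[w] / n_q
        aw = (pw + qw) / 2                        # A = (P + Q) / 2
        if pw:
            js += 0.5 * pw * math.log2(pw / aw)   # contributes to D(P||A)
        if qw:
            js += 0.5 * qw * math.log2(qw / aw)   # contributes to D(Q||A)
    return js
```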

2.6 Concluding Remarks

Throughout the literature survey we came across various methods to summarize text. The methods depend upon the type of summarization problem they try to solve; summaries are classified by input, purpose and output factors. For example, query-dependent summarization techniques use the given query to select relevant sentences. Other approaches exploit inherent characteristics of the input text: monolingual summarization has many approaches that make use of language-dependent morphological and dependency features (especially lexical-chain-based methods), and some recent methods [57] use conditional random fields (CRFs) built upon part-of-speech tags of the text to construct summarization models. In the context of multilingual summarization, methods use translation, transliteration or both before summarizing the documents. Most approaches follow a cluster-and-map strategy to build summaries; work has been done on Scandinavian, Chinese and Arabic languages. The field is relatively new and depends upon the quality of translation systems, so improvements in translation systems will directly benefit it. Another issue with multilingual summarization is evaluation: manual creation of multilingual summaries is a very expensive process and requires highly skilled annotators. Nevertheless, multilingual summarization is necessary to build summarization systems that can account for the information distributed among different languages.

Chapter 3

Multi-Document Summarization Using Conceptual Spaces

3.1 Motivation of our Approach

Usually, summarization approaches are based on heuristics frequently observed in the data. Heuristics are patterns such as: similar sentences contain similar words; top (position-wise) sentences are more important to the summary; highly frequent words are more important; etc. However, representing a sentence with features like position, frequency of constituent words and length cannot capture its meaning. Furthermore, sentence similarity based on word overlap conveys statistical rather than semantic similarity. These issues can be handled effectively if the representation contains contextual and inferential information. Earlier, features like latent dimensions [9] and lexical chains were used to uncover word-word and word-document similarity. However, latent dimensions cannot be explained in a symbolic sense, while lexical chains depend on the availability of reliable and efficient knowledge sources; the development of such resources is itself a major research problem in many languages.

Heuristics reflect a thought process, and the high occurrence of such patterns across multiple texts justifies the existence of a corresponding cognitive process. However, justifying its existence does not provide insight into the actual cognitive process. For example, if we come across a textual entity of the form "1 + 1 = 2", heuristics can identify similar patterns like "x + y = z". The pattern suggests that "+" represents an existing but undefined process. In reality, "+" here represents accumulation, which cannot be identified using heuristic features. Therefore we need to imbibe cognitive knowledge in our representation to bring out the underlying meaning of a pattern. Incorporating such knowledge could prove helpful in summarization, because we can then select sentences which are more meaningful for the summary. The selected sentences are inferentially rich, giving a better understanding of what the documents are all about.

3.2 Text Representation Overview

We use the HAL representation, which represents text in a highly meaningful manner. In this section we briefly describe the algorithm to form HAL vectors, using the sentence below as a running example.¹

“The aggressive identification and treatment of HIV-infected intravenous drug users with latent tuberculous infection is therefore of both clinical and public health importance,” wrote Dr. Peter A. Selwyn of Montefiore Medical Center in New York.

Step 1: A co-occurrence matrix is formed for the complete dataset (Section 3.3.2). In this matrix every row represents a single vector; the rows relevant to the example sentence are shown below.²

aggressive [10] people: 5.0, identification: 5.0, workers: 4.0, treatment: 4.0, care: 3.0, hiv: 3.0, health: 2.0, infected: 2.0, live: 1.0, intravenous: 1.0
identification [10] aggressive: 5.0, treatment: 5.0, people: 4.0, hiv: 4.0, workers: 3.0, infected: 3.0, care: 2.0, intravenous: 2.0, drug: 1.0, health: 1.0
treatment [62] tuberculosis: 26.0, drug: 12.0, cases: 12.0, americans: 12.0, mandatory: 10.0, undergo: 8.0, providing: 8.0, people: 7.0, tb: 7.0, patients: 7.0
hiv [76] infected: 28.0, tb: 28.0, virus: 20.0, percent: 17.0, people: 16.0, aids: 13.0, carried: 12.0, twenty: 9.0, drug: 9.0, causes: 9.0
infected [121] tb: 34.0, people: 34.0, million: 29.0, aids: 28.0, tuberculosis: 28.0, hiv: 28.0, bacteria: 28.0, virus: 19.0, whites: 15.0, americans: 15.0
intravenous [10] drug: 5.0, infected: 5.0, hiv: 4.0, users: 4.0, treatment: 3.0, latent: 3.0, identification: 2.0, tuberculous: 2.0, aggressive: 1.0, infection: 1.0
drugs [24] six: 9.0, months: 8.0, developing: 7.0, countries: 7.0, compliance: 7.0, major: 7.0, combat: 5.0, cost: 5.0, person: 5.0, tuberculosis: 5.0
users [18] drug: 10.0, study: 6.0, latent: 5.0, methadone: 5.0, tuberculous: 4.0, conducted: 4.0, intravenous: 4.0, program: 4.0, infection: 3.0, infected: 3.0
latent [18] percent: 7.0, drug: 6.0, users: 5.0, tuberculous: 5.0, tb: 5.0, americans: 5.0, infections: 4.0, infection: 4.0, clinical: 3.0, intravenous: 3.0
tuberculous [10] latent: 5.0, infection: 5.0, users: 4.0, clinical: 4.0, drug: 3.0, public: 3.0, intravenous: 2.0, health: 2.0, infected: 1.0, importance: 1.0
infection [104] tuberculosis: 12.0, tb: 12.0, whites: 12.0, aids: 10.0, stead: 10.0, drug: 8.0, blacks: 8.0, hiv: 8.0, nursing: 8.0, diseases: 8.0
clinical [10] infection: 5.0, public: 5.0, health: 4.0, tuberculous: 4.0, latent: 3.0, importance: 3.0, users: 2.0, wrote: 2.0, drug: 1.0, dr: 1.0
public [25] health: 15.0, officials: 8.0, clinical: 5.0, states: 5.0, tuberculosis: 5.0, trend: 5.0, united: 4.0, infection: 4.0, epidemic: 4.0, importance: 4.0
health [162] officials: 34.0, tuberculosis: 32.0, department: 30.0, aids: 17.0, public: 15.0, epidemic: 13.0, dr: 11.0, city: 11.0, federal: 10.0, year: 9.0, workers: 9.0
importance [10] health: 5.0, wrote: 5.0, public: 4.0, dr: 4.0, clinical: 3.0, peter: 3.0, infection: 2.0, selwyn: 2.0, tuberculous: 1.0, montefiore: 1.0
wrote [10] importance: 5.0, dr: 5.0, health: 4.0, peter: 4.0, public: 3.0, selwyn: 3.0, clinical: 2.0, montefiore: 2.0, infection: 1.0, medical: 1.0
dr [110] tuberculosis: 15.0, health: 11.0, george: 10.0, director: 10.0, commissioner: 9.0, myers: 9.0, whites: 8.0, jr: 6.0, blacks: 5.0, william: 5.0
peter [10] dr: 5.0, selwyn: 5.0, wrote: 4.0, montefiore: 4.0, importance: 3.0, medical: 3.0, health: 2.0, center: 2.0, public: 1.0, york: 1.0
selwyn [10] peter: 5.0, montefiore: 5.0, dr: 4.0, medical: 4.0, wrote: 3.0, center: 3.0, importance: 2.0, york: 2.0, people: 1.0, health: 1.0
montefiore [10] selwyn: 5.0, medical: 5.0, peter: 4.0, center: 4.0, dr: 3.0, york: 3.0, people: 2.0, wrote: 2.0, importance: 1.0, carry: 1.0
medical [29] montefiore: 5.0, center: 5.0, legislative: 5.0, submitted: 5.0, beds: 5.0, wards: 5.0, welfare: 4.0, selwyn: 4.0, proposal: 4.0, patients: 4.0
center [19] tuberculosis: 5.0, medical: 5.0, york: 5.0, disease: 5.0, informed: 5.0, people: 4.0, lung: 4.0, montefiore: 4.0, diena: 4.0, increase: 3.0
york [46] city: 16.0, tuberculosis: 15.0, cases: 12.0, aids: 8.0, tb: 8.0, united: 6.0, medicine: 5.0, blacks: 5.0, people: 5.0, population: 5.0

Step 2: Next, we perform vector addition over all the words of the sentence (Section 3.3.3), which yields the sentence vector shown below:

1 DUC-2001 dataset d15c, DOC NO: AP890302-0063
2 The rows shown are those relevant to the example sentence.

< (tuberculosis: 110.000), (tb: 82.000), (people: 68.000), (aids: 65.000), (hiv: 64.000), (drug: 54.000), (health: 45.000), (officials: 42.000), (americans: 41.000), (infection: 41.000), (infected: 38.000), (department: 38.000), (million: 38.000), (percent: 38.000), (public: 37.000), (bacteria: 37.000), (treatment: 35.000), (dr: 34.000), (latent: 34.000), (users: 33.000), (cases: 31.000), (tuberculous: 30.000), (study: 29.000), (clinical: 29.000), (whites: 27.000) >

Step 3: The resultant sentence vectors are normalized across the dimensions (words) over all the sentences, and summaries are formed from the resulting normalized sentence vectors. Some words are shown in boldface so that readers can relate the representation to the sentence; the decimals show the weight given to each word.
< (peter: 0.192), (tuberculous: 0.156), (clinical: 0.156), (wrote: 0.144), (importance: 0.144), (intravenous: 0.118), (selwyn: 0.098), (users: 0.084), (latent: 0.074), (alcohol: 0.064), (identification: 0.062), (philip: 0.059), (san: 0.052), (worried: 0.051), (aggressive: 0.049), (illness: 0.045), (day: 0.045), (panel: 0.045), (multiple: 0.043), (diseases: 0.043), (francisco: 0.042), (eliminate: 0.042), (public: 0.041), (rehabilitation: 0.041), (montefiore: 0.038) >

Observe that some words appear in the representation even though they are not in the sentence. All these words are important to the context of the sentence and help us infer useful information about it. High weight is given to the words “peter”, “selwyn” and “montefiore”, representing the speaker of this sentence. This is followed by “tuberculous”, which outlines the type of infection, and the word “alcohol”, which is equally harmful as drugs (suggested by another line from the same text). A sense of importance is conveyed by the words “clinical” and “importance”, while “aggressive” highlights the intensity of the measures required in this case. Note that “philip”, “san” and “francisco” also have high weights; this is because d15c (the set of documents) contains a sentence with a similar sense, spoken by Dr. Philip C. Hopewell of San Francisco. The subsequent Section 3.3 describes the theory, properties and creation of conceptual spaces in detail; Section 3.4 describes the summary formation process from these vectors.

3.3 Conceptual spaces as a representative model

3.3.1 Gärdenfors' Conceptual Spaces

Conceptual space is one of the three levels of the cognitive model proposed by Gärdenfors [15]. According to this model, cognitive representation can occur at three levels: symbolic, connectionist³ and conceptual. Symbolic representation tends to view every process as symbol manipulation that can be modeled by Turing machines. Connectionist representation focuses on associations between elements. Conceptual representation derives sense from the geometrical structure of its elements. The overall relation among the three representations can be understood as follows.

3Connectionism is a special case of associationism which is modeled using artificial neural networks

Given any representation, the symbolic level draws out the characteristics and functioning of all the symbolic entities; each symbol is then connected to the others at the connectionist level; and this geometrical composition is used by the conceptual representation to infer sense, adding meaning to the complete representation.

In more abstract terms, a conceptual space CS consists of a class of quality dimensions D_1, D_2, ..., D_n. A quality dimension refers to a characteristic property that is important for describing information uniquely. A point in CS is represented by a vector v = < d_1, d_2, ..., d_n >, with one index for each dimension [15]. Within this space, a concept is defined as a convex region, meaning that all the points present in this region represent the same semantic sense with minor variations. HAL has previously been used successfully to create conceptual spaces [58], which motivated us to use HAL space to build the conceptual space from the documents. HAL is a representational model of semantic memory based on the idea that, when humans encounter a new concept, they derive its meaning from accumulated experience; that is, the meaning of a concept can be acquired through its usage with other concepts within the same context [40]. Throughout the following text we use the words "dimension" and "concept" interchangeably, because the HAL space is defined by the words of the documents as its dimensions and each such word represents a concept in itself (defined in this space). So "dimension" refers to the role of a word acting as a building block, whereas "concept" refers to the role that defines its own meaning in the document.

3.3.2 Forming Conceptual Spaces using HAL

Given a lexicon of n words, HAL is an n × n co-occurrence matrix in which each element contains the cumulative co-occurrence score between two words. The cumulative co-occurrence score is obtained by accumulating the scores between the two words over the whole document while moving a window of size K. The co-occurrence score between two words at a distance k is the product of (K − k + 1) and the frequency of their occurrence at distance k. Thus, the cumulative co-occurrence score between two words over the complete set of documents is given by,

Score(w_i|w_j) = \sum_{k=1}^{K} n_k \cdot (K - k + 1) \qquad (3.1)

where n_k is the frequency of occurrence of w_i and w_j at a distance k. HAL is direction sensitive, as the co-occurrence information for words preceding each word and for words following each word is recorded separately by row and column vectors [59]; thus, the dimensionality of each word is 2n. Similar to [58], we do not consider the direction sensitivity of word pairs: the row and column vectors are merged into one, reducing the dimensionality of each vector to n. Within HAL space, a concept is defined as a weighted vector,

c_i = < wt_{i1}, wt_{i2}, ..., wt_{in} >

Figure 3.1 Concept combination in a 3-dimensional conceptual space, where the combined concept is more refined.

where wt_{ij} is the weight of the concept c_i along dimension d_j. The weight shows the strength of the contextual similarity that exists between c_i and the concepts representing the dimensions in the documents. Consider the following example of the concept vector tuberculosis.

tuberculosis: < (cases: 108.0), (aids: 72.0), (people: 59.0), (bacteria: 42.0), (active: 41.0), (disease: 40.0), (health: 32.0), (risk: 29.0), (infected: 28.0), (percent: 27.0), (year: 27.0), (treatment: 26.0), (number: 25.0), (epidemic: 24.0), (united: 23.0), (tb: 23.0), (virus: 23.0), (case: 23.0), (vermund: 22.0), (reported: 21.0), (states: 19.0), (patients: 18.0), (years: 17.0), (morbidity: 17.0), (control: 17.0) >

It can be observed that tuberculosis is a bacterial health disease, indicated by the relatively high weights of “bacteria”, “disease” and “health”. Moreover, “cases”, “aids” and “people” are given higher weights because the documents⁴ talk about an increase in tuberculosis cases due to AIDS. This shows that HAL preserves the contextual meaning of a word along with its conceptual meaning and efficiently captures the inferential characteristics of the documents. As a result, HAL vectors form an effective implementation of a conceptual space.
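The construction just described can be sketched in a few lines of Python. The folded (direction-insensitive) form is used, tokens are assumed to be the stopword-filtered words of the documents, and all names are illustrative rather than the thesis implementation.

```python
from collections import defaultdict

def build_hal(tokens, K=6):
    """Folded HAL matrix per Equation 3.1: a pair of words at distance k
    (k <= K) contributes K - k + 1; preceding and following contexts are
    merged, so each word vector has n dimensions."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for k in range(1, K + 1):
            if i + k < len(tokens):
                c = tokens[i + k]
                hal[w][c] += K - k + 1   # w ... c at distance k
                hal[c][w] += K - k + 1   # fold: record both directions
    return hal
```

Run over the d15c documents, such a matrix would yield rows comparable to the tuberculosis vector above, up to the weighting details of the original implementation.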

3.3.3 Sentences in Conceptual Space

An important characteristic of conceptual space is the ability to define new concepts by combining existing ones. Figure 3.1 shows the effect of concept combination in conceptual space: the combined concept envelopes more space, meaning the new concept is more meaningful than the concepts composing it. This allows the formation of various meaningful concepts within the domain of our space. An implementation of this characteristic was proposed as a 4-step heuristic approach [58], described below.

In this approach, two concepts c_1 and c_2 are combined by first re-weighting so that higher weights are assigned to the dimensions of the dominant concept⁵ (assume c_1 here). Then, common dimensions are strengthened by a factor greater than 1; strengthening ensures that a common dimension has a better chance of becoming a quality dimension of the resultant concept. Finally, the two concepts are composed

4 DUC2001: d15c
5 A dominant concept is the more significant amongst the two concepts.

together to form the resulting concept c_1 ⊕ c_2:

wt_{(c_1 ⊕ c_2)i} = wt_{1i} + wt_{2i}

Finally, the vector c_1 ⊕ c_2 is normalized so that combined concepts can be compared at the same level. Our approach has similar motivations, but we differ in our heuristics. The following are our underlying heuristic factors.

1. The heuristic of the dominant concept is not used in concept combination, because all concepts are considered equivalent to each other. This keeps the summary unbiased towards any concept based on an initial judgment.

2. Strengthening of overlapping dimensions is not done, because concept combination takes place among words occurring relatively close together. In a given context, close terms are usually used together; as a result they share many common dimensions, which automatically get strengthened after combination.

3. Instead of normalizing each concept across its dimensions, we normalize each dimension over all the vectors. This serves two purposes:

(a) The weight for every dimension is scaled to the range [0, 1], making the weights of different dimensions comparable to each other.
(b) The concepts (in this case sentences) are now represented as a left-stochastic matrix, in which all columns sum to 1. Thus, our documents are now represented by a fixed point in the HAL space where all the dimensions have value 1.

Based upon the above heuristics, we create sentence vectors using the following steps:

Step 1: Given a sentence s_i = w_1, w_2, w_3, ..., w_{l_i} with l_i words⁶, the composition of all the words yields the representation of the sentence in the conceptual space:

s_i = w_1 ⊕ w_2 ⊕ w_3 ⊕ ... ⊕ w_{l_i} \qquad (3.2)

Step 2: Sentence vectors are normalized along the dimensions as follows:

wt_{ij} = \frac{wt_{ij}}{\sum_{k=1}^{m} wt_{kj}}, \quad \forall j \in \{1, ..., n\} \qquad (3.3)

where i denotes the i-th sentence and m is the number of sentences in the documents. The resultant sentence vector encapsulates the inherent meaning and context of the words composing it.

6After removing stopwords.
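Steps 1 and 2 can be sketched in Python, assuming a HAL matrix in the dict-of-dicts form of the earlier sketch and NumPy for the column normalization; the names are illustrative.

```python
import numpy as np

def sentence_vectors(sentences, hal, vocab):
    """Compose each sentence by adding the HAL vectors of its words (Eq. 3.2),
    then normalize every dimension over all sentences (Eq. 3.3), making the
    sentence-by-word matrix left stochastic (each column sums to 1)."""
    index = {w: j for j, w in enumerate(vocab)}
    S = np.zeros((len(sentences), len(vocab)))
    for i, sent in enumerate(sentences):          # stopword-free token lists
        for w in sent:
            for c, wt in hal.get(w, {}).items():  # combination = addition
                if c in index:
                    S[i, index[c]] += wt
    col_sums = S.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                 # guard unused dimensions
    return S / col_sums
```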

This can be examined through the following example sentence and its vector in the conceptual space.⁷
“The doctors warned that besides being at risk of getting tuberculosis themselves, AIDS-infected addicts who carry the TB bacteria also may pass the germs to people they live with, to health care workers and other people.”

< (germs: 0.106), (pass: 0.101), (live: 0.099), (special: 0.097), (care: 0.094), (aggressive: 0.077), (intervention: 0.065), (montefiore: 0.064), (identification: 0.062), (minority: 0.060), (administer: 0.060), (contracting: 0.055), (myers: 0.054), (strategies: 0.054), (showing: 0.053), (scared: 0.052), (symptoms: 0.051), (selwyn: 0.049), (largely: 0.049), (review: 0.049), (added: 0.048), (elderly: 0.048), (capacity: 0.048), (warned: 0.047), (died: 0.047) >

In the above example, the words “aggressive”, “special”, “contracting” and “intervention”⁸ do not occur in the sentence, yet they are weighted highly in the vector. We observe this because, in the context of the documents, the sentence suggests that AIDS-infected people having TB should be given special care; so special measures, like intervention, should be taken by the officials to stop the disease from spreading further. This shows that a sentence vector obtained by combination has inferential characteristics. These observations mirror those for a single-word concept; as a result, a sentence vector can be interpreted as a concept in the HAL space. Concepts obtained by combination are more refined and are capable of disambiguating between multiple contexts. These two properties of combined concepts have been shown in [58] by vertical and horizontal tests respectively. Here, we show the effect of combination on a sentence concept with respect to these properties. Consider the following two sentences and their vectors.

1. Tuberculosis is caused by a bacterium that commonly affects the lungs but can attack almost any organ.9 < (affects: 0.500), (attack: 0.281), (organ: 0.167), (organs: 0.129), (commonly: 0.110), (bacterium: 0.059), (lungs: 0.039), (caused: 0.036), (attacks: 0.029), (majority: 0.024), (preventable: 0.023), (vast: 0.023), (long: 0.012), (last: 0.011), (transmitted: 0.011), (communicable: 0.009), (ill: 0.009), (crowded: 0.009), (decades: 0.008), (air: 0.007), (workers: 0.007), (poor: 0.006), (highly: 0.003), (infection: 0.003), (sick: 0.003) >

7 DUC-2001 dataset d15c, DOC NO: AP890302-0063
8 There are others, but we emphasize these because of their higher weights.
9 DUC-2001 dataset d15c, DOC NO: AP900521-0063

2. The disease, which attacks the lungs, has long been associated with poor, crowded living conditions.¹⁰
< (poverty: 0.17), (history: 0.17), (crowded: 0.15), (ravaged: 0.143), (living: 0.138), (vengeance: 0.13), (affects: 0.125), (famous: 0.097), (opportunistic: 0.094), (attack: 0.094), (conditions: 0.091), (housing: 0.091), (attacks: 0.088), (cancer: 0.088), (organs: 0.081), (blamed: 0.074), (shortcomings: 0.067), (illness: 0.057), (organ: 0.056), (back: 0.053), (long: 0.049), (combination: 0.049), (socioeconomic: 0.048), (research: 0.048), (europe: 0.044) >

Sentence 1 talks about tuberculosis, its cause and the affected body parts. This is evident from the sentence vector, where “affects”, “organs”, “bacterium”, “lungs” and “caused” are highly weighted. Further observe that the concepts “transmitted”, “communicable” and “infection” are also weighted highly, which tells us about the nature of the disease. This shows that concept combination has enriched the sentence with new information, and we obtain a refined representation of the sentence. Similarly, sentence 2 also talks about tuberculosis (though the word itself does not occur in the sentence), the affected parts, and factors which are socioeconomic in nature. This is evident from the high weights of “illness”, “organs”, “poverty”, “crowded”, “living”, “housing”, “conditions” and “socioeconomic”. We notice that both sentences talk about a common disease and its affected parts; moreover, the first sentence talks about the cause and the other about socioeconomic factors. The distinction can be observed in their respective vectors, where one weighs “bacterium” high whereas the other weighs “poverty” high; it can also be seen in the weight of “poor” in sentence 1, which is relatively low. From the above we can conclude that a sentence vector obtained by concept combination has the following properties:

1. It is a concept in the constructed conceptual space encapsulating all the inferential characteristics of the concepts composing the sentence.

2. It is highly enriched, which provides more depth to the meaning of the concept.

3. It has a sense of uniqueness and can disambiguate itself from a similar sentence in the given context.

Next, we describe our underlying principle for forming summaries in the conceptual space, based on which we propose two metrics and a redundancy-removal technique to realize the summaries.

3.4 Conceptual Multi-Document Summarization (CMDS)

This section describes the construction of the CMDS system. A schematic overview of the system is shown in Figure 3.2.

10DUC-2001 dataset d15c, DOC NO: AP900215-0031

Figure 3.2 Schematic overview of the complete system

Figure 3.3 Representation of documents and summary in a 3-dimensional conceptual space.

3.4.1 Principle

For a summary S containing l sentences, we define W_{Sj} for the j-th dimension of S as,

W_{Sj} = \sum_{i=1}^{m} \alpha_i \, wt_{ij}

\alpha_i = \begin{cases} 1 & \text{if sentence } i \text{ is present in } S \\ 0 & \text{otherwise} \end{cases}

Then we use the characteristics of the sentence representation to propose the following conjecture for forming a summary.

Conjecture 1. A summary S, however concise it may be, can provide the maximum overview of the documents if it contains those sentences which maximize W_{Sj} for the maximum number of dimensions, given that all the concepts are treated uniformly.

Let S and S′ be collections of l sentences. Let N_S be the number of dimensions for which W_{Sj} > W_{S′j}, and N_{S′} the number of dimensions for which W_{Sj} < W_{S′j}. Suppose N_{S′} > N_S. Since all the concepts are treated equally (given), S′ provides a better overview of the text than S, because S′ contains text that gives more information by maximizing more concepts than S. Figure 3.3 shows a pictorial representation of documents and their summary in a 3-dimensional conceptual space. From this it is apparent that as the area covered by the summary increases, its resemblance to the documents increases, i.e., the meaning of the summary becomes more similar to that of the documents. The summary area can be increased in two ways: first, by taking

more concepts into the summary, and second, by choosing those concepts which have high weights for these dimensions. However, the summary size is restricted, so we adopt the second way of choosing concepts (sentences) with higher dimensional weights. Based on this principle, sentences are scored using the following metrics.

3.4.2 Metrics

1. Rank: In each dimension, sentences are ranked in decreasing order of their weights. Let r_{ij} denote the rank of sentence i along dimension d_j and sc_i the score of sentence i. The score across all dimensions is then computed as:

sc_i = \sum_{j=1}^{n} \frac{1}{\sqrt[x]{r_{ij}}}, \quad x \in [1, 8] \qquad (3.4)

For every dimension, the inverse of the x-th root of the rank is added to the score.

2. Weight: The weight of a sentence along a given dimension directly represents its strength for that dimension. The score is computed by merging the weights of a sentence over all its dimensions:

sc_i = \sum_{j=1}^{n} \sqrt[y]{wt_{ij}}, \quad y \in [1, 5] \qquad (3.5)

For every dimension, the y-th root of the weight is added to the score.
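Both metrics can be sketched over the normalized sentence-by-dimension matrix S of Section 3.3.3. The vectorized ranking below is our implementation choice, and the parameters merely illustrate the admissible ranges of x and y.

```python
import numpy as np

def score_sentences(S, x=None, y=None):
    """Rank metric (Eq. 3.4) when x is given, weight metric (Eq. 3.5) when
    y is given; exactly one of the two is active at a time."""
    if x is not None:
        order = np.argsort(-S, axis=0)          # per-dimension ordering
        ranks = np.empty_like(order)
        rows = np.arange(S.shape[0])
        for j in range(S.shape[1]):
            ranks[order[:, j], j] = rows + 1    # rank 1 = highest weight
        return (1.0 / ranks ** (1.0 / x)).sum(axis=1)
    return (S ** (1.0 / y)).sum(axis=1)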

3.4.3 Redundancy Removal

Redundancy in a summary should be minimal. In order to create non-redundant summaries, the concepts covered in the summary are removed from the conceptual space. This reduces the dimensionality of the space, and as a result concept vectors are represented as

c_i = < wt_{i1}, ..., wt_{ij}, ..., wt_{in} >

such that

\nexists \, d_j for which d_j \in S \wedge d_j \in CS

Hence, further scoring of sentences is done over the remaining dimensions. This reduces the search space and selects sentences encapsulating the remaining concepts, covering all the topics and making the summary non-redundant. Selected sentences will not be ranked again, as the new search space does not contain any of their constituent concepts. Algorithm 1 describes the summary formation procedure, and Algorithm 2 the score-update function. Note that, at a given time, only one of the two metrics is used; the other is assigned the value 0.

Algorithm 1 Summary Formation
1: Input:
   • The sentence set: Set_s = [s_1, s_2, ..., s_m]
   • The word set: Set_w = [w_1, w_2, ..., w_n]
   • Sentence vectors: V = [v_1, v_2, ..., v_m]
   • Summary size limit: L
   • Root of rank: x
   • Root of weight: y
2: Output:
   • Set of summary sentences: S ⊆ Set_s
3: Procedure:
4: initialize sc;
5: initialize wordFlag ← {False}^n;
6: while size(S) < L do
7:   sc ← {0}^m;
8:   for i ← 1, n do
9:     if wordFlag[i] ≠ True then
10:      sc ← UpdateScores(V_{*i}, sc, x, y);
11:    end if
12:  end for
13:  i ← indexOfMaxScore(sc);
14:  S ← S + s_i;
15:  for all w ∈ s_i do
16:    wordFlag(indexOf(w)) ← True;
17:  end for
18: end while
19: return S
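For reference, a compact Python rendering of Algorithm 1 under two assumptions of ours: the weight metric (Eq. 3.5) is the active one, and summary size is counted in words. Masking the columns of covered concepts plays the role of wordFlag.

```python
import numpy as np

def form_summary(sentences, S, word_index, L, y=2):
    """Greedy summary formation (Algorithm 1): score sentences over the
    dimensions not yet covered, take the best, then mark the dimensions of
    its words as covered so later iterations score only remaining concepts."""
    covered = np.zeros(S.shape[1], dtype=bool)        # wordFlag
    chosen, size = [], 0
    while size < L and len(chosen) < len(sentences):
        masked = S.copy()
        masked[:, covered] = 0.0                      # drop covered concepts
        if chosen:
            masked[chosen, :] = 0.0                   # never re-select
        scores = (masked ** (1.0 / y)).sum(axis=1)    # weight metric
        i = int(np.argmax(scores))
        chosen.append(i)
        size += len(sentences[i])
        for w in sentences[i]:
            if w in word_index:
                covered[word_index[w]] = True
    return [sentences[i] for i in chosen]
```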

3.5 Experimental Setup

In this study, we used the standard summarization datasets DUC 2001 and DUC 2002 for evaluation. These datasets were chosen because standard human summaries are available for them; importantly, those summaries were built to evaluate generic summarization tasks¹¹, so the datasets can be used to evaluate any generic text summarizer. DUC 2001 and DUC 2002 contain 30 and 60 document sets respectively, with 10 news articles in each set. Sentences in DUC 2001 were separated manually; for DUC 2002, they were separated by NIST. Stopwords were removed before summarization, based on the list provided by MIT.¹² For evaluation purposes, DUC 2001 provides 4 human summaries and DUC 2002 provides 2, at sizes of 50, 100, 200 and 400 words. Note that DUC 2002 does not contain human summaries of size 400 words.

11 http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html and http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html
12 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

Algorithm 2 Update Scores
1: Input:
   • Weight vector (= V_{*i}): Wt = [wt_1, wt_2, ..., wt_m]
   • Sentence scores: sc = [sc_1, sc_2, ..., sc_m]
   • Root of rank: x
   • Root of weight: y
2: Output:
   • Updated sentence scores: sc = [sc_1, sc_2, ..., sc_m]
3: Procedure:
4: initialize r ← ranks of the sentences along this dimension;
5: if x > 0 then
6:   for i ← 1, m do
7:     sc_i ← sc_i + 1 / \sqrt[x]{r_i};
8:   end for
9: else if y > 0 then
10:  for i ← 1, m do
11:    sc_i ← sc_i + \sqrt[y]{wt_i};
12:  end for
13: end if

We have considered the results on DUC 2001 more reliable because they are evaluated against more human summaries. All evaluation scores are computed using ROUGE, which has been widely used by DUC to evaluate system summaries. We chose the automatic evaluation measures ROUGE-1, ROUGE-2 and ROUGE-SU4, which compute unigram recall, bigram recall and skip-bigram overlap¹³ respectively. We conducted the following experiments:

1. Intrinsic experiments

• Variation of the window size (K) between 1 and 9.

• Variation of the x-th root between 1 and 8 for the rank metric.

• Variation of the y-th root between 1 and 5 for the weight metric.

2. Extrinsic evaluation

• Comparison of CMDS summaries with previous state-of-the-art systems (briefly described later).

13 Skip-bigrams are pairs of words that allow arbitrary gaps but preserve sentence order. ROUGE-SU4 allows a skip distance of 4 words.

3.6 Results and Discussion

3.6.1 Intrinsic Experiments

3.6.1.1 Effect of variable window size:

In this section we discuss the quality of summaries when the window size K is varied. Recall that window size is an intrinsic parameter of HAL which governs the number of co-occurrence relations captured; however, a longer window may form false associations between words. Figure 3.4 shows the ROUGE scores when K is varied between 1 and 9. A significant rise in summary quality can be observed from K = 1 to K = 3. The rate of change then decreases and gradually stabilizes after K = 5. We have kept K = 6 for creating summaries, as it gives slightly improved results.


Figure 3.4 Summary quality v/s window size (K)

3.6.1.2 Effect of variable metrics:

Sentence scoring is an important part of summarization; in CMDS it is influenced by two metrics, rank and weight. Here we discuss the behavior of the summaries when the metric parameters are varied, and finally arrive at preferred values. Figure 3.5 shows the ROUGE scores when x is varied between 1 and 8. We observe that for longer summaries both the DUC 2001 and DUC 2002 datasets show similar behavior. In the first dataset, summary quality improves gradually as x is increased from 1 to 5, and is constant or decreasing for x between 5 and 8. Similarly, for the second dataset quality is lower on either side of x = 4. These observations indicate that optimal summaries are obtained when x lies between 4 and 5. For shorter summaries more variation is observed on DUC 2002, but the trend remains similar to that of longer summaries, with optimal summaries obtained when x lies between 5 and 6. From these observations we conclude that x = 5 achieves optimal quality for both longer and shorter summaries.


Figure 3.5 Summary quality v/s xth root of rank metric

Figure 3.6 shows the ROUGE scores when y is varied between 1 and 5. For most of the summaries, optimal quality is achieved at y = 2 and remains roughly constant for higher values (3, 4 and 5). For higher roots of the weight, sentence scores are closer to each other; this helps because we are adding scores over a large number of dimensions (equal to the total number of distinct words in the documents). Thus, for optimal summary formation using CMDS we use x = 5 for the rank metric and y = 2 for the weight metric.

3.6.2 Extrinsic Evaluation

For extrinsic evaluation we compared our system to previous state-of-the-art systems, briefly described below:

1. Random (baseline): Random sentences are selected for the summary.

2. LSA [17]: SVD is applied on the term-by-sentence matrix and the highest-ranking sentences are selected for the summary.

33 DUC 2001 0.5 0.12 0.18 0.45 0.1 0.16 0.4 0.14 0.35 0.08 0.12 (a) (b) 0.3 (c) 0.06 0.1 ROUGE−I Scores 0.25 ROUGE−II Scores 0.08 0.04 ROUGE−SU4 Scores 0.2 0.06 0.02 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 y (Root of Weight) y (Root of Weight) y (Root of Weight)

DUC 2002 0.15 0.09 0.4 0.08 0.35 0.07

0.3 0.06 0.1 (d) (e) (f) 0.05 0.25 ROUGE−I Scores ROUGE−II Scores

0.04 ROUGE−SU4 Scores 0.2 0.03 0.05 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 y (Root of Weight) y (Root of Weight) y (Root of Weight)

Figure 3.6 Summary quality v/s yth root of weight metric

3. TR-* [44]: A graph-based¹⁴ text summarization method called TextRank. This method treats sentences as graph nodes, then uses modified HITS [29], Positional Power Function [21] and PageRank [50] algorithms to rank the sentences. Top-ranked sentences are taken into the summary.

4. ClusterHITS [61]: A graph-based text summarization method in which topic clusters are considered hubs and sentences authorities; the HITS algorithm is then used to rank the sentences, and the top-ranked sentences are taken into the summary.

5. DSDR-nonlin [20]: A data-reconstruction-based text summarization approach that generates the summary giving the best reconstruction of the original documents, using nonnegative linear reconstruction, which allows only additive (not subtractive) linear combinations.

Table 3.1 shows ROUGE-2 scores for all the systems on the DUC 2001 and DUC 2002 data. Our system (CMDS) outperforms all the other systems, and the improvement grows with summary size. Figure 3.7 shows all the ROUGE scores for all the systems graphically; it can be observed that our system performs better for longer summaries. We attribute this to two reasons: first, sentence scoring favors sentences that give maximum coverage of the conceptual space, which is more probable for longer sentences; second, redundancy removal further helps by capturing as many concepts as the summary size allows. These evaluations show that CMDS is more effective at generating longer summaries while being comparable for shorter ones.

14 The undirected version of the graph is used because direction cannot be decided between sentences of different documents.

Table 3.1 Average F-measure (ROUGE-2) scores for various state-of-the-art systems

                 DUC 2001 (summary size in words)       DUC 2002 (summary size in words)
System           50       100      200      400         50       100      200
Random           0.01639  0.03292  0.0452   0.08138     0.02227  0.03835  0.06043
LSA              0.02641  0.03928  0.06158  0.08844     0.03072  0.04427  0.06703
TR(HITS)         0.03659  0.05986  0.07598  0.10414     0.05124  0.05941  0.08394
TR(PPF)          0.0361   0.04597  0.069    0.09753     0.04955  0.05438  0.07091
TR(PageRank)     0.03237  0.05442  0.07692  0.10772     0.0475   0.06397  0.08709
ClusterHITS      0.03907  0.05234  0.07457  0.09648     0.04949  0.05879  0.08026
DSDR-nonlin      0.02638  0.04721  0.06862  0.1021      0.02933  0.04674  0.07541
CMDS             0.03971  0.06155  0.08154  0.11467     0.05209  0.0682   0.09575

3.7 Summary and Conclusion

In this work we have used an inferential space to find the most informative summary for a set of documents. The space has the ability to evolve by combination, such that more refined and contextually clear concepts can be obtained. Based on these characteristics, we formulated a conjecture suggesting that an ideal summary should encapsulate as many of the concepts present in the documents as possible. Following this, we introduced two scoring metrics to score the sentences. These scoring metrics produced quality summaries, as verified by the extrinsic experiments, while the intrinsic experiments give an insightful picture of the effect of the scoring parameters on summary quality. This work contributes in the following ways: first, an extension of the concept-combination characteristic of conceptual space to define sentences; second, a proposition of heuristics for combining concepts which underlie this extension and differ from previous approaches; third, a novel theoretical framework for summary formation with experimentally estimated parameters. We conclude that the conceptual space is an efficient cognitive model for representing text. Its characteristics allow us to solve various problems, summarization being one of them; this is demonstrated by Figure 3.8, which shows the final summary for the d15c document set in the DUC 2001 dataset.


Figure 3.7 Graphical representation of ROUGE scores for all the systems

1. The doctors warned that besides being at risk of getting tuberculosis themselves, AIDS-infected addicts who carry the TB bacteria also may pass the germs to people they live with, to health care workers and other people.
2. The agency estimated that between 15 million and 20 million adults will be infected with HIV by the year 2000, and it predicted that the number of cases and deaths from tuberculosis will rise sharply as a result, especially in sub-Saharan Africa, Latin America and Southeast Asia.
3. The health department said it is providing tuberculosis testing and treatment for the Human Resources Administration's program for the homeless, and will train staff members on tuberculosis prevention and control.
4. The U.N. agency, in its first comprehensive look at global tuberculosis in a decade, said the disease kills nearly 3 million people a year, most of them between the ages of 15 and 59, "the segment of the population that is economically most productive".
5. While most people with the AIDS virus eventually go on to get acquired immune deficiency syndrome, people who carry the tuberculosis bacteria ordinarily have only about a 10 percent life-long risk of getting TB.
6. Snider said 10 million to 15 million Americans have been infected with the tuberculosis germ, but only a small percentage of them develop the disease because their immune system was strong enough to prevent the disease from developing.
7. The department also has an established residence for homeless tuberculosis patients, and is working with substance-abuse treatment services to extend tuberculosis prevention in its programs.
8. The Board of Health approved a resolution last year requiring all children entering city schools to be tested.
9. Seven of the eight TB cases occurred in people who were already infected with tuberculosis bacteria before the study began.
10. NEW YORK – The incidence of active tuberculosis cases in the city rose 38% in 1990, to 3,520 cases, according to the health commissioner.

Figure 3.8 Final CMDS Summary for DUC2001: d15c

Chapter 4

Multilingual Multidocument Text Summarization

Knowledge cannot be bound by languages; however, its expression and expansion depend on its linguistic source. People relate to a text more easily if it is written in their native language. This means that information present in the most common language will be accepted by a larger part of society, while information in lesser-known languages is left out by the greater part of society. Hence, a tool that can cover and extract information from multiple linguistic sources can give diverse and complete information. Summarization aims to give a complete overview of a set of (topically similar) documents. Language-bound methods can handle text in a single language; they may be highly accurate for their domain language but cannot work on documents in different languages. Thus, summarization methods that can generate summaries from text in multiple languages are both necessary and increasingly needed.

Most of the previous approaches use clustering and translation of documents to form the summary. The basic idea of these techniques [5, 12, 6] is to collect similar information together by clustering within every language, then find similar clusters across languages by translating the clusters of one language and identifying the corresponding clusters in another. The final output is produced in the user's desired language by substituting sentences in other languages with a similar sentence in the required language. Chen et al. [5] have shown that translation after clustering performs better than translation before clustering.

However, earlier approaches do not address an important aspect: what is the effect of "added noisy information" on summary quality? The noise originates from the interaction of two languages, either by translation or transliteration. The added information originates from the fact that different documents can contain different information; if the information differs across languages, then a multi-lingual summary should encapsulate all of it. Figure 4.1 depicts the concept of noisy information: the common (overlapping) part usually contains the general overview of the topic covered in both documents along with some key points, while the dotted part represents the new information which should be incorporated in the summary.

Figure 4.1 Added Noisy Information

4.1 Multilingual Summarization using Jensen-Shannon Divergence

Final Summary = \frac{(1-\alpha)}{d} S(en) + \frac{(1-\beta)}{d} S(hi) + \frac{(1-\gamma)}{d} S(te)

Figure 4.2 Architecture of the system

Figure 4.2 shows the system architecture. As shown, the sets of English, Hindi and Telugu documents are translated using Google translation. Then, the original documents and the translated documents are summarized using generic summarizers for each language. For each summary, the Jensen-Shannon divergence from its input is calculated, giving the divergence scores α, β and γ for the English, Hindi and Telugu summaries respectively. The final multi-lingual summary is then composed from the English, Hindi and Telugu summaries with weights proportional to (1 − α), (1 − β) and (1 − γ).

4.1.1 Translation

The Google translation tool offered all 6 translation utilities: [en → hi], [hi → en], [en → te], [te → en], [hi → te] and [te → hi]¹. According to an analysis of Google Translate by Aiken et al. [22], translations among Western languages are generally best, while those among Asian languages are often poor. Furthermore, the average² BLEU [52] score for the English-Hindi pair was 9.5, which is significantly low compared to the English-French pair with a BLEU score of 91. However, they did not report BLEU scores involving Telugu. From manual analysis³ we found that translations involving Telugu are low in quality; translation involving Telugu was also in its initial development stages (beta phase). One major drawback of the translation system is that even though translations of individual words are correct, their placement in the sentence is not always right. Another problem is that some non-named entities were transliterated instead of translated, which adds noise for the summarization system. In this work we do not try to address each translation-based issue individually; instead, we analyze which summarization approach is more robust and can handle all the issues collectively.

4.1.2 Generic Summarizers

The following state-of-the-art systems were built to form the summaries:

1. LSA [17]: SVD is applied on the term-by-sentence matrix and the highest-ranking sentences are selected for the summary based on their respective eigenvalues.

2. CMDS: The approach uses HAL [40], a matrix representing the co-occurrence pattern of words in a text. The cumulative co-occurrence score is obtained by accumulating the scores between two words over the whole document while moving a window of size K.⁴ The co-occurrence score between two words (w_i, w_j) at a distance k is the product of (K − k + 1) and the frequency of their occurrence at distance k:

Score(w_i|w_j) = \sum_{k=1}^{K} n_k \cdot (K - k + 1) \qquad (4.1)

Concept combination [59] is used to create the final sentence vectors, which are then used to form the final summaries using the rank and weight metrics.

3. TR-PR [44]: An effective graph-based text summarization method called TextRank. In this approach sentences represent vertices (V) and weighted edges are established using a similarity metric⁵

1 en - English, hi - Hindi, te - Telugu.
2 Calculated by averaging the BLEU scores of the two translation directions, e.g. English to Hindi and Hindi to English.
3 An annotator was asked to manually identify the correctness of translations involving the Telugu language.
4 The value of K is set to 5 for this work.

between the two sentences (w_{ij}). Then, the following PageRank algorithm (modified to PR^W) is used to rank the sentences:

PR^W(V_i) = (1 - d) + d \cdot \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{kj}} \, PR^W(V_j) \qquad (4.2)

Here, d is a parameter that is set between 0 and 1. Top ranked sentences are taken into the summary.
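A self-contained sketch of this weighted PageRank follows; the damping value and fixed iteration count are conventional choices of ours rather than parameters reported in [44].

```python
def weighted_pagerank(sim, d=0.85, iters=50):
    """Iterate Eq. 4.2 over a symmetric sentence-similarity matrix sim,
    where sim[j][i] is the edge weight w_ji; returns one score per sentence."""
    n = len(sim)
    pr = [1.0] * n
    out_sum = [sum(row) or 1.0 for row in sim]    # sum_k w_kj for node j
    for _ in range(iters):
        pr = [(1 - d) + d * sum(sim[j][i] * pr[j] / out_sum[j]
                                for j in range(n))
              for i in range(n)]
    return pr
```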

4. ClusterHITS [61]: This summarization method uses a cluster-based HITS model where topic clusters are considered hubs and sentences authorities; the HITS [29] algorithm is then used to rank the sentences, and the top-ranked sentences (in decreasing order of authority score) are taken into the summary.

4.1.3 Jensen-Shannon (JS) Divergence

The Jensen-Shannon divergence between two probability distributions shows their dissimilarity. We use it to measure the divergence of a summary from its input: the divergence score indicates the quality of a summary, and lower values indicate better summaries, based on the assumption that good summaries are characterized by low divergence between the probability distributions of words in the input and the summary, and by high similarity with the input. The JS divergence between two probability distributions P and Q is given by,

J(P||Q) = \frac{1}{2}\left[D(P||A) + D(Q||A)\right] \qquad (4.3)

where,

D(P||Q) = \sum_w p_P(w) \log_2 \frac{p_P(w)}{p_Q(w)}

A = \frac{P+Q}{2}

Here, P represents the summary and Q the input documents to which the summary is compared; p_P(w) and p_Q(w) denote the probability of word w in P and Q respectively. Low divergence values therefore indicate better summaries, and the JS divergence is always defined and bounded between 0 and 1.

4.1.4 Final Summary

The final multi-lingual summary is composed of sentences from all three summaries⁶. The content taken from each summary is governed by its JS divergence score: summaries

5 Similarity between two sentences is proportional to the number of common words.
6 The sentences in these summaries are already ordered in decreasing order of their scores.

having low scores should contribute more to the final summary. Based on this principle, we generated the final summaries in the following manner.

Let α, β and γ denote the JS divergence for the English, Hindi and Telugu documents respectively, and let S(l) denote the summary in language l. Then the final summary is given by,

Final\ Summary = \frac{(1-\alpha)}{d} S(en) + \frac{(1-\beta)}{d} S(hi) + \frac{(1-\gamma)}{d} S(te) \qquad (4.4)

where, d = (1 − α) + (1 − β) + (1 − γ)

Note that (1 − divergence) is an effective weight because a higher value indicates a better monolingual summary for that language. Normalization by d is necessary so that all the weights are comparable and sum to 1.

4.1.4.1 Redundancy Removal

Reducing redundancy is an important part of forming summaries, and it becomes even more important in multi-lingual summarization, because the monolingual summaries have been generated individually for each language over overlapping information. As explained earlier, each sentence has its translation in the other languages (here, two). These translations are used to reduce redundancy in the summaries. The summary S is initialized with the summary in the user's language (say l), scaled by the corresponding language weight ((1 − α)/d, (1 − β)/d or (1 − γ)/d). We then translate the summaries in the other languages into l using a Translate(l, s) function, where the parameter l gives the target language and s is the sentence to be translated, and calculate the Jaccard similarity between each translated sentence and the summary.

JaccardSimilarity(s, S) = \frac{|\{w \mid w \in s \wedge w \in S\}|}{|\{w \mid w \in s \vee w \in S\}|} \qquad (4.5)

A sentence is added to the summary if its similarity value is below a threshold δ. The values of δ were chosen for each system so as to give minimum divergence on the non-translated input. Algorithm 3 shows the pseudo-code for forming the final summary; all the sentences in the final summary are in the user's preferred language.
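The redundancy check itself reduces to a few lines of Python; the threshold value below is a placeholder, since δ was tuned per system as described above.

```python
def jaccard_similarity(sentence, summary):
    """Jaccard similarity (Eq. 4.5) between word sets."""
    s, big_s = set(sentence), set(summary)
    return len(s & big_s) / len(s | big_s) if s | big_s else 0.0

def add_if_novel(summary_words, candidate, delta=0.5):
    """Append a (translated) candidate sentence only if its overlap with the
    summary so far stays below delta; delta = 0.5 is illustrative."""
    if jaccard_similarity(candidate, summary_words) < delta:
        summary_words.extend(candidate)
        return True
    return False
```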

Algorithm 3 Summary Formation
Input:
• User language: l // assuming English
• English summary: S_en
• Hindi summary: S_hi
• Telugu summary: S_te
• Summary weights: α, β, γ
Output:
• Summary: S ⊂ (S_en ∪ S_hi ∪ S_te)
Procedure:
initialize S ← ((1 − α)/d) S_en
initialize TS_hi ← NULL
initialize TS_te ← NULL
for all s ∈ S_hi do
    TS_hi ← Translation(en, s)
end for
i ← 1
while size(S) < ((1 − α)/d)|S_en| + ((1 − β)/d)|S_hi| do
    if JaccardSimilarity(TS_hi[i], S) < δ then
        S ← S + TS_hi[i]
    end if
    i ← i + 1
end while
for all s ∈ S_te do
    TS_te ← Translation(en, s)
end for
i ← 1
while size(S) < ((1 − α)/d)|S_en| + ((1 − β)/d)|S_hi| + ((1 − γ)/d)|S_te| do
    if JaccardSimilarity(TS_te[i], S) < δ then
        S ← S + TS_te[i]
    end if
    i ← i + 1
end while
return S

4.2 Dataset and Evaluation Metric

The dataset consists of 10 news topics, each with 5 English, 5 Hindi and 5 Telugu documents. After translation, 10 documents are added to every language for each topic; finally, we have a dataset of 10 topics with 45 documents per topic, and the experiments were conducted on these 450 documents. In general, summaries are evaluated by comparing them with human-generated summaries; however, generating human summaries for multilingual summarization is a challenging task, since the annotator must be proficient in all the languages for which summaries are being generated. So we had to use evaluation methods that do not involve human summaries. Louis et al. [38] showed that Jensen-Shannon divergence can be used to evaluate summaries without human models, and that the rankings produced by the Jensen-Shannon measure correlate with those produced by ROUGE-2 and ROUGE-SU2 [37]. The method was further studied by Saggion et al. [55], who found JS divergence to be an effective metric for evaluating multi-lingual summaries.

4.3 Experiments

The experiments were divided into three categories:

1. Monolingual summarization: generating summaries in a single language.
2. Bilingual summarization: generating summaries using document sets from two languages.
3. Trilingual summarization: generating summaries using the documents of all three languages.

Each category was further subdivided into classes based on the evaluation criteria, as follows.

Case 1 (C1): Summaries are generated from the original set of documents and evaluated against the original set of documents.
Case 2 (C2): Summaries are generated from the original and added sets of documents and evaluated against the original set of documents.
Case 3 (C3): Summaries are generated from the original and added sets of documents and evaluated against the original and added sets of documents.

There is one more conceivable case, in which summaries generated from the added documents alone are evaluated against the original documents. However, an accurate summary cannot be obtained without taking the original documents into account, and the added documents are highly noisy due to inaccurate translation, so this case was ignored. Comparing the first and second cases tells us whether the added information is useful for summary generation, and the results of this comparison validate the correctness of our approach. The third case allows us to understand the effect of noise on the various summarization systems and to decide which systems are more robust to noisy information.

4.4 Results and discussion

Table 4.1 JS Divergence of monolingual summaries

Case | CMDS  | Cluster-HITS | LSA   | TR-PR
C1   | 0.28  | 0.275        | 0.28  | 0.264
C2   | 0.293 | 0.305        | 0.286 | 0.274
C3   | 0.334 | 0.340        | 0.33  | 0.325

Table 4.1 shows the divergence scores of monolingual summaries, averaged over all three languages; for readability, the best scores are shown in bold in all tables. Comparing cases C1 and C2 shows that, for monolingual summarization, adding information decreases summary quality. Comparing C2 and C3 shows that divergence grows as more languages are added. We believe this decline occurs because the added translated documents contribute more noise than information, which negatively affects various document characteristics and, in turn, summary quality. The C3 divergence scores show that the TextRank algorithm is the most robust against noise and performs slightly better on noisy documents. Overall, these results indicate that the additional information is not helpful for monolingual summarization; instead, it acts as a noise factor reducing summary quality for all the systems.

Table 4.2 JS Divergence of bilingual summaries

Case | CMDS  | Cluster-HITS | LSA   | TR-PR
C1   | 0.36  | 0.361        | 0.367 | 0.354
C2   | 0.32  | 0.344        | 0.329 | 0.331
C3   | 0.379 | 0.39         | 0.384 | 0.372

The divergence scores in Table 4.2 show that adding information improves the quality of bilingual summaries for all systems (compare C1 and C2). We believe the improvement is the result of the added information: although noisy, it fills information gaps present in the individual-language documents, allowing more holistic summaries that cover more information, and it may also reinforce important information present in both document sets. From a system-comparison perspective, the average scores (over C1, C2 and C3) are best for CMDS, and the relative difference between C1 and C2 (0.04) is highest for CMDS. The average C3 divergence scores again show that the TextRank algorithm is the most robust against noise.

Table 4.3 JS Divergence of trilingual summaries

Case   CMDS    Cluster-HITS   LSA     TR-PR
C1     0.397   0.393          0.404   0.392
C2     0.341   0.358          0.353   0.34
C3     0.412   0.421          0.419   0.398

Table 4.3 presents the divergence scores for trilingual summaries. The scores show that, for trilingual summarization, adding information improves summary quality for all the systems. We believe the reasons for this behavior are the same as for bilingual summarization. Also note that the difference in divergence between C1 and C2 is larger for trilingual summaries than for bilingual summaries across all systems. This consolidates the observation that added information improves summary quality for multi-lingual summarization, and shows that the added information works better as the number of languages increases. From a system-comparison perspective, the scores of TextRank are slightly better than those of CMDS; the two systems perform similarly on the original set of documents, but C3 shows that TextRank is the most robust against noise. From the above observations, we find that bilingual and trilingual summarization respond positively to the added noisy information, whereas the opposite is observed for monolingual summarization. We also find that LSA performs reasonably well for monolingual summarization but not for multi-lingual summarization; it is possible that the ranking of latent topics differs across languages, causing a mismatch when computing multi-lingual summaries. TextRank, on the other hand, is reasonably robust against noise in the multi-lingual setting, because it uses a similarity measure that does not depend on sentence structure. Similarly, CMDS uses co-occurrence patterns of words, which are not highly dependent on sentence structure, and a sufficiently large K can overcome that problem.

4.5 Conclusion and Future Work

Summarizing documents in different languages is not possible with existing mono-lingual techniques, and existing multilingual techniques assume the presence of parallel sentences across the documents in different languages. In this work, we have proposed an architecture for multi-lingual summarization that does not make the parallel-sentence assumption. We focus on incorporating the different information found in different languages into the final system summaries. The system can be scaled to any number of languages, provided translations are available for them. It uses Jensen-Shannon divergence both for scoring and for combining the summaries. The divergence between a summary and its corresponding input is a strong indicator of the summary's inferential quality: good summaries usually have low divergence from their input, and divergence-based measures have proven competitive with recall-oriented measures (like ROUGE) for evaluation. The approach is effective, as validated by our experimental results (evaluated using Jensen-Shannon divergence). We found that increasing the number of languages benefits summary quality: information in different languages patches up the incomplete information in the others. Our approach shows that divergence-based measures can help solve the problem of multilingual summarization. Co-occurrence and similarity based representations are robust to noise, and further improvements in translation systems will improve the quality of multi-lingual summaries.
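To make the scoring concrete, the following is a minimal sketch of the Jensen-Shannon divergence between a summary and its input, both modeled as unigram distributions. The function and variable names are illustrative, not our implementation.

import math
from collections import Counter

def word_distribution(text):
    # Lowercased unigram probabilities of a text.
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    # JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

summary_dist = word_distribution("summary text here")
source_dist = word_distribution("full input document text here")
print(js_divergence(summary_dist, source_dist))  # lower = closer to the input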

One of the major challenges in multilingual summarization is the evaluation of system summaries. There is no standard tool to evaluate multilingual summaries; JS divergence is the only available method, and manual evaluation is an expensive and tedious alternative. Hence, there is an urgent need for multi-lingual evaluation tools for summarization. In conclusion, the existing evaluation does not let us say definitively which method suits multi-lingual summaries the most; however, we are confident that introducing more languages improves the quality of summaries.

Chapter 5

Summarization of Online Conversations in the Domain of Debates

In this chapter we apply summarization to another domain, namely online conversations. We show that HAL features, when combined with other features, are useful for a specific summarization task, and that HAL features work effectively even alongside features that are not present in the HAL space. The domain of social networking offers many summarization challenges: most of the text in social media is sentiment-rich and useful for sentiment mining. We use sentiment-rich features along with HAL features and find that they perform better than previous methods that rely only on sentiment-based features. Within the domain of online conversations we focus on debates. An online debate forum is a platform where people take a stance and argue in support of or in opposition to debate topics. An important property of such forums is that they are dynamic and grow rapidly; in such situations, effective opinion summarization approaches are needed so that readers need not go through the entire debate. This domain differs from chats and casual conversation in being more formal and focused on specific topics: an argument may contain varied factual knowledge, but it is usually related to one topic or another. Debates also differ from news and blogs in being comparatively rich in sentiment.

5.1 Approach Used

Summaries are generated by extracting the most relevant Dialogue Acts (DAs)1 from the original documents. The relevance of a DA is calculated by modeling various aspects of it and comparing them against those of other DAs; we refer to these aspects as features in the rest of the text. The features considered cover morphological structure, topic relevance, sentiment relevance and document relevance. Table 5.1 lists the set of sentence features used to rank the DAs; we describe each feature and its calculation in the following subsections.

1A Dialogue Act is the smallest unit of a debate.

Feature Category                     Feature Names
Topic Relevance                      Topic Directed Sentiment Score, Topic Co-occurrence
Document Relevance                   tf-idf Sentiment Score
Sentiment Relevance                  Number of Sentiment Words, Sentiment Strength
Positional and Coverage Relevance    Sentence position, Sentence length

Table 5.1 Sentence features used to rank Dialogue Acts

5.1.1 Calculating Topic Relevance

Debate posts express users' opinions towards debate topics. Therefore, sentences which provide information or express opinion about the debate topics are the most important in the context of debate summaries. We use topic directed sentiment scores and a topic co-occurrence measure to capture the topic relevance of the DAs.

5.1.1.1 Topic Directed Sentiment Score

DAs carrying topic-related sentiments are very important in the context of online debates. They represent the sentiments a DA directs toward the debate topics and are thus a key feature in the task of debate summarization. In the proposed approach, the sentiment score directed towards debate topics is calculated using dependency parses of the DAs and the sentiment lexicon SentiWordNet [2]. Pronoun references are resolved using the Stanford co-reference resolution system [33]. Then, using the Stanford dependency parser [8], each DA is represented as a tree where each node represents a DA word storing its sentiment score and the edges represent dependency relations. Each DA word is looked up in SentiWordNet and the sentiment score calculated with Algorithm 4 is stored in the word's tree node.

Algorithm 4 Word Sentiment Score
1: S ← senses of word W
2: wordScore ← 0
3: for all s ∈ S do
4:    s_score ← s_posScore − s_negScore
5:    wordScore ← wordScore + s_score
6: end for
7: wordScore ← wordScore / |S|

SentiWordNet is a lexical resource used for opinion mining. It stores positive and negative sentiment scores for every sense of every word present in WordNet [13]. For words missing from SentiWordNet, the average of the sentiment scores of their synset member words is stored in the word's tree node; if no such scores are available, a zero sentiment score is stored. If a word is modified by a negation word such as 'never', 'not' or 'nonetheless', its sentiment score is negated.
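As an illustration, the following is a minimal sketch of Algorithm 4 on top of NLTK's SentiWordNet interface (assuming the 'sentiwordnet' and 'wordnet' corpora have been downloaded); our system's binding to the lexicon may differ.

from nltk.corpus import sentiwordnet as swn

def word_sentiment_score(word):
    # Average (positive - negative) score over all senses of the word.
    senses = list(swn.senti_synsets(word))
    if not senses:
        return 0.0  # word missing from SentiWordNet
    total = sum(s.pos_score() - s.neg_score() for s in senses)
    return total / len(senses)

print(word_sentiment_score("great"))   # positive
print(word_sentiment_score("cruel"))   # negative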

In noun phrases like 'great warrior' or 'cruel person', the first word, being an adjective of the latter, influences its sentiment score. Thus, based on the semantic significance of the dependency relation each edge holds, the sentiment score of a parent node is updated with those of its child nodes using Algorithm 5. In DAs like "Batman killed a bad guy.", the sentiment score of the word 'Batman' is affected by the action 'kill'. Thus, for verb-predicate relations such as 'nsubj', 'dobj', 'cobj' and 'iobj', predicate sentiment scores are updated with the verb scores using Algorithm 5.

Algorithm 5 Update Word Sentiment Score
1: node ← word's tree node
2: children ← word's child nodes
3: for all c ∈ children do
4:    updateScore(c)
5:    node_score ← sign(node_score) * (|node_score| + c_score)
6: end for

The tree structure and the recursive nature of Algorithm 5 ensure that the sentiment scores of child nodes are updated before those of their parents. Table 5.2 lists the semantically significant dependency relations used to update parent node scores.

Modification Type   Dependency Relations
Noun Modifying      nn, amod, appos, abbrev, infmod, poss, rcmod, rel, prep
Verb Modifying      advmod, acomp, advcl, ccomp, prt, purpcl, xcomp, parataxis, prep

Table 5.2 List of Dependency Relations
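The following toy sketch illustrates the recursion of Algorithm 5 on a hand-built dependency tree; the Node class, tree structure and scores are illustrative, not our system's data structures.

class Node:
    def __init__(self, word, score, children=None):
        self.word, self.score = word, score
        self.children = children or []

def update_score(node):
    # Depth-first: children are updated before their scores are folded into
    # the parent, preserving the parent's sign (line 5 of Algorithm 5).
    for child in node.children:
        update_score(child)
        sign = 1.0 if node.score >= 0 else -1.0
        node.score = sign * (abs(node.score) + child.score)

# Toy tree for "Batman killed a bad guy" (structure illustrative only).
guy = Node("guy", -0.1, [Node("bad", -0.5)])
root = Node("killed", -0.4, [Node("Batman", 0.0), guy])
update_score(root)
print(root.score)  # ≈ -0.8 after folding in both subtrees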

Extended Targets (ET): Extended targets are entities closely related to the debate topics. For example, 'Joker' is related to 'Batman', 'Clark Kent' to 'Superman', and 'Darth Vader' and 'Yoda' to 'Star Wars'. To extract the extended targets, we collect named entities (NEs) from the Wikipedia page of each debate topic using the Stanford Named Entity Recognizer [14] and sort them by frequency. Among the top-k (k = 20) NEs, some can belong to both debate topics; for example, 'DC Comics' is common to 'Superman' and 'Batman'. We remove such NEs from the individual lists and treat the remaining NEs as the extended targets (extendedTargets) of the debate topics. Given the list of extended targets for each debate topic and a sentiment score for each DA word, topic directed sentiment scores are calculated for each debate topic using Equation 5.1.

Score_{DA}^{Topic} = \sum_{w \in DA,\ w \in ET(Topic)} Score(w)    (5.1)

We refer to these scores as AScore and BScore, representing the scores directed towards topics A and B respectively in a debate between those two topics.

The absolute values of both topic directed sentiment scores are added, giving the DA's topic directed sentiment score. These scores are normalized by the sum of the topic directed sentiment scores of all the DAs.
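As a sketch, Equation 5.1 amounts to the following; the word scores and extended-target list here are made-up values.

def topic_directed_score(da_words, extended_targets, score):
    # score: dict mapping word -> sentiment score (Algorithm 4 output).
    return sum(score.get(w, 0.0) for w in da_words if w in extended_targets)

scores = {"batman": 0.3, "joker": -0.6, "brave": 0.5}
da = ["batman", "is", "brave", "unlike", "joker"]
et_batman = {"batman", "joker", "gotham"}   # hypothetical ET list for topic A
a_score = topic_directed_score(da, et_batman, scores)
print(a_score)  # 0.3 + (-0.6)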

5.1.1.2 Topic Co-occurrence Measure

The topic co-occurrence measure captures DAs containing high-sentiment words which frequently co-occur with the debate topics. The extended targets described above represent the debate topic entities. The topic co-occurrence measure is computed using HAL via Equation 5.2, combining the co-occurrence strength of the DA words with their sentiment strengths; the sentiment score is calculated using Algorithm 4.

Co\text{-}occur_{DA} = \sum_{w \in DA} \Big( \sum_{t \in ET} HAL(w \mid t) \Big) \cdot sentiScore(w)    (5.2)

The topic co-occurrence measure is normalized by the sum of the co-occurrence scores of all the DAs. We add the topic directed sentiment score and the topic co-occurrence measure to obtain the topic relevance feature score of a DA.
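The following sketch shows Equation 5.2 in code; the HAL table here is a hypothetical stand-in for the learned HAL matrix, and the names are illustrative.

def topic_cooccurrence(da_words, extended_targets, hal, senti_score):
    # Weight each word's sentiment strength by its summed HAL co-occurrence
    # with the extended targets (Equation 5.2).
    total = 0.0
    for w in da_words:
        hal_sum = sum(hal.get((w, t), 0.0) for t in extended_targets)
        total += hal_sum * senti_score.get(w, 0.0)
    return total

hal = {("brave", "batman"): 0.8, ("brave", "gotham"): 0.2}  # HAL(w|t) entries
senti = {"brave": 0.5}
print(topic_cooccurrence(["brave", "is"], {"batman", "gotham"}, hal, senti))
# (0.8 + 0.2) * 0.5 = 0.5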

5.1.2 Calculating Document Relevance

The product of the tf-idf and sentiment scores of its words is used to compute the document relevance of a DA, as in Equation 5.3.

tf\text{-}idf_{DA} = \sum_{w \in DA} tf\text{-}idf(w) \cdot sentiScore(w)    (5.3)

The tf-idf score reflects how important a word is to a document in a collection or corpus, while the sentiment score, carrying the word's sentiment strength, reflects its subjective importance in the context of opinionated DAs. Thus, this feature captures DAs containing highly frequent, sentiment-rich words. The document relevance score over all the debate's DAs is used to normalize the individual scores.
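A sketch of Equation 5.3, using scikit-learn's TfidfVectorizer for the tf-idf part (our system's tf-idf computation may differ); the toy documents and sentiment scores are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["batman is a brave hero", "superman is a strong hero"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # rows: documents, columns: vocabulary terms
vocab = vec.vocabulary_

def document_relevance(da_words, doc_index, senti_score):
    # Sum of tf-idf(w) * sentiScore(w) over the DA words (Equation 5.3).
    total = 0.0
    for w in da_words:
        if w in vocab:
            total += tfidf[doc_index, vocab[w]] * senti_score.get(w, 0.0)
    return total

print(document_relevance(["brave", "hero"], 0, {"brave": 0.5, "hero": 0.4}))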

5.1.3 Calculating Sentiment Relevance

This dimension captures the presence of sentiment-carrying words and their strength in the DAs.

1. sentiCount is the count of sentiment-carrying words in a DA, normalized by the total number of sentiment words present in the debate.

2. The sentiment score of each DA word is calculated using Algorithm 4, and Equation 5.4 computes the DA's sentiment strength. The sentiment score of each DA is normalized by the overall debate's sentiment score.

sentiScore_{DA} = \sum_{w \in DA} sentiScore(w)    (5.4)

The sentiment score and the number of sentiment words in a DA are added to produce the sentiment relevance feature score of the DA, as sketched below.
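A compact sketch of this feature (sentiCount plus Equation 5.4, each normalized); the normalizers are passed in precomputed and all names are illustrative.

def sentiment_relevance(da_words, senti_score, debate_senti_words, debate_score_sum):
    # sentiCount: number of sentiment-carrying words, normalized over the debate.
    senti_count = sum(1 for w in da_words if senti_score.get(w, 0.0) != 0.0)
    # Equation 5.4: summed word sentiment scores, normalized over the debate.
    senti_sum = sum(senti_score.get(w, 0.0) for w in da_words)
    return senti_count / debate_senti_words + senti_sum / debate_score_sum

senti = {"brave": 0.5, "cruel": -0.6}
print(sentiment_relevance(["he", "is", "brave"], senti, 50, 12.0))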

5.1.4 Positional and Coverage Relevance

5.1.4.1 Sentence Position

Sentence position plays an important role in predicting whether a DA appears in the summary. In debates, the initial and ending DAs of a post are more important than the middle ones. We therefore use Equation 5.5 to compute a position-based score which is higher for initial and ending sentences than for middle ones; the score is normalized by dividing it by the number of DAs in the post2.

posScore_{DA} = \frac{\left| N/2 - DA_{position} \right|}{N}, \quad N = \text{total DAs in the post}    (5.5)
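Equation 5.5 is easy to verify with a quick sketch; the post length used here is arbitrary.

def position_score(da_position, n_das_in_post):
    # Higher near the start and end of a post, zero in the middle (Eq. 5.5).
    return abs(n_das_in_post / 2 - da_position) / n_das_in_post

n = 10
print([round(position_score(i, n), 2) for i in range(n)])
# [0.5, 0.4, 0.3, 0.2, 0.1, 0.0, 0.1, 0.2, 0.3, 0.4]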

5.1.4.2 Sentence Length

As longer sentences tend to contain more information, we use sentence length as a document-context feature. It also penalizes short sentences (fewer than 5 words), which are less likely to contribute to the summary because they are incomplete or carry little information. Sentence length is the number of words in the DA, normalized by the number of words in the whole debate. We sum the sentence position and sentence length scores to compute the positional and coverage relevance feature score of a DA. Note that all values are normalized over all DAs in the debate so that the different feature scores are comparable.

5.1.5 Calculating Relevance of a Dialogue Act

After generating all the aforementioned features, we calculate the score of a DA as their linear combination; Equation 5.6 assigns a score to each DA s.

score(s) = \lambda_{topicRel} \cdot topicRel(s, topics) + \lambda_{docRel} \cdot docRel(s, D) + \lambda_{sentiRel} \cdot sentiRel(s) + \lambda_{pcRel} \cdot pcRel(s, D)    (5.6)

where λ_{topicRel}, λ_{docRel}, λ_{sentiRel} and λ_{pcRel} are the weights assigned to each feature. A grid search is used to compute the best weight values, and the top-ranked DAs are chosen until the summary length constraint is satisfied. Grid search exhaustively searches through a manually specified subset of the parameter values; since our search space is small ([0, 1] in intervals of 0.1 for each weight parameter), the time taken by the grid search to estimate the weights was very small. A small sketch of this scoring and search is given below. Next, we describe the experimental setup used to test our system.

2A post represents a user argument and consists of multiple DAs.
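The following sketch shows the linear combination of Equation 5.6 and the exhaustive grid search over the weights; the evaluation callback standing in for the ROUGE-based selection is hypothetical.

import itertools

def da_score(features, weights):
    # features/weights: (topicRel, docRel, sentiRel, pcRel), as in Eq. 5.6.
    return sum(f * w for f, w in zip(features, weights))

def grid_search(evaluate):
    # Exhaustive search over [0, 1] in 0.1 steps for each of the four weights.
    steps = [i / 10 for i in range(11)]
    best, best_weights = float("-inf"), None
    for weights in itertools.product(steps, repeat=4):
        quality = evaluate(weights)  # e.g. mean ROUGE of resulting summaries
        if quality > best:
            best, best_weights = quality, weights
    return best_weights

# Usage (hypothetical callback):
# best = grid_search(lambda w: rouge_of_summary_built_with(w))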

5.2 Experimental Setup

In this study, we extracted 10 online debate discussions from www.convinceme.net. These discussions are freely available on the aforementioned site, and Table 5.3 shows the statistics of the dataset used. Each discussion focuses on a different topic, allowing us to produce results over various domains.

Number of users   Number of posts   Number of DAs
1168              1945              23681

Table 5.3 Statistics of the dataset

For evaluation, extractive gold-standard summaries were created by 2 language editors, who were asked to create 500, 1000, 1500 and 2000 word summaries. Inter-editor agreement was calculated to be 71.7%3. The editors were asked to select sentences in the following order of preference:

1. Sentiment-rich sentences containing highly topic-relevant information.

2. Sentiment-rich sentences with relevant information (low noise).

3. Sentences with less subjective content but rich in information.

4. Highly subjective sentences with no relevant information, as well as purely factual statements, should be selected with care, since they add noise without taking any particular stand.

All evaluation scores are computed using ROUGE [35], which stands for Recall-Oriented Understudy for Gisting Evaluation. It has been widely used by DUC to evaluate system summaries. ROUGE measures summary quality by counting overlapping units, such as n-grams, word sequences and word pairs, between system summaries and human summaries. Three automatic evaluation methods, ROUGE-1, ROUGE-2 and ROUGE-L, were chosen to calculate scores; they compute unigram recall, bigram recall and longest common subsequence respectively (a sketch of the ROUGE-1 idea follows the experiment list below). We conducted the following experiments:

1. Comparison of DEBSumm summaries with proven baseline and state-of-the-art summarization systems, explained in Section 5.3.

2. Effect of variable summary size on DEBSumm and state-of-the-art systems.
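For intuition only, the unigram-recall idea behind ROUGE-1 can be sketched as follows; the experiments themselves use the ROUGE toolkit [35], not this simplification.

from collections import Counter

def rouge1_recall(system, reference):
    # Fraction of reference unigrams matched by the system summary.
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("batman is the better hero", "batman is a braver hero"))
# 3 of 5 reference unigrams are matched -> 0.6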

5.3 Results and Discussion

Grid search was used to compute the best parameter values for Equation 5.6. The following values gave the best results, as indicated by ROUGE: λ_{topicRel} = 0.3, λ_{docRel} = 0.1, λ_{sentiRel} = 0.5, λ_{pcRel} = 0.1.4

3The number of common sentences was averaged over the complete set of debates.
4All further experiments were conducted using these values.

The scores show that better summaries are obtained when sentiment-rich sentences are selected. Furthermore, sentiments directed towards the topic words, or co-occurring with the topics, play an important role in scoring the DAs. These trends follow from the calculated weights of the feature sets, where sentiment relevance features receive the highest weight, followed by topic relevance features. Other measures like sentence position and length fine-tune the summaries, as they help differentiate between otherwise similar sentences. The low weight of the document relevance score is understandable, since it is largely redundant for identifying sentiment-rich document words. We compared our system (DEBSumm) to the following systems:

1. Max-length [16]: The longest sentence from each user is selected; if the summary falls short of the required length, the second-longest sentences are selected, and this step is iterated until the summary reaches the required length. This is a proven strong baseline for conversation summarization.

2. Lead [63]: The top sentences from each user are selected, where each sentence must be longer than 4 words; if the summary falls short of the required length, the next sentence is selected, and this step is iterated until the summary reaches the required length.

3. pHAL [26]: The sentence score is calculated by combining the pHAL scores of the sentence's words, where the pHAL score of each word and the sentence score are computed as follows:

pHAL(w) = n(w) \cdot \sum_{w' \in ET} \frac{HAL(w' \mid w)}{K}

Score(S) = \sum_{w_i \in S} P(w_i) \cdot pHAL(w_i)

For summary creation, the top-scored sentences are selected from the sorted list of sentences.

4. tf-idf [1]: Sentences are scored by combining the tf-idf measures of their words5. For summary creation, the top-scored sentences are selected from the sorted list of sentences.

5. OpinionSumm [62]:6 A sentence scoring approach where sentences are scored based on their document similarity, topic relevance, sentiment relevance and length. We used the same parameter values experimentally determined in their work. This is a state-of-the-art opinion summarization system.

In the field of generic summarization, systems 2 and 4 are proven strong baselines and system 3 is a state-of-the-art system. Table 5.4 shows the ROUGE scores (average F-measure) of the different systems for a summary size of 1000 words. Note that each of systems 1, 2, 3 and 4 corresponds to one of the lower-weighted components of the function used to compute our (DEBSumm) scores, while OpinionSumm represents the higher-weighted sentiment component of DEBSumm. The results show that

5Each user discussion is considered a single document when calculating tf-idf values.
6Note that OpinionSumm is the name used to refer to this system throughout this thesis only.

Table 5.4 ROUGE Scores (Average F-measure) of System Summaries (1000 words)

System        ROUGE-1   ROUGE-2   ROUGE-L
Max-Length    0.49892   0.18453   0.48343
Lead          0.49068   0.14759   0.47839
pHAL          0.48985   0.16468   0.46955
tf-idf        0.49922   0.17585   0.48035
OpinionSumm   0.51631   0.20364   0.49849
DEBSumm       0.56833   0.27044   0.55326

DEBSumm comprehensively outperforms the state-of-the-art systems, with an improvement of 5.2% (ROUGE-1), 7.3% (ROUGE-2) and 5.5% (ROUGE-L) over OpinionSumm. These results show that sentiment, both topic-directed and independent of it, is a very important factor in computing effective summaries.

Figure 5.1 ROUGE-2 (Average F-measure) scores vs. summary size (in words)

Evaluating systems over variable summary sizes allows us to judge them over a wide range of summary lengths: shorter summaries require higher precision, while longer summaries require higher recall. As the summary size increases, the number of sentences that add novel relevant information decreases, so the rate of change in scores is not significant. However, in Figure 5.1 we find a slight decrease in the scores of OpinionSumm and DEBSumm from 500 to 1000 words. We believe this behavior is caused by the inclusion of new noisy data rather than relevant data, which suggests that, at larger sizes, more weight should be given to structural and document features over features representing sentiments. Overall, Figure 5.1 shows that DEBSumm consistently outperforms the other systems over different summary sizes.

5.4 Conclusion and Future Work

Sentiment-based features are the most important features in the summarization of online conversations; hence, focus is often laid on sentiment mining techniques, and advances in sentiment mining can indeed improve the quality of mining sentiment-laden text. However, certain features are prevalent in all kinds of text, and the inferential feature is one of them. An inferential feature captures the meaning of a given text, so any meaningful text must have one. In this work we show that using a powerful inferential feature like HAL along with sentiment-based features is not only helpful but improves the final result. Though the HAL feature is a single feature in the complete feature set, it was assigned a significant weight. This shows that although such features may play a lesser role in the overall task of opinion summarization, they nevertheless play an important part, enough to be used for the current task. We conclude that a contextual feature in the form of HAL may not be the most important feature for every flavor of summarization, but it is a necessary component that should not be disregarded.

As for future work, this work can propagate in two directions. From the perspective of summarization, the usefulness of the HAL feature can be investigated further by applying it to different textual domains. Specific to this work, we average the sentiment scores over all senses of a word, because of the poor state of current word sense disambiguation, and this will not work in all cases: some words carry different sentiments in different domains (for example, the word 'refined' is good for oil products but bad in the domain of agricultural products). Therefore, word sense disambiguation and domain-specific sentiment analysis can be incorporated into our system. We can also include debate structure features; these could leverage DAs occurring alongside a high-scoring DA, identify related DAs spanning different users, and help identify relevant DAs more effectively.

Chapter 6

Conclusions

In this thesis we worked on enhancing the quality of summaries using a powerful computational model, HAL, based on conceptual spaces, a cognitive model by Gardenfors. This model can represent text such that the representation retains the inferential properties of the text: a representation of a textual entity T is said to be inferential if it conveys the meaning of T consistent with T's context. One of the major challenges was to adapt HAL to query-independent summarization, deviating from its previous usage in query-dependent summarization. This adaptation was possible because HAL supports concept combination, through which new concepts can be defined from existing ones. By computationally combining word vectors we formed sentence vectors, and the resulting vectors were rich and unique concepts in the original HAL space. After creating this efficient representation, we used it for the task of creating summaries. Prior to actual summary formation, we laid down a conjecture defining a suitable summary, which emphasizes maximizing the retention of senses (as concepts) in the summary. Summaries were formed using two metrics based on the rank and the weight of each sentence vector for each sense; sentences were added to the summary such that they ranked or weighed high in as many senses as possible. To handle redundancy, we remove the senses (concepts) already covered in the summary and choose new sentences based on the remaining senses. The experimental results showed that retaining inferential properties in the textual representation can improve summary quality. On manual analysis, we found that our summaries made more sense than those of other systems and could convey the information in the source documents, albeit without specific details. For future work, novel metrics for ranking sentences can be proposed to improve the sentence ranking algorithm; another direction is to weight the dimensions by their own inferential quality, which would help boost the sense-carrying words of the document.

After this, the next step was to build a multilingual summarizer on top of the generic summarizer. For this task, we combined monolingual summaries based on their individual quality, and we used three different text summarization methods to form the final multilingual summaries. From our study of

two aspects of multilingual summarization, first, the effect of added noisy information, and second, the identification of summarization techniques suitable for multilingual summarization, we came to the following conclusions:

1. Although the new (added) information is noisy, it improves the quality of the summary. Moreover, summaries benefit from using documents in more languages: the core information is retained by the documents in all the languages, guaranteeing its inclusion in the final summary, and the interaction between multiple languages helps fill the gaps in information.

2. Through manual analysis we found that sentences from local (less popular) languages covering an international topic are much more succinct than those from an international (more popular) language. For example, in news related to the bilateral relations between North Korea and the United States, Telugu articles were shorter yet covered all the salient points present in the English articles on the same topic. However, we have not given preference to languages based on their usage, and leave that as future work with regards to the first aspect of this study.

3. Noise percolates into the input documents through the translated text1, which lacks cohesion and coherence.

4. Methods which capture contextual information proved better for the task of multilingual summarization. Our summarizer (CMDS), based on the HAL representation, performed slightly worse than the graph-based TextRank algorithm because we also use structural aspects of the document in our representation; we attribute the lower performance of CMDS to the distortion of sentence structure in the translated text. We also found that using latent dimensions is not useful, because the ranking of latent topics differs across languages, causing a mismatch when computing multi-lingual summaries. However, some methods (especially lexical-chain based ones) are yet to be applied to multilingual summarization; we leave that as future work with regards to the second aspect of this study.

Evaluation of multilingual summaries proved challenging because the creation of human model summaries is more expensive and demands higher skills: the annotator must be proficient in all the languages of the source documents being summarized. Evaluation tools which do not use human models are therefore required, yet few such methods are available. We used the Jensen-Shannon divergence measure to evaluate our systems; this measure has been shown to have medium to high correlation with model-based evaluation methods such as ROUGE and Pyramid scores. Nevertheless, the development of evaluation tools for multi-lingual summarization is another important aspect which must be addressed with due importance.

In the final part of this thesis, we used the HAL representation to form summaries in another domain, namely online conversations. HAL scores were used to calculate the topic relevance scores of the sentences. Topics were defined as the two opposing stances of the debate, which were further boosted

1Translation was done using the online translation tool by Google, http://www.translate.google.com

by an extended list of words extracted from Wikipedia. In this work, HAL is used similarly to previous summarization methods, with topic terms in place of query terms. Another important feature was the sentiment score of each sentence. The sentiment feature was important because debates consist of personal opinions supported by facts; to capture the opinion aspect, the inclusion of sentiment information is necessary. Other superficial features based on length and coverage were used to fine-tune the sentence scores.

Evaluation was done using ROUGE scores. The results show that topic relevance alone cannot be effective for tasks where sentiments are involved, but it still improves the quality of the final summaries and is as important as the sentiment features. From a multilingual perspective, this particular task is more difficult because of the involvement of sentiments: we can identify topic relevance, but calculating sentiment relevance requires a sentiment lexicon for the text's language. This issue, however, is out of the scope of text summarization and belongs to another information retrieval field, Sentiment Analysis.

Overall, the task of summarization is a key cog in the solution to information overload, and imparting meaning to it boosts its efficiency. Our work is a significant step towards creating rich, meaningful summaries. Multi-lingual summarization is one such form on which we have worked in this thesis; the results show promise, and the work showed that this form of summarization involves many subproblems, evaluation being a major one. The evaluation of multilingual summaries requires immediate attention for this field to grow. Applying summarization to another domain, namely online conversations, consolidates HAL as a powerful mode of representation which can be explored for other flavors of summarization.

Related Publications

• Ranade, Sarvesh, Jayant Gupta, Vasudeva Varma, and Radhika Mamidi. "Online debate summarization using topic directed sentiment analysis." In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, p. 7. ACM, 2013.

• Vasudeva Varma, Sudheer Kovelamudi, Jayant Gupta, Nikhil Priyatam, Arpit Sood, Harshit Jain, Aditya Mogadala and Srikanth Reddy Vaddepally. IIIT Hyderabad in Summarization and Knowledge Base Population at TAC 2011. In Proceedings of Text Analysis Conference (TAC 11), National Institute of Standards and Technology, Gaithersburg, Maryland, USA, 2011.

Bibliography

[1] C. Aone, M. E. Okurowski, and J. Gorlinsky. Trainable, scalable summarization using robust nlp and machine learning. In Proceedings of the 17th international conference on Computational linguistics - Volume 1, pages 62–66. Association for Computational Linguistics, 1998.
[2] S. Baccianella, A. Esuli, and F. Sebastiani. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA), 2010.
[3] R. Barzilay, M. Elhadad, et al. Using lexical chains for text summarization. In Proceedings of the ACL workshop on intelligent scalable text summarization, volume 17, pages 10–17, 1997.
[4] J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336. ACM, 1998.
[5] H.-H. Chen, J.-J. Kuo, and T.-C. Su. Clustering and visualization in a multi-lingual multi-document summarization system. In Advances in Information Retrieval, pages 266–280. Springer, 2003.
[6] H.-H. Chen and C.-J. Lin. A multilingual news summarizer. In Proceedings of the 18th conference on Computational linguistics - Volume 1, pages 159–165. Association for Computational Linguistics, 2000.
[7] H. Dalianis, M. Hassel, J. Wedekind, D. Haltrup, K. de Smedt, and T. C. Lech. Automatic text summarization for the Scandinavian languages. Nordisk Sprogteknologi, pages 2000–2004, 2002.
[8] M. De Marneffe, B. MacCartney, and C. Manning. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454, 2006.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
[10] H. Edmundson. New methods in automatic extracting. Journal of the ACM (JACM), 16(2):264–285, 1969.
[11] G. Erkan and D. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. (JAIR), 22:457–479, 2004.
[12] D. K. Evans, J. L. Klavans, and K. R. McKeown. Columbia newsblaster: multilingual news summarization on the web. In Demonstration Papers at HLT-NAACL 2004, pages 1–4. Association for Computational Linguistics, 2004.

[13] C. Fellbaum. Wordnet. Theory and Applications of Ontology: Computer Applications, pages 231–243, 2010.
[14] J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics, 2005.
[15] P. Gärdenfors. Conceptual spaces: The geometry of thought. MIT press, 2004.
[16] D. Gillick, K. Riedhammer, B. Favre, and D. Hakkani-Tur. A global optimization framework for meeting summarization. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 4769–4772. IEEE, 2009.
[17] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–25. ACM, 2001.
[18] M. Hassel and H. Dalianis. Portable text summarization. In Applied Natural Language Processing: Identification, Investigation and Resolution, number 1. IGI Global, 2011.
[19] T. He, F. Li, and L. Ma. Document relevance identifying and its effect in query-focused text summarization. In Granular Computing (GrC), 2010 IEEE International Conference on, pages 206–211. IEEE, 2010.
[20] Z. He, C. Chen, J. Bu, C. Wang, L. Zhang, D. Cai, and X. He. Document summarization based on data reconstruction. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[21] P. Herings, G. Van der Laan, and D. Talman. Measuring the power of nodes in digraphs. October 5, 2001.
[22] H. Holzer, W. Bee, D. Nogueira, K. Semolini, C. Martin, M. Aiken, S. Balan, J. Zetzsche, S. F. Avval, M. Carl, et al. An analysis of Google Translate accuracy.
[23] E. Hovy and C.-Y. Lin. Automated text summarization and the summarist system. In Proceedings of a workshop held at Baltimore, Maryland, October 13–15, 1998, pages 197–214. Association for Computational Linguistics, 1998.
[24] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM, 2004.
[25] M. Hu, A. Sun, and E. Lim. Comments-oriented document summarization: understanding documents with readers feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 291–298. Citeseer, 2008.
[26] J. Jagadeesh, P. Pingali, and V. Varma. A relevance-based language modeling approach to duc 2005. In Proceedings of Document Understanding Conferences (along with HLT-EMNLP 2005), Vancouver, Canada, 2005.
[27] J. Jagarlamudi, P. Pingali, and V. Varma. Query independent sentence scoring approach to duc 2006. In Proceedings of Document Understanding Conference (DUC-2006), 2006.

[28] H. Jing, R. Barzilay, K. McKeown, and M. Elhadad. Summarization evaluation methods: Experiments and analysis. In AAAI Symposium on Intelligent Summarization, pages 51–59, 1998.
[29] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.
[30] K. Knight and D. Marcu. Statistics-based summarization - step one: Sentence compression. In AAAI/IAAI, pages 703–710, 2000.
[31] L.-W. Ku, Y.-T. Liang, and H.-H. Chen. Opinion extraction, summarization and tracking in news and blog corpora. In Proceedings of AAAI-2006 spring symposium on computational approaches to analyzing weblogs, volume 2001, 2006.
[32] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, pages 68–73. ACM, 1995.
[33] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford's multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Association for Computational Linguistics, 2011.
[34] J. Li, L. Sun, C. Kit, and J. Webster. A query-focused multi-document summarizer based on lexical chains. In Proceedings of the Document Understanding Conference, Rochester. NIST, 2007.
[35] C. Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, 2004.
[36] C. Lin and E. Hovy. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 457–464. Association for Computational Linguistics, 2002.
[37] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In S. S. Marie-Francine Moens, editor, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[38] A. Louis and A. Nenkova. Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1, pages 306–314. Association for Computational Linguistics, 2009.
[39] H. P. Luhn. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165, 1958.
[40] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28(2):203–208, 1996.
[41] L. Ma, T. He, F. Li, Z. Gui, and J. Chen. Query-focused multi-document summarization using keyword extraction. In Computer Science and Software Engineering, 2008 International Conference on, volume 1, pages 20–23. IEEE, 2008.

[42] I. Mani. Summarization evaluation: An overview. 2001.
[43] I. Mani and M. Maybury. Advances in automatic text summarization. MIT press, 1999.
[44] R. Mihalcea. Language independent extractive summarization. In Proceedings of the ACL 2005 on Interactive poster and demonstration sessions, pages 49–52. Association for Computational Linguistics, 2005.
[45] R. Mitkov et al. The Oxford handbook of computational linguistics. Oxford University Press, Oxford, 2003.
[46] H. Morita, T. Sakai, and M. Okumura. Query snowball: a co-occurrence-based approach to multi-document summarization for question answering. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 223–229. Association for Computational Linguistics, 2011.
[47] A. Nenkova, R. Passonneau, and K. McKeown. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing (TSLP), 4(2):4, 2007.
[48] V. Ng, S. Dasgupta, and S. Arifin. Examining the role of linguistic knowledge sources in the automatic identification and classification of reviews. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 611–618. Association for Computational Linguistics, 2006.
[49] W. Ogden, J. Cowie, M. Davis, E. Ludovik, S. Nirenburg, H. Molina-Salgado, and N. Sharples. Keizai: An interactive cross-language text retrieval system. In Proceedings of the MT SUMMIT VII workshop on machine translation for cross language information retrieval, volume 416, 1999.
[50] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: bringing order to the web. 1999.
[51] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics, 2004.
[52] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[53] D. Radev, H. Jing, M. Styś, and D. Tam. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919–938, 2004.
[54] H. Saggion. Multilingual multidocument summarization tools and evaluation. In Proceedings of LREC, volume 2006, 2006.
[55] H. Saggion, J.-M. Torres-Moreno, I. d. Cunha, and E. SanJuan. Multilingual summarization evaluation without human models. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1059–1067. Association for Computational Linguistics, 2010.
[56] G. Salton, A. Singhal, M. Mitra, and C. Buckley. Automatic text structuring and summarization. Information Processing & Management, 33(2):193–207, 1997.

[57] D. Shen, J. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI, volume 7, pages 2862–2867, 2007.
[58] D. Song and P. Bruza. Discovering information flow using high dimensional conceptual space. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 327–333. ACM, 2001.
[59] D. Song, P. Bruza, and R. Cole. Concept learning and information inferencing on a high dimensional semantic space. ACM/SIGIR, 2004.
[60] C. Speier, J. S. Valacich, and I. Vessey. The influence of task interruption on individual decision making: An information overload perspective. Decision Sciences, 30(2):337–360, 1999.
[61] X. Wan and J. Yang. Multi-document summarization using cluster-based link analysis. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 299–306. ACM, 2008.
[62] D. Wang and Y. Liu. A pilot study of opinion summarization in conversations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-2011), 2011.
[63] M. Wasson. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 1364–1368. Association for Computational Linguistics, 1998.
[64] M. J. Witbrock and V. O. Mittal. Ultra-summarization (poster abstract): a statistical approach to generating highly condensed non-extractive summaries. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 315–316. ACM, 1999.
[65] J. Zhang, X. Cheng, and H. Xu. Gspsummary: a graph-based sub-topic partition algorithm for summarization. Information Retrieval Technology, pages 321–334, 2008.
[66] L. Zhao, L. Wu, and X. Huang. Using query expansion in graph-based approach for query-focused multi-document summarization. Information Processing & Management, 45(1):35–41, 2009.
[67] Q. Zhou, L. Sun, and J. Nie. Is sum: A multi-document summarizer based on document index graphic and lexical chains. DUC2005, 2005.
