Enhancing Summaries with Conceptual Spaces
Thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science (by Research) in Computer Science and Engineering
by
Jayant Gupta 200802018 [email protected]
Search and Information Extraction Lab
International Institute of Information Technology
Hyderabad - 500 032, INDIA
October 2013

Copyright © Jayant Gupta, 2013
All Rights Reserved

International Institute of Information Technology Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Enhancing Summaries with Conceptual Spaces” by Jayant Gupta, has been carried out under my supervision and is not submitted elsewhere for a degree.
Date

Adviser: Prof. Vasudeva Varma

To Curiosity

Acknowledgments
First and foremost, I wish to thank Prof. Vasudeva Varma for being my advisor and guide to my research work. His presence gave me support, his advice gave me direction and his belief gave me the motivation to pursue research with utmost dedication. I thank Sudheer Kovelamudi for advising me in the initial stages of my research work. I thank Aditya Mogdala, Kushal Dave, Sambhav Jain and Nikhil Priyatam for giving me valuable feedback and guidance during the most critical times of my research. I thank Riayz Ahmad Bhat, who helped me develop my writing skills. I thank Sarvesh Ranade, with whom I worked and had a great learning experience during the final stages of my research. I thank my batchmates for the great times we spent together during my stay at IIIT. I thank Akshay Mani Agarwal for being a brother when I needed one. I thank the Sports Fraternity of IIIT, especially Kamalakar sir, for helping me nurture my passion for sports. I thank all the members of the SIEL lab, especially Ajay Dubey and Harshit Jain, for making the period of research joyous and satisfying. Finally, I thank my parents for supporting me and having faith in me during my research. They gave me freedom and stood by my decisions. In the end, their patience helped give justice and quality to my research work.
Abstract
Library science is the predecessor of present-day information retrieval (IR) technology. For decades, libraries were the source of information and knowledge for one and all. A library meant a mammoth structure, home to thousands of books and journals. People came from faraway lands to great libraries, where they learned and later contributed. These people then became the source of information for others. In present society, we find a paradigm shift. Today, thousands of books can be stored on a handheld device or a personal computer, like a personal library of our own. Furthermore, the Internet has made knowledge easily accessible to every person. Over time, internet services have matured and people have become more comfortable sharing and contributing content via this medium. This has resulted in multiple sources of information pertaining to any single topic. There is no bound on the language of each of these sources, so we now have many sources in multiple languages. This has led to a whole new set of problems that need to be solved by the IR community. The main focus of these problems is the management of huge volumes of information. We need to figure out what methods can be used to understand and impart structure to the information. Furthermore, individual needs play an important role in deciding information management strategies. So, the focus has shifted from getting the information to getting the right information. This work is a step in that direction. We address the problem of text summarization and its multilingual solution. Although text summarization is a relatively old problem, the internet age has given it a new direction and importance. Therefore, summarization methods need to be improved and novel solutions devised. In our methodology, we initially focus on changing the heuristic-based representation of text to a meaningful representation of the text.
We have used Hyperspace Analogue to Language (HAL) to represent the text; it is a computational model based upon a cognitive model called Conceptual Spaces. The properties of conceptual spaces allow us to represent words and sentences in the same space, called the HAL space. We then modeled the problem of summarization as selecting the set of sentences which can represent the source text in the most meaningful manner. To handle redundancy in summaries, we propose a novel mechanism which is effective in the HAL space. Our method is language independent, making it scalable over different languages. We provide useful insights into the formation of the conceptual space using textual examples, and into the behavior of metrics using intrinsic experiments. Intrinsic experiments and extrinsic evaluation were conducted on the DUC 2001 and DUC 2002 datasets. The results of the extrinsic
evaluation show that the quality of summaries is preserved over summary size and that the system outperforms previous state-of-the-art systems for longer summaries while being comparable for shorter summaries. Multilingual summarization is a relatively new field in text summarization. We focused on studying two aspects of multilingual summarization: first, “added noisy information” (related to the number of languages of the source documents) and second, the suitability of monolingual summarizers in a multilingual domain. For our work, we use automatic translation systems along with four generic summarizer systems (including CMDS). These summarizers are used to form monolingual summaries (separately) in different languages. The quality of a summary (for each language) is obtained by the Jensen-Shannon divergence between the summary distribution and the input distribution. To form the multilingual summary, weights proportional to the quality are used to combine the monolingual summaries. This work is done in three languages, namely English, Hindi and Telugu. The experimental results are encouraging and show that as the number of interacting languages increases, the quality of multilingual summaries improves. We also find that, compared to structural methods, contextual methods are more suitable for the task of multilingual summarization. Finally, to show that HAL features are effective for summarization tasks other than generic summarization, we use them as one of the key features to form summaries of online conversations in the domain of debates. The experimental results (ROUGE scores) show that our summaries are better than those of the previous state-of-the-art system. One major difference between our approach and the previous approach was the use of HAL features to create summaries. This shows that adding HAL features to sentiment-related features is helpful for summarizing sentiment-rich text.
To conclude, we explain the need for a meaningful representation of text to improve summary quality. Our work establishes HAL as a quality representation of text, useful for the task of summarization. We also give a summary formation conjecture; the summaries thus formed are of high quality, which improves as the summary size increases. We also show that multilingual summarization is not only needed but useful to solve the problem of information overload. Our work brings out various challenges involved in the task of multilingual summarization, especially the evaluation of multilingual summaries. This work adds the component of multilingual summarization to the solution of information overload.

Contents
Chapter Page
1 Introduction ...... 1
  1.1 Generic Text Summarization ...... 2
  1.2 Multilingual Summarization ...... 3
  1.3 Evaluation of Summaries ...... 4
  1.4 Problem Description ...... 5
  1.5 Overview of our approach ...... 5
    1.5.1 Generic Summarization ...... 5
    1.5.2 Multilingual Summarization ...... 6
    1.5.3 Summarization of Online Conversations in the domain of Debates ...... 6
  1.6 Contributions of this work ...... 7
  1.7 Thesis Organization ...... 7
2 Related Work ...... 9
  2.1 Types of Summarization ...... 9
  2.2 Generic Summarization ...... 11
    2.2.1 Feature based methods ...... 11
    2.2.2 Graph based methods ...... 11
    2.2.3 Lexical chain based methods ...... 12
    2.2.4 Other relevant methods ...... 12
    2.2.5 HAL based methods ...... 13
  2.3 Multilingual Summarization ...... 13
  2.4 Summarization of Online Conversations in the domain of Debates ...... 14
  2.5 Summary Evaluation ...... 15
    2.5.1 ROUGE ...... 15
      2.5.1.1 ROUGE-N ...... 15
      2.5.1.2 ROUGE-L ...... 16
      2.5.1.3 ROUGE-SU* ...... 16
    2.5.2 Jensen-Shannon divergence ...... 17
  2.6 Concluding Remarks ...... 18
3 Multi-Document Summarization Using Conceptual Spaces ...... 19
  3.1 Motivation of our Approach ...... 19
  3.2 Text Representation Overview ...... 20
  3.3 Conceptual spaces as a representative model ...... 21
    3.3.1 Gärdenfors’ Conceptual Spaces ...... 21
    3.3.2 Forming Conceptual Spaces using HAL ...... 22
    3.3.3 Sentences in Conceptual Space ...... 23
  3.4 Conceptual Multi-Document Summarization (CMDS) ...... 26
    3.4.1 Principle ...... 28
    3.4.2 Metrics ...... 29
    3.4.3 Redundancy Removal ...... 29
  3.5 Experimental Setup ...... 30
  3.6 Results and Discussion ...... 32
    3.6.1 Intrinsic Experiments ...... 32
      3.6.1.1 Effect of variable window size ...... 32
      3.6.1.2 Effect of variable metrics ...... 32
    3.6.2 Extrinsic Evaluation ...... 33
  3.7 Summary and Conclusion ...... 35
4 Multilingual Multidocument Text Summarization ...... 38
  4.1 MultiLingual Summarization using Jensen-Shannon Divergence ...... 39
    4.1.1 Translation ...... 40
    4.1.2 Generic Summarizers ...... 40
    4.1.3 Jensen-Shannon (JS) Divergence ...... 41
    4.1.4 Final Summary ...... 41
      4.1.4.1 Redundancy Removal ...... 42
  4.2 Dataset and Evaluation Metric ...... 44
  4.3 Experiments ...... 44
  4.4 Results and discussion ...... 45
  4.5 Conclusion and Future Work ...... 46
5 Summarization of Online Conversations in the domain of Debates ...... 48
  5.1 Approach Used ...... 48
    5.1.1 Calculating Topic Relevance ...... 49
      5.1.1.1 Topic Directed Sentiment Score ...... 49
      5.1.1.2 Topic Co-occurrence Measure ...... 51
    5.1.2 Calculating Document Relevance ...... 51
    5.1.3 Calculating Sentiment Relevance ...... 51
    5.1.4 Positional and Coverage Relevance ...... 52
      5.1.4.1 Sentence Position ...... 52
      5.1.4.2 Sentence Length ...... 52
    5.1.5 Calculating Relevance of a Dialogue Act ...... 52
  5.2 Experimental Setup ...... 53
  5.3 Results and Discussion ...... 53
  5.4 Conclusion and Future Work ...... 56
6 Conclusions ...... 57
Bibliography ...... 61

List of Figures
Figure Page
3.1 Concept combination in a 3-dimensional conceptual space where the combined concept is more refined ...... 23
3.2 Schematic overview of the complete system ...... 27
3.3 Representation of documents and summary in a 3-dimensional conceptual space ...... 28
3.4 Summary quality v/s window size (K) ...... 32
3.5 Summary quality v/s xth root of rank metric ...... 33
3.6 Summary quality v/s yth root of weight metric ...... 34
3.7 Graphical representation of ROUGE scores for all the systems ...... 36
3.8 Final CMDS Summary for DUC2001: d15c ...... 37
4.1 Added Noisy Information ...... 39
4.2 Architecture of the system ...... 39
5.1 ROUGE-2 (Average F-measure) scores v/s Summary Size (in words) ...... 55
List of Tables
Table Page
2.1 Types of Summarization ...... 10
3.1 Average F-measure (ROUGE-2) scores for various state-of-the-art systems ...... 35
4.1 JS Divergence of monolingual summaries ...... 45
4.2 JS Divergence of bilingual summaries ...... 45
4.3 JS Divergence of trilingual summaries ...... 45
5.1 Argument Structure Examples ...... 49
5.2 List of Dependency Relations ...... 50
5.3 Statistics of the dataset ...... 53
5.4 ROUGE Scores (Average F-measure) of System Summaries (1000 words) ...... 55
Chapter 1
Introduction
The World Wide Web provides a pool of knowledge where information on any topic is present in abundance. With the evolution of the internet, the accessibility of information has increased and people are becoming comfortable with digital information. They have started contributing by means of social blogs, articles and online social media. Moreover, the internet has (virtually) dissolved the boundaries of nations and languages. People with varied linguistic preferences are accessing the web and contributing in their preferred language. This accessibility is leveraged by advancements in information retrieval techniques.

Easy accessibility to such large amounts of information is leading to the problem of information overload. According to Spier et al. [60], information overload occurs when the amount of input to a system exceeds its processing capacity. Decision makers have fairly limited cognitive processing capacity; consequently, when information overload occurs, a reduction in decision quality is likely. Therefore, quality text tools which can help in the management of information have become an important need of modern information retrieval systems. In real applications, Google news 1 shows small snippets which help readers decide whether a news item is important enough to read. Modern search engines like Google and Bing 2 also show info-boxes, which are a small window into the complete search results. For many users that info-box may suffice, fulfilling their information need with less information. This also means less unwanted information for the user. Thus, any tool which can give a shorter yet accurate description helps manage information.

According to Wikipedia 3, India ranks 3rd in the number of internet users after the United States (2nd) and China (1st). Ironically, India ranks 164th in internet penetration (12.6%), way behind the United States (28th) and China (102nd).
Presently there are 22 languages recognized by the Constitution of India. According to the 2001 Census of India, 10.35% of the total Indian population were English speakers. The 2005 India Human Development Survey (from surveyed households) reported that among men, 72% do not speak English, 28% speak at
1 http://www.news.google.com
2 http://www.bing.com
3 http://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users
least some English, and 5% are fluent. Among women, the corresponding proportions were 83%, 17% and 3%. Low penetration combined with a large user base suggests vast future opportunities for web-based services in India. The statistics show that an increase in internet penetration is required. The physical requirements (infrastructure, fiber cables, wireless connectivity and devices) for this penetration are out of scope for the information retrieval community. However, our major concern is the services which can be provided once connectivity improves in remote areas. Web-based services which cater to the local linguistic requirements of these people will be in demand. Tools which are scalable over multiple languages will be required for information management. The use of multiple languages adds a multilingual aspect to the problem of information overload. This work focuses on the field of text summarization and its multilingual solution. Summarization is one of the key fields in the information retrieval domain and is used to manage large sets of information.
1.1 Generic Text Summarization
Generic text summarization refers to the computational generation of summaries which cover the most important points of the source document(s). An efficient summary gives a succinct, non-redundant overview of documents without expanding on specific details. Automatic text summarization is a complex and challenging area, and significant research has been done in it. Previous work can be categorized into different types depending on the way the summaries are generated. Some of these include extractive vs. abstractive, single document vs. multi-document, language specific vs. multi-lingual, query dependent vs. query independent, supervised vs. unsupervised, etc. [43]. Principally, unsupervised summarization approaches are broadly classified into graph based, feature based and lexical chain based approaches [25]. Graph based approaches [25, 61, 11] depend upon the rationale that similar sentences should contain identical words. Feature based approaches [10, 32] depend upon characteristics which can be used to distinguish two textual entities. Lexical chain based approaches [3, 67, 34] create lexical chains using available knowledge sources (like WordNet [13]). Over the years, the problem has been modeled in various forms, resulting in different methods to solve it. Initial approaches were based upon sentence extraction; later approaches incorporated various language specific features. The additional features made the summaries more robust. Also, advancements in natural language generation allowed automatic sentence creation, which led to the development of abstractive summarization techniques. Witbrock et al. [64] use extraction to obtain important summary words and then use a bi-gram language model to form sentences. Other approaches shorten sentences using sentence reduction rules. Knight et al. [30] use expectation maximization to compress the syntactic parse tree of a sentence.
The tree is used to produce a shorter but grammatically consistent version of summary sentences. Meanwhile, the means of accessing information changed with the introduction of the World Wide Web in the early 90’s. This led to a renewed emphasis on the problem of text summarization to tackle
the problems of information overload. With the web came different variants (social media based, etc.) of use-cases where a summarization system could be used. Accordingly, summaries were created and summarization methods evolved. A series of highly successful summarization meetings have been held in the past. Amongst them, TAC 4 (Text Analysis Conference) has been the main evaluation forum for research in text summarization. It was previously known as the Document Understanding Conference (DUC) and began in the year 2000. Various summarization tasks, ranging from non-extractive summarization, spoken language (including dialogue) summarization, language modeling for text and speech summarization, multi-document and multilingual summarization, integration of question answering and text summarization, and web-based summarization, to the evaluation of summarization systems, were worked upon during the course of the DUC/TAC workshops. This resulted in a wide range of high-quality generation and evaluation methods. The datasets used to evaluate the systems are often used as benchmarks to evaluate any given summarization system.
1.2 Multilingual Summarization
The internet is accessible to all, irrespective of language. This has resulted in an extensive availability of textual data with linguistic diversity. Reading through all this information, spread across languages, is difficult. So, an efficient way to summarize information distributed in multiple languages is needed. Multilingual text summarization is the problem of producing summaries in a language T when the input contains documents in a language S different from T along with documents in language T, or when the input to the summarizer consists of automatic translations into language T of documents in language S [54]. This is a challenging problem because summaries produced from automatic translations, using noisy input, have problems additional to the lack of cohesion and coherence usually reported in text summarization research [43]. Extractive summarization requires scoring sentences based on their importance. Scoring is done using various (language independent) features like term distribution, frequency patterns, sentence position, sentence length, sentence similarity, etc. These features are effective when all the text is in one language. However, additional features are required if the text contains words in different languages, especially word level features [6]. A major problem is to identify words which have similar meanings in different languages. The most likely solution is to use language specific tools to translate and transliterate the text. In this case, the accuracy of translation and transliteration systems becomes a critical issue. Furthermore, the availability of these tools for languages with fewer resources is an additional problem. Most of the previous approaches use clustering and translation of documents to form the summary. The basic idea of these techniques [5, 12, 6] is to collect similar information together using clustering techniques for every language.
Then, they find similar clusters across different languages by translating clusters in one language and identifying the similar cluster in another. The final output is produced in
4www.nist.gov/tac
the user's desired language by substituting all the sentences in different languages with a similar sentence in the required language. In the past, TAC organized a pilot task 5 related to multilingual summarization. They provided news documents and their corresponding (human) translations in 8 different languages. The task required the systems to be able to summarize the documents in at least 3 different languages (independently of each other) with acceptable accuracy. The task was not itself multilingual summarization, but was framed from the basic idea that a good generic summarizer must be able to produce summaries in different languages with acceptable accuracy.
1.3 Evaluation of Summaries
Summary evaluation is an important part of the summarization field. Evaluation is difficult primarily because there is no single ideal summary. Past studies [28] have shown that human summarizers tend to agree only about 60% of the time, and in only 82% of the cases did humans agree with their own judgement. Apart from the human bias involved in the evaluation of summaries, such manual evaluation is also expensive and time consuming. There is always a possibility of a system generating a better summary that is different from the reference human summary used as an approximation to the ideal output. Automatic evaluation methods are of two types: the first evaluates summaries using human models, and the second evaluates them without human models. Comparison against human summaries (models) to evaluate informativeness has been the more popular approach. For various summarization tasks in TAC, system summaries are evaluated using ROUGE [35] scores. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE measures summary quality by counting overlapping units such as n-grams, word sequences and word pairs between system summaries and human model summaries. Usually, overlap based evaluation methods suffer from the problems of human variability, analysis granularity and semantic equivalence [47]. The variable unit 6 sizes (to be compared) in ROUGE address the problem of analysis granularity. The problems of semantic equivalence and human variability are addressed by using multiple human summaries to evaluate system summaries. Evaluating summaries without human models is relatively new in the field of summary evaluation. It is often thought of as an unreliable way to evaluate summaries, given the absence of gold summaries. However, there are instances where generating human summaries can be an even bigger challenge, especially in multilingual summarization.
The challenge is that the annotator should be proficient in all the languages (of the source documents) for which the summary is being generated. So both the expense and the knowledge required to create manual gold summaries increase. Louis et al. [38] proposed the Jensen-Shannon divergence metric to evaluate summaries without human models. This measure was found to be highly effective at measuring summary quality and showed high correlation with ROUGE scores [55].
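As an illustration of the overlap counting that ROUGE performs, the following is a minimal sketch of ROUGE-N recall against a single reference. It is a simplification: the real metric supports multiple references, stemming and other preprocessing, and the example sentences are invented for illustration.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n_recall(system, reference, n=2):
    """Clipped n-gram overlap between a system summary and a single
    reference summary, divided by the reference n-gram count (recall)."""
    sys_counts = Counter(ngrams(system.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, sys_counts[g]) for g, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
system = "the cat lay on the mat"
score = rouge_n_recall(system, reference, n=2)
```

Here three of the five reference bigrams also occur in the system summary, so the bigram recall is 0.6; averaging such scores over a dataset gives the reported ROUGE-N values.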
5http://www.nist.gov/tac/publications/2011/presentations/Summarization2011 MultiLing overview.presentation.pdf 6A unit can be a word, collection of words, phrases, or sentence.
1.4 Problem Description
Extractive summarization is the task of building a concise excerpt of a given set of documents on the same topic. The summary should convey the sense of the complete document(s) and avoid redundancy. Furthermore, the input documents can be in different languages while retaining their relevance to the common topic. Our task is to build a summarization method based on a rich, informative and meaningful text representation. The representation should be language independent, yet effective across different languages. We then extend the system to perform multilingual summarization and show that the basic feature of the representation can be used to leverage the quality of a different summarization process.
1.5 Overview of our approach
A summary created without any inference makes poor use of its source documents. A summary should retain the overall sense as well as convey the inference of its source documents. Earlier approaches to summarization lacked inferential properties, depending mostly on heuristic representations. This led to content-rich sentences, but their combination could not convey the same inference as the original source documents. In our approach we have modeled the inferential properties of the text and built a robust summarization system using this representation. The detailed study of the representation and the summarization method comes under generic summarization. The next step involves the use of this system as a part of multilingual summarization; in this part we compare our system to other systems that are based on different representations. In the final stage we worked on a specific summarization task: summarizing conversations in the domain of online debates using the basic feature of our text representation.
1.5.1 Generic Summarization
In our method we have used Hyperspace Analogue to Language (HAL) [58] to represent text (words, sentences). HAL is formed by capturing co-occurrence patterns across the text, limiting the size of the patterns by a window size. All the patterns are accumulated together in the form of a W × W matrix (W is the number of distinct words in the dataset). Each cell (wt_ij) of the matrix represents the contextual strength between words i and j, and each row is a vector in the conceptual space. The HAL representation allows the creation of new points (senses) in the same space by combining existing points. This property is used to form sentence vectors in the same space. The representation of sentence vectors can convey the context of any sentence and is highly distinctive, which can be used to disambiguate two similar sentences. Sentences are ranked based on the number and strength of senses they can convey. A summary formed in this manner carries a sense similar to that of the set of source documents. To address the redundancy issue, each time a sentence is selected we re-rank the remaining sentences based on the senses not present in the selected sentence. This increases summary coverage and removes redundancy in the
summary. Our results show that using inferential information improves summary quality, yielding an improvement over previous state-of-the-art systems.
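The windowed co-occurrence accumulation described above can be sketched as follows. The distance weighting scheme (closer neighbours get larger weights) and the tiny example corpus are illustrative assumptions, not the exact parameters used in this work.

```python
from collections import defaultdict

def build_hal(tokens, window_size=5):
    """Build a HAL-style co-occurrence matrix as a nested dict.

    For each word, the preceding words inside the window contribute a
    weight that decreases with distance, so nearer neighbours get
    stronger contextual-strength scores. Rows of the matrix act as
    word vectors in the conceptual (HAL) space."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        start = max(0, i - window_size)
        for j in range(start, i):
            # weight = window_size - distance + 1 (closer => larger)
            weight = window_size - (i - j) + 1
            hal[word][tokens[j]] += weight
    return hal

tokens = "the quick brown fox jumps over the lazy dog".split()
space = build_hal(tokens, window_size=2)
# Each row (e.g. space["fox"]) is a sparse vector of contextual
# strengths; sentence vectors can then be formed by combining the
# vectors of the sentence's words.
```

With a window of 2, the row for "fox" records "brown" (distance 1, weight 2) more strongly than "quick" (distance 2, weight 1), illustrating how the matrix encodes graded context.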
1.5.2 Multilingual Summarization
A framework has been created in which generic summarizers can be used to accomplish the task of multilingual summarization. The framework is mainly used to study two aspects of multilingual summarization. The first aspect is the effect of added noisy information on the summaries, and the second is to understand which types of methods are more suitable for the task of multilingual summarization. In the context of summarization, the addition of noisy information refers to the process where potentially relevant information, containing syntactic errors caused by the translation step, is added to the text to be summarized. In our approach we have used the online machine translation system by Google 7. The documents in a given language (say T) are translated to all the other languages (say, set S) for which summarization is required. So, each of the languages has translated documents from the other languages (referred to as “added noisy information”). Then, we use our generic summarizer along with three other existing state-of-the-art summarizers to generate monolingual summaries independently. The objective of using four different summarizers is to understand which of these techniques is suitable for multilingual summarization. We have also analyzed which approach is more robust against the noise in the data due to translation. The combination of monolingual summaries to create the final multilingual summary is based on the quality of each summary against its input. The quality assessment is done using Jensen-Shannon divergence. The final summary is a linear combination of monolingual summary parts, where the size of each part is proportional to its quality. Redundancy is an even bigger issue in multilingual multi-document summarization because the overlap of information is higher: the most relevant information is often preserved, in variable forms, across the articles related to a topic.
To address redundancy here, we have used Jaccard similarity, which measures the word overlap between the summary and a new sentence to be added. Experimentally calculated thresholds are used, and sentences above the threshold are discarded.
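A minimal sketch of the two measures used in this step: Jensen-Shannon divergence for scoring a summary against its input, and Jaccard similarity for redundancy filtering. The unsmoothed word distributions, the inverse-divergence weighting, and the example divergence values are illustrative assumptions, not the exact formulation used in the thesis.

```python
import math
from collections import Counter

def js_divergence(text_a, text_b):
    """Jensen-Shannon divergence between the word distributions of two
    texts; lower values mean the summary is closer to its input."""
    ca = Counter(text_a.lower().split())
    cb = Counter(text_b.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    js = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2
        if p:
            js += 0.5 * p * math.log2(p / m)
        if q:
            js += 0.5 * q * math.log2(q / m)
    return js

def jaccard(sent_a, sent_b):
    """Word-overlap ratio used to discard redundant sentences:
    sentences whose overlap with the summary so far exceeds an
    experimentally chosen threshold are dropped."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(a & b) / len(a | b)

# Hypothetical divergences of three monolingual summaries against
# their inputs. Lower divergence = higher quality, so each summary
# part's weight here uses the normalized inverse divergence (an
# assumed concrete form of "weights proportional to quality").
divs = {"en": 0.20, "hi": 0.35, "te": 0.30}
inv = {lang: 1.0 / d for lang, d in divs.items()}
weights = {lang: v / sum(inv.values()) for lang, v in inv.items()}
```

Under this weighting, the English summary (lowest divergence) contributes the largest share of the final multilingual summary.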
1.5.3 Summarization of Online Conversations in the domain of Debates
To measure the effectiveness of the HAL representation, we used it in the task of debate summarization. Debates are different from chats and casual conversations, as they are conducted in a formal manner and usually concern one of two opposing debate topics. We have used the usual sentence ranking approach to rank dialogue acts (the smallest unit of a debate). The ranking of each sentence is done by a weighted linear combination of its feature vector. Features represent the topic 8 dependency and the sentiments of each unit. Other superficial features, such as positional and coverage features, are also used to rank the sentences. Evaluation of the final summaries is performed using
7www.translate.google.com 8Refers to the two opposing topics of the debate
ROUGE measures. Comparison of the system summaries against a probabilistic variant of HAL shows that for tasks in which the input text is highly opinion rich, we cannot do away with opinion relevance features.
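The weighted linear combination used to rank dialogue acts can be sketched as follows; the feature names, values and weights here are hypothetical placeholders for the topic, sentiment, positional and coverage features described above, not the values learned in the experiments.

```python
def rank_dialogue_acts(acts, weights):
    """Score each dialogue act by a weighted linear combination of
    its feature values and return the acts sorted best-first."""
    def score(features):
        return sum(weights[name] * value for name, value in features.items())
    return sorted(acts, key=lambda act: score(act["features"]), reverse=True)

# Hypothetical dialogue acts with normalized feature values.
acts = [
    {"text": "I strongly support the motion because it reduces costs.",
     "features": {"topic_relevance": 0.9, "sentiment": 0.8,
                  "position": 0.7, "length": 0.5}},
    {"text": "Hello everyone, glad to be here.",
     "features": {"topic_relevance": 0.1, "sentiment": 0.0,
                  "position": 1.0, "length": 0.2}},
]
# Hypothetical feature weights (in practice estimated experimentally).
weights = {"topic_relevance": 0.4, "sentiment": 0.3,
           "position": 0.2, "length": 0.1}
ranked = rank_dialogue_acts(acts, weights)
# Top-ranked dialogue acts are then included in the summary up to
# the required summary size.
```

The opinionated, on-topic act outranks the greeting despite the greeting's better position, which mirrors why topic and sentiment features dominate in opinion-rich text.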
1.6 Contributions of this work
Following are the contributions made in the process of solving the problem defined earlier:
1. We built an effective generic summarizer which is comparable to state-of-the-art approaches. The summarizer has the following key contributions:
• Extending the concept combination characteristic of conceptual spaces for defining sentences. • Proposing heuristics to combine concepts which underlie this extension, differing from previous approaches. • A novel theoretical framework for summary formation, supported by experimentally estimated parameters.
2. We built a framework, and its underlying steps, to form summaries from multilingual text. The system has the following key contributions:
• Using the system, we studied the effect of language interaction on the summaries. Our results show that the quality of summaries improves as the number of interacting languages increases. • The system methodology differs from previous translation based clustering techniques. • The system has been successfully implemented and studied on three languages, viz. English, Hindi and Telugu.
3. We built a summarization system using our generic summarizer in the domain of debates. The system outperforms the previous state-of-the-art systems. The system has the following key contribution:
• Intermediate features of our summarizer were added to proven sentiment features, leveraging the final summary quality.
1.7 Thesis Organization
Chapter 2 presents the literature survey on summarization. Initially, it describes various factors used to classify summarization tasks and presents the types in tabular form. Following this, it describes previous seminal works in the field of generic summarization. Then, the chapter discusses relevant methods for summarizing multilingual documents and online conversations in the domain of debates. Summary evaluation using ROUGE and Jensen-Shannon divergence is discussed in the final parts of the chapter.
Chapter 3 describes the usage of conceptual spaces for multi-document summarization. It explains the theory of conceptual spaces given by Gardenfors and its use to represent text. Properties of HAL representations are described, which help in building sentence representations of the text. A conjecture to compare two summaries, based on the overall sense of the documents they convey, is then described; based on this conjecture, we present our algorithm to create summaries, followed by a set of experiments to estimate system parameters and compare our system to previous works.

Chapter 4 describes our framework for multilingual multi-document summarization. It explains the notion of added noisy information and the need to observe its effect on the summaries. The system architecture is then described, followed by descriptions of all the generic summarizers used and of the method to form multilingual summaries using Jensen-Shannon divergence. Experiments investigate whether adding (noisy) information from different languages helps, and which summarizer is most suitable for multilingual summarization.

Chapter 5 describes the summarization of on-line conversations in the domain of debates. It presents the approach to form rank-based summaries, where ranking depends on various features, and then describes the calculation of these feature values. The experimental section describes the calculation of weights for each feature, compares our system to previous state-of-the-art systems, and shows that our system is effective.

Chapter 6 concludes the thesis, summarizing the work done and the results of the experiments. It discusses the relevance of inferential properties in a representation and their effect on multilingual summaries. It elaborates on the utility of multilingual summarization and of adding information from different languages to leverage summary quality, and provides details of future work with respect to the thesis.
Chapter 2
Related Work
2.1 Types of Summarization
With the advancement of the summarization field, the summary formation process has been classified based on various factors. The following factors are considered important for describing different types of summarization.
• Input factors: text length, number of documents, genre, external query, text language, summary model, text behavior.
• Purpose factors: who the user is, the purpose of summarization.
• Output factors: running text, headed text, etc.
Summaries can be classified based on the number of source texts (single-text vs. multi-text summarization). If the input contains documents in different languages, the task is multilingual summarization; otherwise it is monolingual summarization. Based on the availability of a trained summary model, summarization can be classified as supervised vs. unsupervised. New genres of text have appeared, ranging from very short (tweets) and short (comments) to longer text (blogs, articles, news, etc.); these are classified based on the language structure of the sentences (formal vs. informal). Sometimes the text is updated regularly, as with news reporting a month-long event; in this case summaries are classified as update vs. static. Depending on the user's need, an external query can be given for summarization, resulting in the classification query-dependent vs. query-independent summarization. Summaries which contain the same sentences as the source documents are called extracts, whereas summaries containing system-generated sentences are called abstracts; thus, depending on the sentences in the summary, the task can be classified as abstractive vs. extractive summarization. Table 2.1 describes the different types of summarization resulting from varying input, purpose and output factors.
Criteria and Types:

Sentence Selection
• Extractive: Summaries contain sentences, phrases or words from the original text. The sentences are not modified and are selected based upon their importance to the text.
• Abstractive: An internal semantic representation of the text is built, and natural language generation techniques are used to create a summary that is closer to what a human might generate.

Number of Documents
• Single-Document: Summary formation from a single document.
• Multi-Document: Producing a single summary from related source documents. Handling redundant information is a challenge when dealing with multiple documents.

External Query
• Query-Dependent: The query constraint gives the information requirement for the summary. Query-dependent methods weight the input text with respect to the query, and the final summary contains highly weighted sentences.
• Query-Independent: Usually referred to as generic summarizers. They select sentences based upon their overall importance to the input text.

Input Language
• Mono-Lingual: Input documents are in a single language. Methods are highly efficient and can use deep natural language analysis to form final summaries.
• Multi-Lingual: Input documents are in multiple languages. This is a relatively new field in summarization, and maintaining an acceptable level of quality over different languages is a challenge.
• Cross-Lingual: Input criteria are similar to mono-lingual summarization; however, these methods use linguistic information from other languages to leverage the summary quality.

Sentence Structure
• Formal Text: Input documents are news articles, blog articles and formal documents. Input sentences are well-formed and the documents (usually) are self-contained.
• Informal Text: Input documents are social media chats, on-line discussion forums and comment sections. Input sentences are malformed, with minimal to heavy use of slang, colloquial phrases and abbreviations. These methods rely heavily on efficient preprocessing of the input text.

Learning Based
• Supervised: Knowledge models are learned from documents and their corresponding summaries. These methodologies are relatively recent and are developing along with machine learning techniques.
• Unsupervised: Previous summarization results or feedback are not used to create summaries. All summaries are formed from scratch and, once formed, are not used for any other summary formation step.

Document Behavior
• Static: Source information remains unchanged, so a summary once formed remains the same.
• Update: Source information changes as time progresses. The methods must take this change into account and update the summary accordingly. Novelty detection is a challenge in update summarization, which is highly useful in the news domain.

Table 2.1 Types of Summarization
2.2 Generic Summarization
2.2.1 Feature based methods
Extractive text summarization uses the sentences in the text to create summaries. Feature-based methods use various features to rank the sentences in the given document(s). Over the years, position-based and frequency-based features have been the most commonly used. The earliest work on summarization, by Luhn [39], used the number of word occurrences and the relative position of keywords within the sentence. The sentence scores reflect the number of occurrences of keywords within a sentence and the linear distance between them due to the presence of non-significant words. Later work [10, 32, 36] added features such as sentence position, topic signature, cue words, date annotation, etc. These features were used to score sentences, and the top sentences were selected for the summary. In MEAD [53], Radev et al. used various sentence-level features such as sentence length, sentence position and query overlap (if a query is given) using cosine similarity. A set of keywords was extracted from the documents, and their occurrence in a sentence was used as a feature. The top sentence of a document was highly valued, and sentence similarity with respect to it was added as a feature. Clusters were created and sentences were scored using sentence and inter-sentence features.
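As a toy illustration of such feature-based scoring, the sketch below ranks sentences by a linear mix of keyword frequency and position. The 0.7/0.3 weights and whitespace tokenization are illustrative assumptions, not values from Luhn or MEAD.

```python
from collections import Counter

def score_sentences(sentences, stopwords=frozenset()):
    """Rank sentences by a linear mix of keyword-frequency and
    position features (the 0.7/0.3 weights are illustrative)."""
    words = [w for s in sentences for w in s.lower().split()
             if w not in stopwords]
    freq = Counter(words)
    top = max(freq.values()) if freq else 1
    ranked = []
    for i, s in enumerate(sentences):
        toks = [w for w in s.lower().split() if w not in stopwords]
        # frequency feature: mean normalized keyword frequency
        f_freq = sum(freq[w] for w in toks) / (top * max(len(toks), 1))
        # position feature: earlier sentences score higher
        f_pos = 1.0 - i / len(sentences)
        ranked.append((0.7 * f_freq + 0.3 * f_pos, s))
    ranked.sort(reverse=True)
    return [s for _, s in ranked]
```

In practice, systems such as MEAD learn or tune such weights; the fixed mix here only demonstrates the mechanism.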
2.2.2 Graph based methods
Graph-based approaches [66, 65, 56] represent text as a graph. Salton et al. [56] apply knowledge of text structure to automatic text summarization by passage extraction. They model the intra-document linkage pattern of a text as a graph whose edges are formed using cosine similarity; a greedy graph traversal in chronological order forms the summary. In TextRank [44], each document is represented as a graph whose nodes stand for sentences, interconnected by a similarity (overlap) relationship. The overlap of two sentences is determined as the number of common tokens between them, normalized by the lengths of the sentences. Modified graph-based ranking algorithms, such as PageRank [50], HITS [29] and PPF [21], are then used to rank the nodes. Motivated by the fact that a document contains various topic themes with varying levels of importance, Cluster-HITS [61] creates topic clusters to identify sentences on the same topics. A bipartite graph is formed between the clusters and the sentences based on cluster-sentence similarity (word overlap), sentence scores are calculated by applying HITS on the graph, and the top-scored sentences form the summary.
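A minimal TextRank-style ranker along these lines might look as follows. The token-overlap similarity and the PageRank-like power iteration mirror the description above; the damping factor 0.85 is the usual PageRank default, not a value fixed by the cited work, and the length normalization is one plausible variant.

```python
def overlap(a, b):
    """Shared tokens normalized by sentence lengths (one variant)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    denom = (len(ta) + len(tb)) or 1
    return len(ta & tb) / denom

def textrank(sentences, d=0.85, iters=50):
    """Rank sentences with a PageRank-like iteration over the
    overlap-similarity graph."""
    n = len(sentences)
    sim = [[overlap(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out = [sum(row) or 1.0 for row in sim]   # outgoing edge weight sums
    score = [1.0 / n] * n
    for _ in range(iters):
        score = [(1 - d) / n + d * sum(sim[j][i] / out[j] * score[j]
                                       for j in range(n))
                 for i in range(n)]
    return sorted(zip(score, sentences), reverse=True)
```

Sentences sharing vocabulary reinforce each other's scores, while isolated sentences converge to the minimum (1 - d)/n.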
2.2.3 Lexical chain based methods
Lexical chain based approaches [3, 34, 67] create lexical chains to represent the text. Lexical chains can be formed using available knowledge sources (such as WordNet [6]) and other lexical features. Barzilay et al. [3] compute lexical chains in a text by merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger and a shallow parser for the identification of nominal groups. Summarization proceeds in stages: the original text is first segmented, lexical chains are constructed, strong chains are identified, and significant sentences are extracted from the text. Construction of chains is a generative process where edges are created based on semantic sense (given by WordNet) and strength is calculated using inter-sentence distance and frequency of co-occurrence. The strength of a chain is decided by its length and number of distinct members. Sentences which contain highly weighted chains are selected for the summary, and redundancy is reduced by including all lexical chain members in the summary.

Zhou et al. [67] used lexical chains in the multi-document summarization system IS SUM. The approach was divided into four components: preprocessing, clustering, summarization and compression. The preprocessing step extracts relevant text from XML files; the text is marked with POS tags, words are stemmed and word frequencies are calculated. Clustering is done based on inter-document similarity, computed by combining cosine similarity and phrase similarity. Lexical chains were formed for each cluster and used to create a Document Index Graph. Chains containing more key phrases (nouns and verbs) were given higher scores. Once all the chains have been built, the strongest chains of each cluster are selected to create the summary; compression is achieved using Maximal Marginal Relevance (MMR). Li et al. [34] modified IS SUM, improving its lexical chain algorithm for efficiency, applying WordNet for similarity calculation and adapting it to query-focused multi-document summarization.
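To make the chaining idea concrete, the sketch below builds chains greedily from a tiny hand-made relatedness map that stands in for WordNet; both the map and the greedy attachment rule are illustrative only, and the strength function follows the length-times-distinct-members idea described above.

```python
# Hypothetical mini-thesaurus standing in for WordNet relations.
RELATED = {
    "disease": {"illness", "infection", "tb"},
    "illness": {"disease", "infection"},
    "infection": {"disease", "illness", "tb"},
    "tb": {"disease", "infection"},
}

def build_chains(nouns):
    """Greedily attach each noun to the first chain containing a
    related word; otherwise start a new chain."""
    chains = []
    for w in nouns:
        for chain in chains:
            if any(w == c or w in RELATED.get(c, ()) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

def strength(chain):
    """Chain strength favoring long chains with many distinct members."""
    return len(chain) * len(set(chain))
```

Real systems disambiguate word senses before chaining; this greedy surface-level version only illustrates the data structure.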
2.2.4 Other relevant methods
Redundancy removal has been a major issue in summarization. Carbonell et al. [4] proposed Maximal Marginal Relevance (MMR) to balance information novelty against importance and create non-redundant summaries. Another approach is based on latent semantic indexing: singular value decomposition (SVD) is used to decompose a term-by-document matrix, and the resulting eigenvalues are used to rank the sentences for generic text summarization [17]. A holistic summarizer, HolSum [18], was proposed which starts from an initial summary 1 and then uses a standard hill-climbing algorithm to select similar summaries such that the new summary is more similar to the original text. Recently, document summarization based on data reconstruction [20] has been proposed, in which the document is reconstructed as a linear combination of the selected sentences; an optimization function selects the sentences that are most informative with minimal redundant information.
1 Lead sentences are selected as the initial summary.
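The MMR criterion mentioned above can be sketched as follows: at each step, select the sentence most relevant to the document but least similar to what is already selected. The bag-of-words cosine and the trade-off parameter lam = 0.7 are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr(sentences, document, k=2, lam=0.7):
    """Greedy MMR selection of k sentences: relevance to the document
    minus redundancy with already-selected sentences."""
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda s:
                   lam * cosine(s, document)
                   - (1 - lam) * max((cosine(s, t) for t in selected),
                                     default=0.0))
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam close to 1 the selection is purely relevance-driven; lowering lam increasingly penalizes redundancy.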
2.2.5 HAL based methods
HAL has earlier been used primarily in query-dependent summarization tasks, motivated by the observation that metrics based on key-concept overlap give better results than metrics based on n-gram and sentence overlap. Jagadeesh et al. [26] combined relevance-based language modeling, Latent Semantic Indexing and special words to create summaries; the relevance of a sentence to a query was calculated by adding the HAL scores between words in the sentence and words in the query. In later work [27], features based on query-independent sentence importance were added. These features were calculated using external documents extracted from the web using the given query, and their addition further improved system performance. Ma et al. [41] score sentences through the importance of their words, and a modified MMR technique is used to adjust the score of each candidate sentence. Word importance is decided by a query-dependent score and a topic-related score, calculated from HAL scores and from likelihood with respect to the query terms respectively. He et al. [19] use a similar approach to summarize relevant documents acquired using user-feedback information and transductive-inference SVM machine learning. Morita et al. [46] use a HAL-like approach to generate query-dependent summaries: a co-occurrence graph is built to obtain words that augment the original query terms and enrich the information need, and summarization is then formulated as a Maximum Coverage Problem with Knapsack Constraints based on word pairs rather than single words. All these approaches focus on query-dependent summarization, where summaries are influenced by the query: HAL scores compute the relevance of words to the query words, and sentences containing highly relevant words form the summary. Our work on generic summarization differs from these approaches entirely, because we do not have a query with which to formulate a summary. We use the HAL representation for its inferential properties, deviating from previous uses which employed it to calculate query relevance. However, in later work on on-line conversations in the debate domain, we do make use of HAL in a similar manner.
2.3 Multilingual Summarization
Summarization for languages other than English has been done for Scandinavian languages [7] and in the SUMMARIST project [23], which includes Indonesian. Both systems implement various language-independent features, such as automatically calculated keywords, term frequency, position and special text elements, to score and rank sentences. SUMMARIST also employs an optimal position policy, where positional scores are generated using a set of documents and their pseudo-ideal summaries; top-ranked sentences form the summary. The Keizai system [49] is a cross-language text retrieval system with summarization as a feature that gives the user an overall comprehension of a document. The system produces summaries in Japanese and Korean using statistical and symbolic techniques; these summaries are translated to English, and both versions are displayed to the user.
Saggion et al. [54] used an English-Arabic alignment table to translate documents and then applied centroid-based sentence extraction techniques to form summaries; the final output contains sentences in English only. Columbia Newsblaster [12] performs multilingual summarization by translating the documents and then using clustering-based methods to generate summaries. They focus on the quality of summarization for a single language, shifting the majority of the multilingual knowledge burden to a specialized machine translation system. Similar work by Chen and Lin [6] performs multilingual news summarization in Chinese and English. Our method differs from previous work because the added (translated) noisy data affects the quality of the summary. Thus, even though a summary is generated in a given language, it contains information from the other languages: the score of a sentence is calculated with respect to all the sentences from all the languages.
2.4 Summarization of Online Conversations in the domain of Debates
In the context of summarizing on-line conversations, which are rich in opinions, identification of opinion-containing sentences is important. Sentence relevance is further decided by sentiment scores, topic relevance, and other lexical and positional features. Earlier works mainly focused on reviews [51, 24, 48], using lexical features (unigrams, bigrams and trigrams), part-of-speech tags and dependency relations. Ku et al. [31] performed opinion summarization in the news and blog domains, proposing opinion extraction at the word, sentence and document levels. For each new word, the distribution of its (Chinese) characters as positive and negative polarity in a manually created seed vocabulary is used to determine the word's sentiment. These scores are compounded to compute sentence scores and then document scores; negation operators decide the sentiment tendency at the sentence level, which further propagates to the document level. Wang et al. [62] performed opinion summarization on conversations. They used a linear combination of features from different aspects, including topic relevance, subjectivity and sentence importance, to score sentences. They also proposed a graph-based method which incorporates topic and sentiment information, as well as additional sentence-to-sentence relations extracted from dialogue structures. Summarization in the specific domain of on-line debates is a novel field. This domain differs from chatting and conversation because it is more formal and focuses on specific topics: an argument may contain various pieces of factual knowledge, but they usually relate to one topic or another. Similarly, it differs from news and blogs because it is comparatively richer in sentiment. Therefore, summarization by opinion mining in debates is an interesting and challenging task.
2.5 Summary Evaluation
Evaluation of summaries is as necessary and advantageous to automatic summarization as it is to other language understanding technologies: it fosters the creation of reusable resources and infrastructure, it creates an environment for comparison and replication of results, and it introduces an element of competition to produce better results [42]. System evaluation is done in two ways, intrinsic and extrinsic. Intrinsic evaluation assesses the summarization system internally, while extrinsic evaluation assesses its utility on a real-world task such as a reading comprehension task (answering a set of questions after reading the summary) or a relevance assessment task (evaluating whether the relevance of a summary to a given topic is the same as that of the source document) [45]. Both kinds of evaluation are necessary and serve different purposes: intrinsic evaluation improves system accuracy and polishes system results, while extrinsic evaluation shows the extent to which the system can accomplish a task involving summarization. Various methods have been proposed in both directions, involving different degrees of human effort; there is a trade-off between the amount of human work and the effectiveness of the evaluation measure. Effort has been made to automate or replicate human evaluation, thereby making the evaluation process less expensive. We present two evaluation measures: ROUGE [35], which stands for Recall Oriented Understudy for Gisting Evaluation, and Jensen-Shannon divergence [38]. The first uses human reference summaries to evaluate the system and is therefore expensive; however, it is effective and has been used extensively to evaluate results in the TAC conferences. The second involves no human effort and evaluates a summary based on its divergence from the input set of documents. This measure is particularly useful for evaluating multilingual summarization, where manual evaluation requires considerably more effort as well as skillful annotators.
2.5.1 ROUGE
ROUGE [35] stands for Recall Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-gram, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. We describe ROUGE-N, ROUGE-L and ROUGE-SU* measures used for evaluation purposes in our work.
2.5.1.1 ROUGE-N
This measure is an n-gram recall between a candidate summary and a set of reference summaries:

ROUGE-N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)}    (2.1)

where n stands for the length of the n-gram gram_n, and Count_match(gram_n) is the maximum number of n-grams co-occurring in the candidate summary and the reference summaries.
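A minimal implementation of Eq. 2.1 over whitespace tokens, with clipped n-gram matching, might read as follows; real ROUGE implementations additionally stem and handle stopwords, which is omitted here.

```python
from collections import Counter

def ngrams(text, n):
    """Multiset of word n-grams in a string."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n(candidate, references, n=1):
    """N-gram recall of the candidate against reference summaries,
    with matches clipped by the candidate's n-gram counts."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref, n)
        total += sum(ref_grams.values())
        matched += sum(min(c, cand[g]) for g, c in ref_grams.items())
    return matched / total if total else 0.0
```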
2.5.1.2 ROUGE-L
This measure uses the longest common subsequence between the candidate summary and a reference summary to estimate their similarity, effectively capturing sentence-level structure. Let X be a reference summary sentence of length m and Y be a candidate summary sentence of length n. Then Recall, Precision and F-measure are calculated as follows:

R = \frac{LCS(X,Y)}{m}    (2.2)

P = \frac{LCS(X,Y)}{n}    (2.3)

F = \frac{2RP}{R + P}    (2.4)

where LCS(X,Y) is the longest common subsequence of X and Y.
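Eqs. 2.2-2.4 can be computed with the classic dynamic-programming LCS, as in this sketch over whitespace tokens:

```python
def lcs_len(x, y):
    """Dynamic-programming length of the longest common subsequence."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x[i] == y[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def rouge_l(reference, candidate):
    """F-measure of Eqs. 2.2-2.4 over word sequences."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_len(x, y)
    r, p = lcs / len(x), lcs / len(y)
    return 2 * r * p / (r + p) if r + p else 0.0
```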
2.5.1.3 ROUGE-SU*
This measure uses the skip-bigram 2 count between the candidate summary and a reference summary to estimate their similarity. It is sensitive to word order without requiring consecutive matches. Unigram matches are also included, to give credit to a candidate sentence even if it has no word pair co-occurring with its reference. Recall, Precision and F-measure are calculated as follows:

R = \frac{SKIP2(X,Y) + Count_{match}(unigram)}{C(m,2) + m}    (2.5)

P = \frac{SKIP2(X,Y) + Count_{match}(unigram)}{C(n,2) + n}    (2.6)

F = \frac{2RP}{R + P}    (2.7)

where SKIP2(X,Y) is the number of skip-bigram matches between X and Y, and C(m,2) and C(n,2) are the total numbers of skip bigrams in the two sentences. Spurious matches are reduced by limiting the maximum skip distance d_skip between two in-order words; the denominators of Eqs. 2.5 and 2.6 are adjusted accordingly. In TAC (DUC), d_skip is set to 4 and is deemed sufficient to capture summary similarity reliably. For our work, we have used F-scores for evaluation, as they represent both precision and recall, for different matches: unigram (ROUGE-1), bigram (ROUGE-2), longest subsequence (ROUGE-L) and skip-bigram with unigram (ROUGE-SU*).
2 A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps.
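The skip-bigram count underlying Eqs. 2.5-2.7 can be sketched as follows. The exact gap convention varies between implementations; here two in-order words at most d_skip positions apart form a skip-bigram, which is one plausible reading, and the helper names are hypothetical.

```python
from itertools import combinations

def skip_bigrams(tokens, d_skip=4):
    """All in-order word pairs separated by at most d_skip positions."""
    return {(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i <= d_skip}

def skip2(x, y, d_skip=4):
    """Number of skip-bigram types shared by sentences x and y."""
    return len(skip_bigrams(x.split(), d_skip)
               & skip_bigrams(y.split(), d_skip))
```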
2.5.2 Jensen-Shannon divergence
Louis et al. [38] evaluated various measures for content selection evaluation in summarization that does not require the creation of human model summaries. Three different types of measures were studied as follows:
1. Distributional Similarity based: Based on the assumption that good summaries are characterized by low divergence between probability distribution of words in the input and summary, and by high similarity with the input. Experiments were done using Kullback Leibler Divergence, Jensen Shannon Divergence and cosine similarity measures.
2. Summary Likelihood: Likelihood of a word appearing in the summary is approximated as being equal to its probability in the input. Summary’s unigram probability and probability under a multinomial model were calculated.
3. Use of topic words in the summary: Summaries containing topic signatures have usually been considered better during content selection. Thus, coverage-based and common topic-signature measures were calculated between the summary and the input.
The results showed that Jensen-Shannon divergence, which measures word-distribution dissimilarity between the summary and the input, performed best of all the measures. Saggion et al. [55] further studied JS divergence and found positive medium-to-strong correlation between system ranks produced by ROUGE and those produced by divergence measures that do not use model summaries. The JS divergence between two probability distributions P and Q is given by
J(P||Q) = \frac{1}{2}\left[D(P||A) + D(Q||A)\right]    (2.8)

where

D(P||Q) = \sum_{w} p_P(w) \log_2 \frac{p_P(w)}{p_Q(w)}, \qquad A = \frac{P + Q}{2}
Here, P represents the summary and Q the input documents against which the summary is compared; p_P(w) and p_Q(w) denote the probability of word w under P and Q respectively.
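Eq. 2.8 over unigram distributions can be sketched as follows; whitespace tokenization is an assumption, and smoothing (used in practice) is omitted.

```python
import math
from collections import Counter

def unigram_dist(text):
    """Maximum-likelihood unigram distribution of a string."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, q):
    """D(P||Q); only words with p(w) > 0 contribute."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items())

def js_divergence(summary, documents):
    """Jensen-Shannon divergence of Eq. 2.8 between two texts."""
    p, q = unigram_dist(summary), unigram_dist(documents)
    a = {w: (p.get(w, 0) + q.get(w, 0)) / 2 for w in set(p) | set(q)}
    return 0.5 * (kl(p, a) + kl(q, a))
```

Because the average distribution A covers every word of P and Q, the divergence is always finite and bounded by 1 bit.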
2.6 Concluding Remarks
Throughout the literature survey we came across various methods to summarize text. The methods depend upon the type of summarization problem they are trying to solve, and the classification of summaries depends upon input, purpose and output factors. For example, query-dependent summarization techniques use the given query to select relevant sentences. Approaches also exploit inherent characteristics of the input text: monolingual summarization has many approaches which make use of language-dependent morphological and dependency features (especially lexical chain based methods), and some recent methods [57] use conditional random fields (CRFs) built upon part-of-speech tags to build appropriate summarization models.

In the context of multilingual summarization, methods use translation, transliteration, or both before summarizing the documents; most follow a cluster-and-map approach to build summaries. Work has been done on Scandinavian, Chinese and Arabic languages. The field is relatively new and depends upon the quality of translation systems, so improvements in translation systems will help improve this field. Another issue with multilingual summarization is evaluation: manual creation of multilingual summaries is a very expensive process and requires highly skillful annotators. Nevertheless, multilingual summarization is necessary for building summarization systems that can account for information distributed among different languages.
Chapter 3
Multi-Document Summarization Using Conceptual Spaces
3.1 Motivation of our Approach
Usually, summarization approaches are based on frequently observed heuristics in the data. Heuristics are patterns such as: similar sentences contain similar words; top (position-wise) sentences are more important to the summary; highly frequent words are more important; etc. However, representing sentences with features like position, length and the frequency of constituent words cannot capture their meaning. Furthermore, sentences deemed similar by word overlap convey statistical similarity rather than semantic similarity. These issues can be handled effectively if the representation contains contextual and inferential information. Earlier, features like latent dimensions [9] and lexical chains uncovered word-word and word-document similarity; however, latent dimensions cannot be explained in a symbolic sense, whereas lexical chains depend on the availability of reliable and efficient knowledge sources, whose development is itself a major research problem for many languages.
Heuristics reflect a thought process, and the high occurrence of such patterns across multiple texts justifies the existence of a corresponding cognitive process. However, justification of existence does not provide insight into the actual cognitive process. For example, if we come across a textual entity of the form "1 + 1 = 2", heuristics can identify similar patterns like "x + y = z". The pattern suggests that "+" represents an existing but undefined process. In reality, "+" here represents accumulation, which cannot be identified using heuristic features. Therefore we need to imbibe cognitive knowledge into our representation to bring out the underlying meaning of a pattern. Incorporating such knowledge could prove helpful in summarization, because we can select sentences which are more meaningful for the summary. The selected sentences are inferentially rich, giving a better understanding of what the documents are all about.
3.2 Text Representation Overview
We use the HAL representation, which represents text in a highly meaningful manner. In this section we briefly describe the algorithm used to form HAL vectors, illustrated with the sentence below. 1
“The aggressive identification and treatment of HIV-infected intravenous drug users with latent tu- berculous infection is therefore of both clinical and public health importance,” wrote Dr. Peter A. Selwyn of Montefiore Medical Center in New York.
Step 1: A co-occurrence matrix is formed for the complete dataset (Section 3.3.2). In this matrix all the rows represent a single vector, some rows in the matrix are shown below. 2 aggressive [10] people: 5.0, identification: 5.0, workers:4.0, treatment:4.0, care:3.0, hiv:3.0, health:2.0, infected:2.0, live:1.0, intravenous:1.0, identification [10] aggressive:5.0, treatment:5.0, people:4.0, hiv:4.0, workers:3.0, infected:3.0, care:2.0, intravenous:2.0, drug:1.0, health:1.0, treatment [62] tuberculosis:26.0, drug:12.0, cases:12.0, americans:12.0, mandatory:10.0, undergo:8.0, providing:8.0, people:7.0, tb:7.0, patients:7.0, hiv [76] infected:28.0, tb:28.0, virus:20.0, percent:17.0, people:16.0, aids:13.0, carried:12.0, twenty:9.0, drug:9.0, causes:9.0, infected [121] tb:34.0, people:34.0, million:29.0, aids:28.0, tuberculosis:28.0, hiv:28.0, bacteria:28.0, virus:19.0, whites:15.0, americans:15.0, intravenous [10] drug:5.0, infected:5.0, hiv:4.0, users:4.0, treatment:3.0, latent:3.0, identification:2.0, tuberculous:2.0, aggressive:1.0, infection:1.0, drugs [24] six:9.0, months:8.0, developing:7.0, countries:7.0, compliance:7.0, major:7.0, combat:5.0, cost:5.0, person:5.0, tuberculosis:5.0, users [18] drug:10.0, study:6.0, latent:5.0, methadone:5.0, tuberculous:4.0, conducted:4.0, intravenous:4.0, program:4.0, infection:3.0, infected:3.0, latent [18] percent:7.0, drug:6.0, users:5.0, tuberculous:5.0, tb:5.0, americans:5.0, infections:4.0, infection:4.0, clinical:3.0, intravenous:3.0, tuberculous [10] latent:5.0, infection:5.0, users:4.0, clinical:4.0, drug:3.0, public:3.0, intravenous:2.0, health:2.0, infected:1.0, importance:1.0, infection [104] tuberculosis:12.0, tb:12.0, whites:12.0, aids:10.0, stead:10.0, drug:8.0, blacks:8.0, hiv:8.0, nursing:8.0, diseases:8.0, clinical [10] infection:5.0, public:5.0, health:4.0, tuberculous:4.0, latent:3.0, importance:3.0, users:2.0, wrote:2.0, drug:1.0, dr:1.0, public [25] health:15.0, officials:8.0, clinical:5.0, states:5.0, 
tuberculosis:5.0, trend:5.0, united:4.0, infection:4.0, epidemic:4.0, importance:4.0, health [162] officials:34.0, tuberculosis:32.0, department:30.0, aids:17.0, public:15.0, epidemic:13.0, dr:11.0, city:11.0, federal:10.0, year:9.0, workers:9.0, importance [10] health:5.0, wrote:5.0, public:4.0, dr:4.0, clinical:3.0, peter:3.0, infection:2.0, selwyn:2.0, tuberculous:1.0, montefiore:1.0, wrote [10] importance:5.0, dr:5.0, health:4.0, peter:4.0, public:3.0, selwyn:3.0, clinical:2.0, montefiore:2.0, infection:1.0, medical:1.0, dr [110] tuberculosis:15.0, health:11.0, george:10.0, director:10.0, commissioner:9.0, myers:9.0, whites:8.0, jr:6.0, blacks:5.0, william:5.0, peter [10] dr:5.0, selwyn:5.0, wrote:4.0, montefiore:4.0, importance:3.0, medical:3.0, health:2.0, center:2.0, public:1.0, york:1.0, selwyn [10] peter:5.0, montefiore:5.0, dr:4.0, medical:4.0, wrote:3.0, center:3.0, importance:2.0, york:2.0, people:1.0, health:1.0, montefiore [10] selwyn:5.0, medical:5.0, peter:4.0, center:4.0, dr:3.0, york:3.0, people:2.0, wrote:2.0, importance:1.0, carry:1.0, medical [29] montefiore:5.0, center:5.0, legislative:5.0, submitted:5.0, beds:5.0, wards:5.0, welfare:4.0, selwyn:4.0, proposal:4.0, patients:4.0, center [19] tuberculosis:5.0, medical:5.0, york:5.0, disease:5.0, informed:5.0, people:4.0, lung:4.0, montefiore:4.0, diena:4.0, increase:3.0, york [46] city:16.0, tuberculosis:15.0, cases:12.0, aids:8.0, tb:8.0, united:6.0, medicine:5.0, blacks:5.0, people:5.0, population:5.0,
Step 2: Next, we perform vector addition on all the words of a given sentence (Section 3.3.3). This results in the sentence vector shown below,
1 DUC-2001 dataset d15c, DOC NO: AP890302-0063
2 The rows shown are those relevant to the example sentence.
< (tuberculosis: 110.000), (tb: 82.000), (people: 68.000), (aids: 65.000), (hiv: 64.000), (drug: 54.000), (health: 45.000), (officials: 42.000), (americans: 41.000), (infection: 41.000), (infected: 38.000), (department: 38.000), (million: 38.000), (percent: 38.000), (public: 37.000), (bacteria: 37.000), (treatment: 35.000), (dr: 34.000), (latent: 34.000), (users: 33.000), (cases: 31.000), (tuberculous: 30.000), (study: 29.000), (clinical: 29.000), (whites: 27.000) >
Step 3: The resultant sentence vectors are normalized across the dimensions (words) over all the sentences. Summaries are formed using the resultant normalized sentence vectors. Some words are shown in boldface so that readers can relate the representation to the sentence; the decimals show the weight given to each word. < (peter: 0.192), (tuberculous: 0.156), (clinical: 0.156), (wrote: 0.144), (importance: 0.144), (intravenous: 0.118), (selwyn: 0.098), (users: 0.084), (latent: 0.074), (alcohol: 0.064), (identification: 0.062), (philip: 0.059), (san: 0.052), (worried: 0.051), (aggressive: 0.049), (illness: 0.045), (day: 0.045), (panel: 0.045), (multiple: 0.043), (diseases: 0.043), (francisco: 0.042), (eliminate: 0.042), (public: 0.041), (rehabilitation: 0.041), (montefiore: 0.038) >
Observe that some words which are not in the sentence appear in its representation. However, all these words are important to the context of the sentence and help us infer useful information about it. High weight is given to the words “peter”, “selwyn” and “montefiore”, which represent the speaker of this sentence. This is followed by “tuberculous”, which indicates the type of infection, and the word “alcohol”, which is as harmful as drugs (suggested by another line from the same text). A sense of importance is conveyed by the words “clinical” and “importance”. The word “aggressive” highlights the intensity of the measures required in this case. Note that “philip”, “san” and “francisco” also have high weights. This is because d15c (the set of documents) contains a sentence with a similar sense, spoken by Dr. Philip C. Hopewell of San Francisco. The subsequent Section 3.3 describes the theory, properties and creation of conceptual spaces in complete detail. Section 3.4 describes the summary formation process from these vectors.
3.3 Conceptual spaces as a representative model
3.3.1 Gärdenfors' Conceptual Spaces
Conceptual space is one of the three levels of a cognitive model proposed by Gärdenfors [15]. According to this model, cognitive representation can be done at three levels: symbolic, connectionist3 and conceptual. Symbolic representation tends to view every process as symbol manipulation that can be modeled by Turing machines. Connectionist representation focuses on associations between various elements. Conceptual representation derives sense from the geometrical structure of its elements. The overall relation amongst the three representations can be understood as follows: given any representation, the symbolic level draws out the characteristics and functioning of all the symbolic entities. Then each symbol is connected to the others at the connectionist level. This geometrical composition is used by the conceptual representation to infer the sense, adding a meaning to the complete representation.

3 Connectionism is a special case of associationism which is modeled using artificial neural networks.
In more abstract terms, a conceptual space CS consists of a class of quality dimensions D1, D2, ..., Dn. A quality dimension refers to a characteristic property which is important for describing any information uniquely. A point in CS is represented by a vector v = <d1, d2, ..., dn> with one index for each dimension [15]. Within this space, a concept is defined as a convex region: all the points present in this region represent the same semantic sense with minor variations. HAL has earlier been used successfully to create conceptual spaces [58]. This motivated us to use HAL space to build the conceptual space from the documents. HAL is a representational model of semantic memory based on the intuition that when humans encounter a new concept they derive its meaning from accumulated experience; that is, the meaning of a concept can be acquired through its usage with other concepts within the same context [40]. Throughout the following text we use the words “dimension” and “concept” interchangeably. This is because HAL space is defined by the words of the documents as its dimensions, and each such word represents a concept in itself (defined in this space). So, “dimension” refers to the role of a word when it acts as a building block, whereas “concept” refers to the role in which it defines its own meaning in the document.
3.3.2 Forming Conceptual Spaces using HAL
Given a lexicon of n words, HAL builds an n × n co-occurrence matrix in which each element contains the cumulative co-occurrence score between two words. The cumulative co-occurrence score is obtained by accumulating the scores between the two words over the whole document while moving a window of size K. The co-occurrence score between two words at a distance k is the product of (K − k + 1) and the frequency of their occurrence at distance k. Thus, the cumulative co-occurrence score between two words over the complete set of documents is given by,
Score(w_i | w_j) = \sum_{k=1}^{K} n_k (K − k + 1)    (3.1)
where n_k is the frequency of co-occurrence of w_i and w_j at a distance k. HAL is direction sensitive: the co-occurrence information for words preceding each word and for words following each word are recorded separately by row and column vectors [59]. Thus, the dimension of each word is 2n. Similar to [58], we do not consider the direction sensitivity of a word pair; the row and column vectors are merged into one, reducing the dimension of each vector to n. Within HAL space, a concept is defined as a weighted vector,
c_i = <wt_{i1}, wt_{i2}, ..., wt_{in}>
Figure 3.1 Concept combination in a 3-dimensional conceptual space where the combined concept is more refined.

wt_{ij} is the weight of the concept c_i along dimension d_j. The weight shows the strength of the contextual similarity
that exists between c_i and the concepts representing the dimensions in the documents. Consider the following example of a concept vector, tuberculosis.
tuberculosis: < (cases: 108.0), (aids: 72.0), (people: 59.0), (bacteria: 42.0), (active: 41.0), (disease: 40.0), (health: 32.0), (risk: 29.0), (infected: 28.0), (percent: 27.0), (year: 27.0), (treatment: 26.0), (number: 25.0), (epidemic: 24.0), (united: 23.0), (tb: 23.0), (virus: 23.0), (case: 23.0), (vermund: 22.0), (reported: 21.0), (states: 19.0), (patients: 18.0), (years: 17.0), (morbidity: 17.0), (control: 17.0) >
It can be observed that tuberculosis is a bacterial disease affecting health, which is indicated by the relatively high weights of “bacteria”, “disease” and “health”. Moreover, “cases”, “aids” and “people” are given higher weights because the documents4 talk about an increase in tuberculosis cases due to AIDS. This shows that HAL can preserve the contextual meaning of a word along with its conceptual meaning and efficiently capture the inferential characteristics of the documents. As a result, HAL spaces form an effective implementation of conceptual spaces.
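The windowed scoring of Eq. (3.1), with row and column vectors merged as described above, can be sketched as follows. This is a minimal illustration, not the thesis implementation; the function name and the toy token list are ours.

```python
from collections import defaultdict

def hal_matrix(tokens, K=6):
    """Build a direction-insensitive HAL co-occurrence matrix.

    Each pair of words at distance k <= K contributes (K - k + 1),
    as in Eq. (3.1); scoring both words of the pair symmetrically
    merges the row and column vectors into one."""
    scores = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        # Look ahead up to K positions; distance k runs from 1 to K.
        for k in range(1, K + 1):
            if i + k >= len(tokens):
                break
            v = tokens[i + k]
            weight = K - k + 1
            scores[w][v] += weight
            scores[v][w] += weight  # symmetric: merged row/column vectors
    return scores

# Toy fragment echoing the example rows above (scores will differ,
# since the thesis accumulates over the whole document set).
tokens = "aggressive identification treatment people hiv".split()
m = hal_matrix(tokens, K=6)
```

With K = 6, adjacent words score 6, words two apart score 5, and so on; repeated co-occurrences accumulate, which is how the large counts in the example rows arise.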
3.3.3 Sentences in Conceptual Space
An important characteristic of conceptual space is the ability to define new concepts by combining existing concepts. Figure 3.1 shows the effect of concept combination in conceptual space. It can be seen that the combined concept envelops more space in the conceptual space, which means that the new concept is more meaningful than the concepts composing it. This allows the formation of various meaningful concepts within the domain of our space. An implementation of this characteristic was proposed as a 4-step heuristic approach [58].
In this approach, two concepts c1 and c2 are combined by first re-weighting so that higher weights are assigned to the dimensions of the dominant concept5 (assume c1 here). Then, common dimensions are strengthened by a factor greater than 1. Strengthening ensures that a common dimension has a greater chance of becoming a quality dimension of the resultant concept. Then, the two concepts are composed
4 DUC2001: d15c
5 A dominant concept is the more significant amongst the two concepts.
together to form the resulting concept c1 ⊕ c2.
wt_{(c1⊕c2)i} = wt_{1i} + wt_{2i}
Finally, the vector c1 ⊕ c2 is normalized so that concepts can be compared at the same level. Our approach has similar motivations, but we differ in our heuristics. The following are our underlying heuristic factors.
1. The heuristic of a dominant concept is not used in concept combination because all the concepts are considered equivalent to each other. This keeps the summary unbiased towards any concept based on an initial judgment.
2. Strengthening of overlapping dimensions is not done because concept combination takes place among relatively close-occurring words. In a given context, close terms are usually used together, so they share many common dimensions which automatically get strengthened after combination.
3. Instead of normalizing each concept across its dimensions, we normalize each dimension along all the vectors. This serves two purposes:
(a) The weight for every dimension is scaled to [0, 1]. This makes the weights of different dimensions comparable to each other.

(b) The concepts (in this case sentences) are now represented as a left stochastic matrix, in which all columns sum to 1. Thus, our documents are now represented by a fixed point in the HAL space where all the dimensions have value 1.
Based upon the above heuristics, we create a sentence vector using the following steps:
Step 1: Given a sentence s_i = w1, w2, w3, ..., w_{li} with li words6, the composition of all the words results in the representation of the sentence in the conceptual space.
s_i = w_1 ⊕ w_2 ⊕ w_3 ⊕ ... ⊕ w_{li}    (3.2)
Step 2: Sentence vectors are normalized along the dimensions as follows:
wt_{ij} = wt_{ij} / \sum_{k=1}^{m} wt_{kj},   ∀ j ∈ {1, ..., n}    (3.3)

where i denotes the i-th sentence and m is the number of sentences in the documents. The resultant sentence vector encapsulates the inherent meaning and context of its composing words.
6After removing stopwords.
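Steps 1 and 2, composition by vector addition (Eq. 3.2) followed by per-dimension normalization (Eq. 3.3), can be sketched as follows. The word vectors here are hypothetical toy numbers, not values from the thesis.

```python
import numpy as np

def sentence_vectors(sentences, word_vec, n):
    """Compose each sentence by adding its word vectors (Eq. 3.2), then
    normalize every dimension across all sentences (Eq. 3.3), so the
    sentence-by-dimension matrix becomes left stochastic (columns sum to 1)."""
    S = np.zeros((len(sentences), n))
    for i, sent in enumerate(sentences):
        for w in sent:                  # stopwords assumed already removed
            if w in word_vec:
                S[i] += word_vec[w]     # concept combination by addition
    col_sums = S.sum(axis=0)
    col_sums[col_sums == 0] = 1.0       # leave all-zero dimensions untouched
    return S / col_sums                 # broadcast divide: each column sums to 1

# Toy 3-dimensional HAL concept vectors (hypothetical numbers).
word_vec = {"tb":   np.array([0.0, 3.0, 1.0]),
            "aids": np.array([2.0, 0.0, 1.0]),
            "drug": np.array([1.0, 1.0, 0.0])}
V = sentence_vectors([["tb", "aids"], ["drug"]], word_vec, n=3)
```

After normalization every column of V sums to 1, which is exactly the left stochastic property described in heuristic 3(b).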
This can be examined from the following example of a sentence and its vector in conceptual space. The doctors warned that besides being at risk of getting tuberculosis themselves, AIDS-infected addicts who carry the TB bacteria also may pass the germs to people they live with, to health care workers and other people.7
< (germs: 0.106), (pass: 0.101), (live: 0.099), (special: 0.097), (care: 0.094), (aggressive: 0.077), (intervention: 0.065), (montefiore: 0.064), (identification: 0.062), (minority: 0.060), (administer: 0.060), (contracting: 0.055), (myers: 0.054), (strategies: 0.054), (showing: 0.053), (scared: 0.052), (symptoms: 0.051), (selwyn: 0.049), (largely: 0.049), (review: 0.049), (added: 0.048), (elderly: 0.048), (capacity: 0.048), (warned: 0.047), (died: 0.047) >
In the above example, the words “aggressive”, “special”, “contracting” and “intervention”8 do not occur in the sentence, yet they are weighted highly in the vector. We observe this because, in the context of the documents, this sentence suggests that AIDS-infected people having TB should be given special care, and that special measures like intervention should be taken by the officials to stop the disease from spreading further. This shows that a sentence vector obtained by combination has inferential characteristics. These observations are similar to those for a single-word concept; as a result, a sentence vector can be interpreted as a concept in the HAL space. Concepts obtained by combination are more refined and are capable of disambiguating between multiple contexts. These two properties of combined concepts have been shown in [58] by vertical and horizontal tests respectively. Here, we show the effect of combination on a sentence concept with respect to these properties. Consider the following two sentences and their vectors.
1. Tuberculosis is caused by a bacterium that commonly affects the lungs but can attack almost any organ.9 < (affects: 0.500), (attack: 0.281), (organ: 0.167), (organs: 0.129), (commonly: 0.110), (bacterium: 0.059), (lungs: 0.039), (caused: 0.036), (attacks: 0.029), (majority: 0.024), (preventable: 0.023), (vast: 0.023), (long: 0.012), (last: 0.011), (transmitted: 0.011), (communicable: 0.009), (ill: 0.009), (crowded: 0.009), (decades: 0.008), (air: 0.007), (workers: 0.007), (poor: 0.006), (highly: 0.003), (infection: 0.003), (sick: 0.003) >
7DUC-2001 dataset d15c, DOC NO: AP890302-0063 8There are others but we emphasize them because of their higher weights. 9DUC-2001 dataset d15c, DOC NO: AP900521-0063
2. The disease, which attacks the lungs, has long been associated with poor, crowded living conditions.10 < (poverty: 0.17), (history: 0.17), (crowded: 0.15), (ravaged: 0.143), (living: 0.138), (vengeance: 0.13), (affects: 0.125), (famous: 0.097), (opportunistic: 0.094), (attack: 0.094), (conditions: 0.091), (housing: 0.091), (attacks: 0.088), (cancer: 0.088), (organs: 0.081), (blamed: 0.074), (shortcomings: 0.067), (illness: 0.057), (organ: 0.056), (back: 0.053), (long: 0.049), (combination: 0.049), (socioeconomic: 0.048), (research: 0.048), (europe: 0.044) >
Sentence 1 talks about tuberculosis, its cause and the affected parts. This is evident from the sentence vector, where “affects”, “organs”, “bacterium”, “lungs” and “caused” are highly weighted. Further observe that the concepts “transmitted”, “communicable” and “infection” are also highly weighted, which tells us about the nature of the disease. This shows that concept combination has enriched the sentence with new information and we obtain a refined representation of the sentence. Similarly, sentence 2 also talks about tuberculosis (though the word does not occur explicitly in the sentence), the affected parts, and factors which are socioeconomic in nature. This is evident from the high weights of “illness”, “organs”, “poverty”, “crowded”, “living”, “housing”, “conditions” and “socioeconomic”. We notice that both sentences talk about a common disease and its affected parts. Moreover, the first sentence talks about the cause and the other about socioeconomic factors. The distinction can be observed from their respective vectors, where one weighs “bacterium” high whereas the other weighs “poverty” high. This distinction can also be seen from the weight of “poor” in sentence 1, which is relatively low. From the above we can conclude that a sentence vector obtained by concept combination has the following properties:
1. It is a concept in the constructed conceptual space encapsulating all the inferential characteristics of the concepts composing the sentence.
2. It is highly enriched, which provides more depth to the meaning of the concept.
3. It has a sense of uniqueness and can disambiguate itself from a similar sentence in the given context.
Next, we describe our underlying principle for forming summaries in the conceptual space. Based on this principle, we propose two metrics and a redundancy-removal technique to realize the summaries.
3.4 Conceptual Multi-Document Summarization (CMDS)
This section describes the construction of the CMDS system. A schematic overview of the system is shown in Figure 3.2.
10DUC-2001 dataset d15c, DOC NO: AP900215-0031
Figure 3.2 Schematic overview of the complete system
Figure 3.3 Representation of documents and summary in a 3-dimensional conceptual space.
3.4.1 Principle
For a summary S containing l sentences we define W_{Sj} for the j-th dimension of S as,

W_{Sj} = \sum_{i=1}^{m} α_i wt_{ij},   where α_i = 1 if sentence i is present in S and α_i = 0 otherwise.
Then we use the characteristics of the sentence representation to propose the following conjecture for forming a summary.
Conjecture 1 A summary S, however concise it may be, can provide maximum overview of the documents if it contains those sentences which maximize W_{Sj} for the maximum number of dimensions, given that all the concepts are treated uniformly.
Let S and S′ be collections of l sentences.
Let N_S = the number of dimensions for which W_{Sj} > W_{S′j}.
Let N_{S′} = the number of dimensions for which W_{Sj} < W_{S′j}.
Assume N_{S′} > N_S. Since all the concepts are treated equally (given), S′ provides a better overview of the text than S, because S′ contains text which gives more information by maximizing more concepts than S. Figure 3.3 shows a pictorial representation of documents and their summary in a 3-dimensional conceptual space. From this it is apparent that as the area covered by the summary increases, its resemblance to the documents increases; the meaning of the summary becomes more similar to that of the documents. The summary area can be increased in two ways: first by taking
more concepts into the summary, and second by choosing those concepts which have high weights for these dimensions. However, the summary size is restricted, so we adopt the second way, choosing concepts (sentences) having higher dimensional weights. Based on this principle, sentences are scored using the following metrics.
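The dimension weights W_{Sj} and the comparison underlying Conjecture 1 can be sketched as follows; the matrix values are toy numbers, and the function names are ours.

```python
import numpy as np

def dimension_weights(V, summary_idx):
    """W_Sj = sum over the summary's sentences of wt_ij (alpha_i = 1
    for sentences in S, 0 otherwise)."""
    return V[list(summary_idx)].sum(axis=0)

def better_overview(V, S, S_prime):
    """Count the dimensions each candidate summary maximizes; by
    Conjecture 1, the candidate winning on more dimensions gives the
    better overview of the documents."""
    w, w2 = dimension_weights(V, S), dimension_weights(V, S_prime)
    return int((w > w2).sum()), int((w2 > w).sum())

# Toy normalized sentence-by-dimension matrix (rows are sentences).
V = np.array([[0.6, 0.1, 0.1],
              [0.1, 0.6, 0.1],
              [0.3, 0.3, 0.8]])
n_s, n_sp = better_overview(V, {0}, {2})   # N_S = 1, N_S' = 2
```

Here the one-sentence summary {2} dominates {0} on two of the three dimensions, so under the conjecture it is the better overview.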
3.4.2 Metrics
1. Rank: In each dimension, sentences are ranked in decreasing order of their weights. Let r_{ij} denote the rank of sentence i along dimension d_j and sc_i denote the score of sentence i. The score across all the dimensions is computed as follows:

sc_i = \sum_{j=1}^{n} 1 / (r_{ij})^{1/x},   x ∈ [1, 8]    (3.4)

For every dimension, the inverse of the x-th root of the rank is added to the score.
2. Weight: The weight of a sentence along a given dimension directly represents its strength for that dimension. So, the score is computed by merging the weights of a sentence over all its dimensions as follows:

sc_i = \sum_{j=1}^{n} (wt_{ij})^{1/y},   y ∈ [1, 5]    (3.5)

For every dimension, the y-th root of the weight is added to the score.
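The two metrics, Eq. (3.4) and Eq. (3.5), can be sketched as follows on a toy sentence-by-dimension matrix; rank 1 goes to the sentence with the highest weight in a dimension.

```python
import numpy as np

def rank_metric(V, x=5):
    """Eq. (3.4): for each dimension add 1 / rank^(1/x), where rank 1
    is the sentence with the highest weight in that dimension."""
    order = np.argsort(-V, axis=0)          # descending order per column
    ranks = np.empty_like(order)
    m, n = V.shape
    for j in range(n):
        ranks[order[:, j], j] = np.arange(1, m + 1)   # 1-based ranks
    return (1.0 / ranks ** (1.0 / x)).sum(axis=1)

def weight_metric(V, y=2):
    """Eq. (3.5): for each dimension add wt_ij^(1/y)."""
    return (V ** (1.0 / y)).sum(axis=1)

# Two sentences, two dimensions; each sentence dominates one dimension.
V = np.array([[0.9, 0.1],
              [0.1, 0.9]])
rank_sc = rank_metric(V)      # symmetric: both sentences score equally
wt_sc = weight_metric(V)
```

Note how the root dampens differences: with y = 2 the scores of high- and low-weight dimensions move closer together, which matches the intuition discussed later that higher roots bring sentence scores closer when summing over many dimensions.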
3.4.3 Redundancy Removal
Redundancy in a summary should be minimal. In order to create non-redundant summaries, concepts already covered by the summary are removed from the conceptual space. This reduces the dimensionality of the space and, as a result, concept vectors are represented as,
c_i = <wt_{i1}, ..., wt_{ij}, ..., wt_{in}>
such that,
∄ d_j such that d_j ∈ S ∧ d_j ∈ CS
Hence, further scoring of sentences is done over the remaining dimensions. This reduces the search space and selects sentences encapsulating the remaining concepts, covering all the topics and making the summary non-redundant. Selected sentences will not be ranked again, as the new search space does not contain any of their constituent concepts. Algorithm 1 describes the summary formation procedure and Algorithm 2 describes the score-update function. Note that, at a given time, only one of the two metrics is used and the other is assigned the value 0.
Algorithm 1 Summary Formation
1: Input:
   • The sentence set: Set_s = [s1, s2, ..., sm]
   • The word set: Set_w = [w1, w2, ..., wn]
   • Sentence vectors: V = [v1, v2, ..., vm]
   • Summary size limit: L
   • Root of rank: x
   • Root of weight: y
2: Output:
   • Set of summary sentences: S ⊆ Set_s
3: Procedure:
4: initialize sc;
5: initialize wordFlag ← {False}^n;
6: while size(S) < L do
7:   sc ← {0}^m;
8:   for i ← 1, n do
9:     if wordFlag[i] ≠ True then
10:      sc ← UpdateScores(V_{∗i}, sc, x, y);
11:    end if
12:  end for
13:  i ← indexOfMaxScore(sc);
14:  S ← S + s_i;
15:  for all w ∈ s_i do
16:    wordFlag(indexOf(w)) ← True;
17:  end for
18: end while
19: return S
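The greedy loop of Algorithm 1 can be sketched as follows. This is a simplified rendering with two assumptions of ours: the size limit L counts sentences rather than words, and a sentence's "words" are approximated by the nonzero dimensions of its vector.

```python
import numpy as np

def form_summary(V, sentences, L, x=5, y=0):
    """Greedy summary formation in the spirit of Algorithm 1.

    Dimensions already covered by the summary are flagged and skipped in
    later rounds, which shrinks the search space and removes redundancy.
    Exactly one metric is active: x > 0 uses the rank metric (Eq. 3.4),
    otherwise y > 0 uses the weight metric (Eq. 3.5)."""
    m, n = V.shape
    covered = np.zeros(n, dtype=bool)        # wordFlag
    chosen = []
    while len(chosen) < L:
        sc = np.zeros(m)
        for j in range(n):
            if covered[j]:
                continue                      # skip covered dimensions
            col = V[:, j]
            if x > 0:                         # rank metric
                ranks = np.empty(m, dtype=float)
                ranks[np.argsort(-col)] = np.arange(1, m + 1)
                sc += 1.0 / ranks ** (1.0 / x)
            elif y > 0:                       # weight metric
                sc += col ** (1.0 / y)
        sc[chosen] = -np.inf                  # never pick a sentence twice
        best = int(np.argmax(sc))
        chosen.append(best)
        covered |= V[best] > 0                # flag the sentence's dimensions
    return [sentences[i] for i in chosen]

# Toy example: three sentences, each dominating one dimension.
V = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
summary = form_summary(V, ["a", "b", "c"], L=2)
```

In a real run the covered-dimension test would use the literal words of the selected sentence, as Algorithm 1 specifies; the nonzero-dimension proxy here only approximates that.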
3.5 Experimental Setup
Algorithm 2 Update Scores
1: Input:
   • Weight vector (= V_{∗i}): Wt = [wt1, wt2, ..., wtm]
   • Sentence scores: sc = [sc1, sc2, ..., scm]
   • Root of rank: x
   • Root of weight: y
2: Output:
   • Updated sentence scores: sc = [sc1, sc2, ..., scm]
3: Procedure:
4: initialize r ← ranks of the sentences along this dimension;
5: if x > 0 then
6:   for i ← 1, m do
7:     sc_i ← sc_i + 1 / (r_i)^{1/x};
8:   end for
9: else if y > 0 then
10:  for i ← 1, m do
11:    sc_i ← sc_i + (wt_i)^{1/y};
12:  end for
13: end if

11 http://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
   http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html
12 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

In this study, we used the standard summarization datasets DUC 2001 and DUC 2002 for evaluation. These datasets were chosen because standard human summaries are available for them. An important feature of these summaries is that they were built to evaluate generic summarization tasks11. So, these datasets can be effectively used to evaluate any generic text summarizer. DUC 2001 and DUC 2002 contain 30 and 60 document sets respectively, with 10 news articles in each set. Sentences in DUC 2001 were separated manually; for DUC 2002, they were separated by NIST. Stopwords were removed before the summarization process, based on the list provided by MIT.12 For evaluation purposes, DUC 2001 provides 4 human summaries and DUC 2002 provides 2 human summaries, of sizes 50, 100, 200 and 400 words. Note that DUC 2002 does not contain human summaries of size 400 words. We consider the results on DUC 2001 more reliable because they are evaluated against more human summaries. All evaluation scores are computed using ROUGE, which has been widely used by DUC to evaluate system summaries. We use the automatic evaluation measures ROUGE-1, ROUGE-2 and ROUGE-SU4, which compute unigram recall, bigram recall and the overlap of skip-bigrams13 respectively. We have conducted the following experiments:
1. Intrinsic experiments
• Variation of the window size (K) from 1 to 9.
• Variation of the x-th root from 1 to 8 for the rank metric.
• Variation of the y-th root from 1 to 5 for the weight metric.
2. Extrinsic evaluation
• Comparison of CMDS summaries with previous state-of-the-art systems (briefly described later).
13 Skip-bigrams are pairs of words which allow arbitrary gaps but preserve sentence order. ROUGE-SU4 allows a skip distance of 4 words.
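The skip-bigram notion used by ROUGE-SU4 (footnote 13) can be sketched as follows. This is our simplified reading, interpreting the skip distance as the number of intervening words; the real ROUGE-SU4 also counts unigrams and applies its own smoothing, so this is an illustration only.

```python
from itertools import combinations

def skip_bigrams(tokens, max_gap=4):
    """Ordered word pairs that preserve sentence order and allow at
    most max_gap words in between (our reading of 'skip distance 4';
    implementations vary on this point)."""
    pairs = set()
    for i, j in combinations(range(len(tokens)), 2):
        if j - i - 1 <= max_gap:
            pairs.add((tokens[i], tokens[j]))
    return pairs

def skip_bigram_recall(candidate, reference):
    """Fraction of the reference's skip-bigrams found in the candidate
    (a recall-style sketch, not the full ROUGE-SU4 score)."""
    ref = skip_bigrams(reference)
    return len(skip_bigrams(candidate) & ref) / len(ref) if ref else 0.0

cand = "police kill the gunman".split()
ref = "police killed the gunman".split()
score = skip_bigram_recall(cand, ref)   # 3 of 6 reference pairs match
```

Because pairs preserve order, ("police", "gunman") counts but ("gunman", "police") does not, which is what distinguishes skip-bigrams from an unordered bag of word pairs.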
3.6 Results and Discussion
3.6.1 Intrinsic Experiments
3.6.1.1 Effect of variable window size:
In this section we discuss the quality of summaries when the window size K is varied. Recall that window size is an intrinsic parameter of HAL which governs the number of co-occurrence relations to be captured; a longer window, however, may result in forming false associations between words. Figure 3.4 shows graphs of ROUGE scores when K is varied between 1 and 9. A significant rise in summary quality can be observed from K = 1 to K = 3. This rate of change decreases and gradually stabilizes after K = 5. We have kept K = 6 for creating summaries as it gives slightly improved results.
Figure 3.4 Summary quality v/s window size (K)
3.6.1.2 Effect of variable metrics:
Sentence scoring is an important part of summarization. In CMDS it is influenced by two metrics: rank and weight. Here, we discuss the behavior of summaries when the metrics are varied and finally arrive at preferred values for these metrics. Figure 3.5 shows graphs of ROUGE scores when x is varied between 1 and 8. We can observe that for longer summaries both the DUC 2001 and DUC 2002 datasets show similar behavior. In the first dataset, when the value of x is increased from 1 to 5 the quality of summaries improves gradually; however, the quality is either constant or decreases for x between 5 and 8. Similarly, for the second dataset the quality is lower on either side of x = 4. These observations indicate that optimum summaries can be obtained when x lies between 4 and 5. For shorter summaries more variation can be observed in the DUC 2002 dataset, but the trend remains similar to that of longer summaries; here, optimal summaries can be obtained when x lies between 5 and 6. From these observations we conclude that for x = 5 optimal quality can be achieved for both longer and shorter summaries.
Figure 3.5 Summary quality v/s xth root of rank metric
Figure 3.6 shows graphs of ROUGE scores when y is varied between 1 and 5. It is seen that for most of the summaries an optimum quality is achieved at y = 2, and the quality remains consistent for higher values (3, 4 and 5). For higher roots of the weights, the scores of the sentences are closer to each other. This helps because we are adding the scores over a large number of dimensions (equal to the total number of distinct words in the documents). Thus, for optimal summary formation using CMDS we use x = 5 for the rank metric and y = 2 for the weight metric.
3.6.2 Extrinsic Evaluation
For extrinsic evaluation we have compared our system to previous state-of-the-art systems. Following is a brief description of those systems:
1. Random (baseline): Random sentences are selected for the summary.
2. LSA [17]: SVD is applied to the term-by-sentence matrix and the highest-ranking sentences are selected for the summary.
Figure 3.6 Summary quality v/s yth root of weight metric
3. TR-* [44]: A graph-based14 text summarization method called TextRank. This method considers the sentences as graph nodes and uses the modified HITS [29], Positional Power Function [21] and PageRank [50] algorithms to rank the sentences. Top-ranked sentences are taken into the summary.
4. ClusterHITS [61]: A graph-based text summarization method in which topic clusters are considered as hubs and sentences as authorities; the HITS algorithm is then used to rank the sentences. Top-ranked sentences are taken into the summary.
5. DSDR-nonlin [20]: A data-reconstruction-based text summarization approach that generates the summary which gives the best reconstruction of the original documents. It uses nonnegative linear reconstruction, which allows only additive, not subtractive, linear reconstructions.
Table 3.1 shows ROUGE-2 scores for all the systems on the DUC-2001 and DUC-2002 data. Our system (CMDS) outperforms all other systems, and the improvement increases with summary size. Figure 3.7 shows a graphical representation of all the ROUGE scores for all the systems. It can be observed that our system performs better for longer summaries. We attribute this observation to two reasons: first, sentence scoring favors those sentences which give maximum coverage in the conceptual space, which is more probable for longer sentences; second, redundancy removal further helps by capturing as many concepts as the summary size allows. These evaluations show that CMDS is more effective at generating longer summaries while being comparable for shorter summaries.
14Undirected version of the graph is used because, the direction could not be decided between sentences of different documents.
Table 3.1 Average F-measure (ROUGE-2) scores for various state-of-the-art systems

                 DUC 2001 (summary size in words)       DUC 2002 (summary size in words)
System           50       100      200      400         50       100      200
Random           0.01639  0.03292  0.04520  0.08138     0.02227  0.03835  0.06043
LSA              0.02641  0.03928  0.06158  0.08844     0.03072  0.04427  0.06703
TR(HITS)         0.03659  0.05986  0.07598  0.10414     0.05124  0.05941  0.08394
TR(PPF)          0.03610  0.04597  0.06900  0.09753     0.04955  0.05438  0.07091
TR(PageRank)     0.03237  0.05442  0.07692  0.10772     0.04750  0.06397  0.08709
ClusterHITS      0.03907  0.05234  0.07457  0.09648     0.04949  0.05879  0.08026
DSDR-nonlin      0.02638  0.04721  0.06862  0.10210     0.02933  0.04674  0.07541
CMDS             0.03971  0.06155  0.08154  0.11467     0.05209  0.06820  0.09575
3.7 Summary and Conclusion
In this work we have used an inferential space to find the most informative summary for a set of documents. The space has the ability to evolve by combination, such that more refined and contextually clear concepts can be obtained. Based on these characteristics, we have formulated a conjecture which suggests that an ideal summary should encapsulate as many of the concepts present in the documents as possible. Following this, we introduced two scoring metrics to score the sentences. These scoring metrics have produced quality summaries, which has been verified by extrinsic experiments. Intrinsic experiments provide an insightful picture of the effect of the scoring parameters on the quality of a summary. This work contributes in the following ways: first, an extension of the concept-combination characteristic of conceptual spaces to define sentences; second, a proposition of heuristics for combining concepts which underlie this extension and differ from previous approaches; third, a novel theoretical framework for summary formation with experimentally estimated parameters. We conclude that conceptual spaces are an efficient cognitive model for representing text. Their characteristics allow us to solve various problems, summarization being one of them. This is demonstrated by Figure 3.8, which shows the final summary for the d15c documents in the DUC 2001 dataset.
Figure 3.7 Graphical representation of ROUGE scores for all the systems
1. The doctors warned that besides being at risk of getting tuberculosis themselves, AIDS-infected addicts who carry the TB bacteria also may pass the germs to people they live with, to health care workers and other people.
2. The agency estimated that between 15 million and 20 million adults will be infected with HIV by the year 2000, and it predicted that the number of cases and deaths from tuberculosis will rise sharply as a result, especially in sub-Saharan Africa, Latin America and Southeast Asia.
3. The health department said it is providing tuberculosis testing and treatment for the Human Resources Administration’s program for the homeless, and will train staff members on tuberculosis prevention and control.
4. The U.N. agency, in its first comprehensive look at global tuberculosis in a decade, said the disease kills nearly 3 million people a year, most of them between the ages of 15 and 59, ”the segment of the population that is economically most productive”.
5. While most people with the AIDS virus eventually go on to get acquired immune deficiency syndrome, people who carry the tuberculosis bacteria ordinarily have only about a 10 percent life-long risk of getting TB.
6. Snider said 10 million to 15 million Americans have been infected with the tuberculosis germ, but only a small percentage of them develop the disease because their immune system was strong enough to prevent the disease from developing.
7. The department also has an established residence for homeless tuberculosis patients, and is working with substance-abuse treatment services to extend tuberculosis prevention in its programs.
8. The Board of Health approved a resolution last year requiring all children entering city schools to be tested.
9. Seven of the eight TB cases occurred in people who were already infected with tuberculosis bacteria before the study began.
10. NEW YORK – The incidence of active tuberculosis cases in the city rose 38% in 1990, to 3,520 cases, according to the health commissioner.
Figure 3.8 Final CMDS Summary for DUC2001: d15c
Chapter 4
Multilingual Multidocument Text Summarization
Knowledge is not bound by language; however, its expression and spread depend on its linguistic source. People relate more easily to a text written in their native language. This means that information presented in a widely spoken language is taken up by a larger part of society, while information in lesser-known languages is left out. Hence, a tool that can cover and extract information from multiple linguistic sources can provide more diverse and complete information. Summarization aims to give a complete overview of a set of topically similar documents. Language-bound methods can handle text in only a single language: they may be highly accurate for their domain language, but they cannot operate on documents in other languages. Thus, summarization methods that can generate summaries from text in multiple languages are not only useful but necessary.
Most previous approaches use clustering and translation of documents to form the summary. The basic idea of these techniques [5, 12, 6] is to first group similar information together by clustering the documents of each language separately. Similar clusters are then matched across languages by translating the clusters of one language and identifying the most similar cluster in the other. The final output is produced in the user's desired language by substituting sentences from the other languages with similar sentences in the required language. Chen et al. [5] showed that translation after clustering performs better than translation before clustering.
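The cluster-then-translate matching step described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: `translate` is a hypothetical term-translation function (a real system would use an MT service or bilingual dictionary), and clusters are represented simply as term-frequency bags.

```python
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    if num == 0:
        return 0.0
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den

def align_clusters(src_clusters, tgt_clusters, translate):
    """Match each source-language cluster to its most similar
    target-language cluster after translating the source terms."""
    pairs = []
    for i, sc in enumerate(src_clusters):
        translated = Counter()
        for term, freq in sc.items():
            translated[translate(term)] += freq
        best = max(range(len(tgt_clusters)),
                   key=lambda j: cosine(translated, tgt_clusters[j]))
        pairs.append((i, best))
    return pairs
```

After alignment, each matched pair of clusters is reduced to representative sentences in the user's desired language, which is the substitution step the approaches above describe.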
However, earlier approaches do not address an important aspect: the effect of added noisy information on summary quality. The noise originates from the interaction of two languages, through either translation or transliteration. The added information originates from the fact that different documents can contain different information; if the information differs across languages, a multilingual summary should encapsulate all of it. Figure 4.1 depicts this notion of added noisy information. The common (overlapping) part of the figure usually contains the general overview of the topic covered in both documents, along with some key points. The dotted part represents the new information, which should be incorporated in the summary.
Figure 4.1 Added Noisy Information
4.1 MultiLingual Summarization using Jensen-Shannon Divergence
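As background for this section, the Jensen-Shannon divergence between two term distributions P and Q can be computed as below. This is the standard definition (symmetric, and bounded in [0, 1] when using base-2 logarithms), not a claim about how the thesis implements it; the distributions are assumed to be defined over a shared vocabulary.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0, and completely disjoint distributions score 1, which makes the measure convenient for comparing a candidate summary's word distribution against that of the source documents.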