
Representing discourse for automatic text summarization via shallow NLP techniques

PhD Thesis

Laura Alonso i Alemany

Departament de Lingüística General, Universitat de Barcelona

Barcelona, 2005

Advisors

Irene Castellón Masalles, Universitat de Barcelona
Lluís Padró Cirera, Universitat Politècnica de Catalunya

there is life beyond perfection

Abstract

In this thesis I have addressed the problem of text summarization from a linguistic perspective. After reviewing some work in the area, I have found that many satisfactory approaches to text summarization rely on general properties of discourse that are reflected in the surface realization of texts. The main claim of this thesis is that some general properties of the discursive organization of texts can be identified at surface level, that they provide objective evidence to support theories about the organization of texts, and that they can help to improve existing approaches to text summarization. In order to provide support for this claim, I have determined which shallow cues are indicative of discourse organization, and, of these, which fall under the scope of the natural language processing techniques that are currently available for Catalan and Spanish: punctuation, some syntactic structures and, most of all, discourse markers. I have developed a framework to obtain a representation of the discursive organization of texts of inter- and intra-sentential scope. I have described the nature of minimal discourse units (discourse segments and discourse markers) and the relations between them, relating theoretical explanations with empirical descriptions and systematizing discursive effects as signalled by shallow cues. Based on the evidence provided by shallow cues, I have induced an inventory of basic meanings to describe the discursive function of relations between units. The inventory is organized in two dimensions of discursive meaning: structural (continuation and elaboration) and semantic (revision, cause, equality and context). In turn, each dimension is organized along a range of markedness, with a default meaning that is assigned to unmarked cases. I have shown that the proposed representation contributes to improving the quality of automatic summaries in two different approaches: it was integrated within a summarizer to obtain a representation of text that combines cohesive and coherence aspects of texts, and it was also one of the components of the analysis of text in an e-mail summarizer. Finally, some experiments with human judges indicate that this representation of texts is useful to explain how some discursive features influence the perception of relevance, more concretely, to characterize those fragments of texts that judges tend to consider irrelevant. Experiments with automatic procedures for the analysis of texts correlate well with the perception of relevance observed in human judges.


Acknowledgements

If a thesis supposes both a personal progress and a scientific contribution, I feel the balance between the two is far more advantageous for me than for the scientific community, as will be reflected in these acknowledgements. In the first place, I have to say that this work would have never been possible without the financial support of the Spanish Research Department, which paid for four years of my learning under grant PB98-1226. First of all, a special acknowledgement goes to the people who have revised and read part or the whole of this thesis, especially my advisors and the programme committee, because they had to suffer this ability that runs in my family, that we make the telling of a story last longer than the story itself. The people I want to sincerely thank are my supervisors, Irene Castellón and Lluís Padró. Their support never failed, despite difficulties at all levels. They proved to be a neverending source of sense, perspective and encouragement. I hope I will some day be able to express my gratitude for all their help. During this time I have worked with many other people who helped me come closer to being a scientist. Karina Gibert helped me to put some order in the chaos I had in front of me, I learnt to have fun with achievements with Bernardino Casas, Gemma Boleda had and kept the will to continue working until we could reach something good, and I learnt so much from and with Maria Fuentes that I could not possibly write it here. Moreover, a lot of people annotated texts just because it is good to help others, especially Ezequiel and Robert, who spent many Sunday afternoons thinking about causes and consequences. An important part of the fact that this thesis was begun, pursued and (finally!) finished has to be attributed to my attending academic events, where people made me feel part of a community and made my work feel worthwhile, and where I could get a clearer idea of what I could expect from research and what I wanted to avoid. Of these, the three wise men provided most valuable advice in the plainest way. Horacio Rodríguez kept reminding me that the main purpose of NLP is that we have jobs, but never failed to recognise the beauty of a good explanation, the interest of unrealizable ideas and the use of (some of) our implementations. Toni Badia was most kind in helping me to interpret things that were far beyond my perspective. Henk Zeevat made me feel that not knowing things was not necessarily a shame, but a circumstance, and that getting to know things required some effort but was worthwhile, and never found it difficult to spend as much time as necessary talking, until things made sense. Finally, the memory of my grandfather assured me that it was important to try to carry out solid work, regardless of how far I might get.


Most especially, I would like to thank all those people who, without having anything to do with my work, suffered directly from the fact that I was working. First of all, my dear Òscar, who suffered my reading on Sundays, my stress for deadlines and all my anxiety and uncertainties, but enjoyed almost nothing of my maturity. Then, Maria had to listen to incomprehensible talk about my research, renouncing for years the wonderful apple-talk we have finally recovered. Josep had to listen to my constant mhms and ouchs and brrs, but we always managed to share precious silences. My family hardly remembered my face when I began visiting them again, but my brother and my mother always provided support in the best way they knew: loads of home-made food, piles of references to apparently useless technicalities, and most of all a strong basis for my tired and weak principles and unconditional love. Finally, I thank Gabriel, for furnishing my feet with solid ground, so that we could walk on to the last stages of this work.

Contents

1 Introduction
1.1 Motivation
1.2 Aim of the thesis
1.3 Summary of the thesis

2 Approaches to text summarization
2.1 Introduction
2.1.1 Text summarization from a computational perspective
2.2 Evaluation of summaries
2.3 State of the art
2.3.1 Lexical information
2.3.2 Structural information
2.3.3 Deep understanding
2.3.4 Approaches combining heterogeneous information
2.3.5 Critical overview
2.4 Discourse for text summarization
2.4.1 Previous work in exploiting discourse for text summarization
2.4.2 Utility of shallow discourse analysis
2.5 Discourse for text summarization via shallow NLP techniques
2.5.1 Delimiting shallow NLP
2.5.2 Approach to the representation of discourse
2.6 Discussion

3 Specifying a representation of discourse
3.1 Assumptions about the organization of discourse
3.2 A structure to represent discourse
3.2.1 Delimiting a representation of discourse
3.2.2 Specification of the structure
3.3 Discourse segments
3.3.1 Previous work on computational discourse segmentation
3.3.2 Definition of discourse segment
3.4 Discourse markers
3.4.1 Previous approaches to discourse markers
3.4.2 Definition of discourse marker


3.4.3 Representing discourse markers in a lexicon
3.5 Discussion

4 Meaning in discourse relations
4.1 Linguistic phenomena with discursive meaning
4.1.1 Shallow linguistic phenomena with discursive meaning
4.1.2 Reliability of shallow cues to obtain discourse structures
4.2 Discursive meaning inferable from shallow linguistic phenomena
4.2.1 Advantages of compositional discourse
4.2.2 Determining an inventory of discursive meanings
4.2.3 A minimal description of discourse relations
4.3 Discussion

5 Empirical Support
5.1 Significance of empirical data
5.1.1 Study of variance
5.1.2 Hypothesis testing
5.1.3 Ratio of agreement
5.2 Human judgements on discourse segments
5.2.1 Corpus and judges
5.2.2 Study of variance
5.2.3 Probability that a segment is removed
5.2.4 Latent class analysis
5.3 Human judgements on discourse relations
5.3.1 Corpus and judges
5.3.2 Annotation schema
5.3.3 Preliminary results of annotation
5.4 Automatic identification of discourse segments
5.4.1 Algorithm to identify discourse segments
5.4.2 Variants of the segmentation algorithm
5.4.3 Evaluation of the automatic identification of discourse segments
5.5 Automatic identification of discourse markers
5.5.1 Features characterizing prototypical discourse markers
5.5.2 Knowledge-poor acquisition of discourse markers
5.5.3 Results and discussion
5.6 Discussion

6 Conclusions
6.1 Contributions
6.2 Future Work

A Lexicon of discourse markers

B Heuristics to determine the structural configuration of discourse
B.1 Heuristics to determine the most adequate attachment point
B.2 Heuristics to determine the topography of the attachment

CHAPTER 1 Introduction

Automatic text summarization (AS) deals with the process of reducing one or more texts to a summary text that condenses the most relevant information given a certain information need, either by concatenating literal fragments of the original text(s) or by identifying key concepts and generating a new text, all of this by means of automatic techniques for natural language processing (NLP).

1.1 Motivation

The summarization of texts is an area of research that has attracted the interest of many researchers through the years, because it can contribute to a better understanding of the way people produce and understand language, because it can address the growing need for information synthesis in our society or just because “it is difficult” (Spärck-Jones 2004). Probably this last reason can explain the interest of the area better than any other: AS poses a challenge for the capabilities of current NLP and it addresses unsolved questions about basic linguistic and semiotic aspects of texts and about the faculty of human language in general. From a purely applied perspective, AS is one of the applications of NLP that aim to act as an interface between the particular information needs of particular persons and the huge amount of information that is publicly available to satisfy those needs, especially on the World Wide Web. Many techniques have been developed to address this problem: information retrieval, question answering, information condensation, information synthesis, and, of course, text summarization. It is often difficult to establish a clear distinction between these different applications, and they tend to merge: text summarization has been addressed as a kind of complex question answering (DUC 2005) or it has been used as an aid for retrieval of structured documents (Kan 2003). From a theoretical point of view, AS can be considered as an experimental method to investigate the adequacy of theoretical proposals about the way people produce and understand texts. If we assume that people summarize texts based on a certain representation of their organization (Kintsch and van Dijk 1983; Sanders and Spooren 2001), the performance of a systematic procedure such as AS upon such a representation can contribute to validate hypotheses about text organization. Indeed, some linguistic theories of text organization (Grosz and Sidner 1986; Mann and Thompson 1988; Polanyi 1988) have been applied to represent texts as the basis for AS systems to produce summaries.


Despite the many efforts devoted to it, the problem of how to summarize texts is still far from solved. On the one hand, it seems that the capabilities of NLP systems are insufficient to provide summaries of texts that are comparable to summaries produced by humans. This goal seems to require understanding and generation capabilities that are well beyond what can be achieved with current state-of-the-art tools and resources. On the other hand, the whole process of summarization is still unclear. The computational process of summarizing texts has been divided into two major steps: analyzing the input text(s) and producing a summary, but it is not clear how either of these two steps should be addressed. One of the main reasons for this is the fact that the quality of different summaries for a given text cannot be properly evaluated. The task of summarization seems to be so intrinsically interpretative that different people usually produce very different summaries for a given text, and also judge the quality of automatic summaries very differently. As a consequence, it cannot be assessed whether a given technique introduces an improvement to the process of summarization, because it cannot be assessed whether the summaries produced with this technique are better than without it. However, the situation is not desperate. The basic target of most current summarization techniques consists in identifying units of content in a source text, and finding which are more relevant within the text or to satisfy a given information need; these will constitute the summary. Some techniques are clearly helpful to determine the relevance of content units, for example, techniques based on the frequency of words or on the position of units within a structure of the text, or those exploiting the presence of certain cue words. In contrast, the contribution of more sophisticated techniques, like rhetorical or template-driven analysis, still remains to be assessed. In this thesis I address the problem of AS from a linguistic perspective. I focus on the analysis of the input text(s) as a determining factor to improve text summarization. My starting hypotheses are:

1. A representation of texts at discourse level is useful for AS systems to improve the quality of resulting summaries.

2. Such a representation can be built based on evidence found in the surface realization of texts.

(a) Systematic relations can be established between this evidence and the behaviour of human judges when summarizing texts.
(b) This evidence can be the basis to obtain (part of) the targeted representation of discourse by shallow NLP techniques, more concretely, by those techniques available for Catalan and Spanish.

In Figure 1.1 we can see an illustration of how a certain discursive representation of texts exploiting surface properties can be useful to condense texts.

[1 En este caso, ] [2 [3 y ] [4 gracias al ] excelente trabajo de la antropóloga Silvia Ventosa, ] [5 autora de "Trabajo y vida de las corseteras de Barcelona", ] [6 esta leyenda urbana se comprobó ] [7 que era un calco de una historia ] [8 que conmocionó a la localidad francesa de Orleans ] [9 en 1969 ] .

In this case, and thanks to the excellent work of the anthropologist Silvia Ventosa, author of “Work and life of Barcelona’s seamstresses”, this urban legend was found to be a copy of a story that shook the French town of Orleans in 1969.

summary: esta leyenda urbana se comprobó que era un calco de una historia

this urban legend was found to be a copy of a story

Figure 1.1: Example of how a text can be summarized by taking the most relevant content units as determined by its discursive structure.

In this example, a complex sentence has been represented as a local discourse structure, where elements in the structure represent units of content, hierarchical structure represents the relations between topics and subtopics in the text, and tags in italics represent the coherence relation between two given units, elicited by discourse markers like and or thanks to. This structure can be taken as indicative of the relevance of the content units; in the example, the topmost units in the hierarchical structure of the text are taken as the most relevant and thus constitute a “summary” for the sentence. However, this is not the only possible summary; others could also be considered good summaries, for example:

esta leyenda urbana se comprobó que era un calco de una historia que conmocionó a la localidad francesa de Orleans en 1969

or

gracias al excelente trabajo de la antropóloga Silvia Ventosa, esta leyenda urbana se comprobó que era un calco de una historia

This example serves to illustrate that the assignment of relevance is unclear, even if it is based on a representation of discourse. However, we claim that such a representation should be useful to model the processes of human summarization better, and to make the decisions of automatic systems more principled.
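To make the selection criterion of this example concrete, the following minimal sketch shows how the topmost units of such a local discourse structure could be extracted. It is only an illustration under simplifying assumptions (a plain tree of segments, relevance equated with depth), not the procedure developed in this thesis, and all names in it are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    """A content unit in a local discourse structure (illustrative only)."""
    text: str
    relation: str = "continuation"   # relation to the parent unit, e.g. "cause", "context"
    children: List["Segment"] = field(default_factory=list)

def summarize(root: Segment, max_depth: int = 0) -> str:
    """Keep the units closest to the top of the hierarchy, assuming that
    topmost units are the most relevant ones."""
    def collect(node: Segment, depth: int) -> List[str]:
        if depth > max_depth:
            return []
        kept = [node.text]
        for child in node.children:
            kept.extend(collect(child, depth + 1))
        return kept
    return " ".join(collect(root, 0))

# A rough rendering of the structure in Figure 1.1: the main assertion on top,
# contextual and causal satellites attached below it.
sentence = Segment(
    "esta leyenda urbana se comprobó que era un calco de una historia",
    children=[
        Segment("en este caso", relation="context"),
        Segment("gracias al trabajo de Silvia Ventosa", relation="cause"),
        Segment("que conmocionó a la localidad de Orleans en 1969", relation="context"),
    ],
)
print(summarize(sentence))   # only the topmost unit is kept
```

With a larger max_depth, the satellites would also be included, yielding the longer alternative summaries shown above.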

1.2 Aim of the thesis

The final goal of this thesis is to provide a representation of discourse that improves text summarization and can be obtained by shallow NLP. This general goal can be divided into the following research desiderata:

1. Propose an overall organization of texts to capture those aspects of their discursive organization that are useful for text summarization, elicit implicit assumptions about coherence and relevance, and provide a finer-grained, deeper insight into the organization of text beyond the clausal level.

(a) Take into account key concepts from previous work.
(b) Identify and typify (minimal) units at discourse level.
(c) Determine which properties of these units provide information that is useful for AS, focusing especially on their coherence relations, because this aspect of discourse is very unclear in current work.
(d) Determine the relative relevance of these units according to their role in the organization of discourse that we are analyzing.

2. Assess the role of linguistic mechanisms at discourse level for text summarization (both for producing summaries and for evaluating them), strongly based on empirical data, more concretely, on the surface realization of texts.

(a) Determine surface textual cues that are indicative of discourse organization, focusing on coherence relations.
(b) Determine the scope within which the information provided by these cues is reliable.
(c) Induce an inventory of discourse relations from these shallow cues.
(d) Characterize discourse units by shallow cues.

3. Provide an approach to summarization that can be obtained with knowledge-poor NLP resources, but at the same time provides a representation of texts that can be naturally integrated with knowledge-rich approaches, thus contributing to bridge the gap between shallow and deep NLP.

(a) Determine which textual cues can be exploited by shallow techniques to obtain the targeted representation of discourse with a reliable degree of certainty.
(b) Adapt the representation of discourse proposed theoretically to meet the limitations of shallow techniques.

4. Develop methods and resources to apply this approach to AS in Catalan and Spanish, and also to English.

(a) Create resources storing the shallow cues found useful for AS via shallow NLP, and the information associated with them.
(b) Develop methods to identify and characterize new units using data mining methods.
(c) Develop algorithms for identification and characterization of units and relations between them, exploiting the above resources and the systematicities found in texts.
(d) Identify systematic relations between the organization of discourse obtained by shallow techniques and the way people summarize texts.
(e) Systematize these relations algorithmically, to exploit discourse organization for AS.

5. Apply the obtained representation of discourse as an aid to AS systems, and assess its contribution to improving the quality of the resulting summaries.

1.3 Summary of the thesis

In Chapter 2 I make an analysis of the problem of text summarization. First I discuss the evaluation of summaries, a major problem in the area. I describe various evaluation methods, and discuss the achievements of the most recent methods, based on the comparison of automatic summaries with an inventory of content units obtained from human summaries. The main advantages of these methods with respect to previous evaluations are that they are based on content overlap rather than string overlap, that they are based on multiple summaries rather than a single gold standard, and that they achieve an important degree of agreement between different human evaluators. I propose some improvements to these methods, namely, to complete the inventory of content units to which automatic summaries are compared by identifying minimal content units in the source texts and assigning a certain relevance score to them. To do that, an automated discourse analysis such as the one I propose in this thesis can be exploited. After a brief overview of different approaches to AS, I argue that approaches that aim at understanding, using deep processing, are costly, typically non-robust, with small coverage, and very often encounter conflicts between different kinds of knowledge. In contrast, robust approaches are based on shallow processing of texts, but they exploit properties of texts in an unprincipled way, which makes theoretical implications difficult. Moreover, it is arguable that these shallow approaches have an upper bound, although this is difficult to assess because the evaluation of summaries is a problem still to be solved. Intuitively, one would think that deeper analyses should improve text summarization. However, it seems that very simple techniques, relying on shallow features of texts, like the distribution of lexical items or the presence of some cue words, are capable of producing summaries of a quality comparable to approaches that use much deeper knowledge. One possible explanation for this phenomenon is that shallow and deep analyses of text capture the same kind of linguistic phenomena, and thus the representations of text that they yield are very comparable. This would explain why summaries would not significantly differ. I argue that shallow approaches need not be limited, but that they can be very useful to gain deeper understanding if they are explicitly and systematically related to a theory of the organization of texts. For example, in the analysis of discursive structure by means of shallow cues (Marcu 1997b; Schilder 2002; Reitter 2003a), the evidence is gathered by shallow techniques, but the representation obtained is perfectly coherent with what would have been obtained with much more complex methods and resources, although probably not as rich. The discursive aspect of texts seems to provide an adequate representation of texts to be summarized. To support this claim, I assess the contribution of a very shallow representation of discourse organization to the improvement of automatic summaries for Spanish. Chapter 3 specifies the representation of discourse by which texts will be described via shallow NLP techniques to improve text summarization. I define the structure by which discourse is represented, and I define the basic discourse units: segments and markers.

To define these three concepts I take into account previous computational work. I specify a structure of discourse adapted to the limitations of shallow NLP, in that rich representations are obtained wherever there is enough evidence to do so, and default relations are established between units or constructs when there is no reliable evidence to obtain richer structures. Thus, the presence of shallow cues is crucial to obtain rich representations, and the scope within which the information provided by these cues is reliable determines the scope of rich relations, and leaves the rest to be accounted for by default relations. As a result, discourse is represented as a sequence of content-rich hierarchical trees of local scope, related with each other by default relations. Moreover, we establish that this representation can be multidimensional, so that heterogeneous meanings are captured by independent structures, establishing different relations between the same set of discourse units. Multidimensionality makes it possible to capture some configurations of discourse relations that escape traditional tree-like representations of discourse. However, in the implementation via shallow NLP that we propose in this thesis multidimensionality is not necessary, because such configurations of discourse cannot be captured. As for basic discourse units, discourse segments are basic units of content, while discourse markers convey procedural information about the organization of content units in a structure. In Chapter 4, I address the crucial question of determining an inventory of meanings to be used as labels to describe the semantics of discourse relations. In the first place, I determine which particular shallow cues are indicative of discourse relations, and especially of the semantics of those relations: punctuation, some syntactic structures and, most of all, discourse markers. These cues will be taken as the empirical evidence to determine the nature of discourse relations. I determine that the information provided by these cues is reliable only at a short scope, more concretely, at inter- and intra-sentential scope. This is the scope of the local structures within which we can identify content-rich relations; these local structures will be related to each other by default, content-poor discourse relations. Basically following Knott (1996), the meaning of relations is divided into distinct features, so that the global meaning of a relation is constituted compositionally as a conglomerate of such features. I propose that these features are organized in dimensions grouping heterogeneous meanings, so that a given feature can co-exist with features in other dimensions, but not with features in its own dimension. Each of these dimensions is in turn organized along a range of markedness, with the interesting property that the least marked meaning is taken by default to characterize unmarked cases, where no shallow cue can be found that provides any information about the organization of discourse in that part of the text. Also following Knott (1996), I have induced a small set of such features based on the evidence provided by discourse markers. The main difference with Knott is that I only exploit highly grammaticalized discourse markers, because my aim is to establish basic, uncontroversial areas where discursive meaning is clearly distinct. Thus, I only provide a coarse-grained description of the semantics of discourse relations.
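As a schematic illustration of this two-dimensional organization, a relation label could be represented as follows. This is only a sketch, not the representation actually used in this work; in particular, which value acts as the unmarked default in each dimension is left open here, since that is established later in the thesis.

```python
from dataclasses import dataclass
from typing import Optional

# The two dimensions of discursive meaning described above.
STRUCTURAL = {"continuation", "elaboration"}
SEMANTIC = {"revision", "cause", "equality", "context"}

@dataclass
class DiscourseRelation:
    """A relation label combines at most one feature per dimension.
    None stands for the unmarked case, to be read as the default
    (least marked) meaning of that dimension."""
    structural: Optional[str] = None
    semantic: Optional[str] = None

    def __post_init__(self):
        if self.structural is not None and self.structural not in STRUCTURAL:
            raise ValueError(f"unknown structural feature: {self.structural}")
        if self.semantic is not None and self.semantic not in SEMANTIC:
            raise ValueError(f"unknown semantic feature: {self.semantic}")

# A marker may contribute a feature in each dimension, but never two
# features from the same dimension (illustrative assignment only):
rel = DiscourseRelation(structural="elaboration", semantic="cause")
```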
These features have been used to characterize a small lexicon of prototypical discourse markers in Catalan, Spanish and English, presented in Appendix A.

In Chapter 5 empirical evidence is provided to support the theoretical proposal of the two previous chapters. I present two experiments with human judges and two experiments with automatic procedures.

In the first experiment, 35 human judges had to summarize texts by removing intrasentential fragments, ranging from words to whole clauses. We show that, for this task, humans do not agree much more than could be expected by chance. However, when we model the texts by the discourse organization presented before, we find that in some cases judges show a very important ratio of agreement, to the point that the task can be said to be reproducible. We find that the model that best explains the behaviour of humans, in terms of inter-judge agreement, is a combination of the semantics of the relation that a given discourse segment holds with other segments and some surface features of its context of occurrence, like the occurrence of punctuation signs.

In the second experiment, three judges identify relations between minimal discourse units, and they label them by their semantics. A significant degree of agreement between judges is achieved, even though two of them are naive. We believe that this experiment supports the claim that the proposed meanings are basic, because they can capture the intuitions of naive judges.

Then, we show that discourse segments can be reliably identified by automatic procedures, with varying degrees of precision and recall, depending on the kind of information exploited. We find that discourse markers are the most informative and reliable source of information to identify segments. In the last experiment we show that previously unseen discourse markers can be automatically identified by lexical acquisition techniques, exploiting some properties that characterize them in large amounts of text.

Finally, in Chapter 6 I conclude this thesis by summarizing its contributions and sketching lines of research that are left for future work.

Note: all examples of English in this thesis have been taken from the British National Corpus, unless otherwise indicated. Examples of Catalan and Spanish have been taken from the electronic version of the newspaper El Periódico de Catalunya. Translations of the examples can be found in the separate booklet accompanying this thesis.

CHAPTER 2 Approaches to text summarization

In this chapter we present our field of research, automatic text summarization (AS). We analyze the current state of the art, and highlight two major issues of concern to us: the fact that the evaluation of summaries is still an unsolved problem, and the fact that solid summarization systems have to be based on robust, general and principled methods that capture basic insights about the organization of language at textual level. Then, we propose an approach to AS that meets these requirements, based on an analysis of the discursive level of texts. In order to be robust, this analysis is limited to what can be obtained by shallow NLP techniques, more concretely, to the capabilities of NLP for Catalan and Spanish. Since it is aimed to be insightful, general and principled, it is based on a general model of the organization of discourse, developed in the rest of the thesis. The structure of the chapter is as follows. In the first place, in Section 2.1 we discuss the interest of automatic text summarization as a research field, and we analyze the problem from a computational perspective. We discuss how, despite many clarification efforts, automatic text summarization still seems to be more ill-defined than many other tasks in NLP. In Section 2.2 we address the problem of evaluating the quality of different summaries for the same text(s), which is currently a burning issue in text summarization, because the achievements of different systems cannot be properly assessed, and thus working efforts cannot be properly directed. We make a critique of the methods that are currently considered most reliable for the evaluation of summaries, totally manual and based on the analysis of multiple human summaries for the same text(s). We argue that these methods could benefit from an analysis of the source text that identifies its basic content units and associates them with a relevance score based on their linguistic form. Section 2.3 is an overview of different approaches to automatic text summarization over the last fifty years. We do not aim to be exhaustive, but we analyze those approaches that introduce different perspectives into the problem, contributing to a better understanding. We provide a taxonomy of the main approaches to summarization, with examples of current summarization systems, and we try to analyze the achievements and limitations of each of these approaches. We conclude this review of work in the area by trying to synthesize what makes a good approach to summarization.


In Section 2.4 we argue that the analysis of the discursive organization of texts complies with the properties that make an approach to summarization successful. We show that a certain representation of discourse can be successfully integrated within robust, insightful approaches to text summarization, improving the final results. We exemplify this in two AS applications: an integration of the proposed analysis of discourse with lexical chains to obtain general-purpose summaries, and an e-mail summarizer one of whose modules provides an analysis of discourse structure that significantly contributes to the quality of the final summary. Finally, in Section 2.5 we describe our general framework to obtain an analysis of discourse structure for text summarization via shallow NLP techniques.

2.1 Introduction: factors in automatic text summarization

In recent years there has been a growing interest in automatic text summarization (AS), because it is expected that it can provide useful solutions to growing needs in our society, like synthesizing huge amounts of information so as to diminish the amount of processing left for humans to do, locating information relevant to a concrete information need, discriminating relevant from irrelevant information and, in general, discovering information which may be relevant. Besides this applied interest, there are more fundamental research interests in AS. The process whereby people summarize texts seems to involve core human cognitive abilities, because in order to summarize a text, a person seems to have both its informational content and its textual and documental properties available (Kintsch and van Dijk 1983; Endres-Niggemeyer 1998), which requires complex understanding processes. It can be expected that AS systems are based on a model of the way people look for information, understand it, integrate it to fulfill their information needs, and eventually produce a synthetic (linguistic) representation of what they consider relevant information. If so, AS systems can be seen as devices to support or refute theoretical claims about crucial aspects of the way people understand texts. From the point of view of Artificial Intelligence, AS is a complex, high-level task, so it poses a challenge in itself, with the added interest of an immediate practical application. More concretely, within NLP it requires the integration of many existing NLP tools and resources in an emulation of human behaviour. The optimal solution should take advantage of all necessary resources at hand, but no more than necessary; it should be as close as possible to the nature of the object, and it should address the intended objective as closely as possible.

2.1.1 Text summarization from a computational perspective

There have been many efforts devoted to analyzing the problem of AS and systematizing the process (Gladwin et al. 1991; Endres-Niggemeyer et al. 1993; Sparck-Jones 1993; Sparck-Jones 1997; Hovy and Marcu 1998; Sparck-Jones 1999b; Mani and Maybury 1999; Radev 2000; Hahn and Mani 2000; Sparck-Jones 2001; Hovy 2001; Baldwin et al. 2000; Mani 2001); here we describe the aspects that have been most generally considered essential for a good understanding of the problem. From a computational perspective, “A summary is a reductive transformation of a source text into a summary text by extraction or generation” (Spärck-Jones 2001). The problem of summarization has traditionally been decomposed into three phases:

– analyzing the input text to obtain a text representation,
– transforming it into a summary representation,
– and synthesizing an appropriate output form to generate the summary text.

Much of the early research on AS has been devoted to the analysis of the source text, probably for two main reasons: first, a summary that fulfills an information need can be built by simple concatenation of literal fragments of the source text; second, generating natural language expressions automatically requires huge NLP resources, which are far beyond the capabilities of most of the current research groups in the area even today. As it is, the analysis of the source text can currently be considered the main factor influencing the quality of the resulting summary. Indeed, once the crucial aspects of texts have been identified and characterized, virtually all heuristics for selection of relevant information will perform well. Besides the internal steps for building a summary, many contextual factors affect the process of summarization, mostly concerning the kind and number of documents to be summarized, the medium of communication, the expected format of the summary, the intended audience, etc. Effective summarizing requires an explicit and detailed analysis of context factors, since summaries are configured by the information need they have to fulfill. Sparck-Jones (1999a) distinguishes three main aspects of summaries: input, purpose and output, which we develop in what follows.

2.1.1.1 Input Aspects

The features of the text to be summarized crucially determine the way a summary can be obtained. The following aspects of input are relevant to the task of TS:

Document Structure Besides textual content, heterogeneous documental information can be found in a source document, for example, labels that mark headers, chapters, sections, lists, tables, etc. If it is well systematized and exploited, this information can be useful for analyzing the document. For example, Kan (2003) exploits the organization of medical articles in sections to build a tree-like representation of the source. Teufel and Moens (2002b) systematize the structural properties of scientific articles to assess the contribution of each textual segment to the article, in order to build a summary from that enriched perspective. However, it can also be the case that the information it provides is not the target of the analysis. In this case, document structure has to be removed in order to isolate the textual component of the document.

Domain Domain-sensitive systems are only capable of obtaining summaries of texts that belong to a pre-determined domain, with varying degrees of portability. The restriction to a certain domain is usually compensated by the fact that specialized systems can apply knowledge-intensive techniques which are only feasible in controlled domains, as is the case of the multidocument summarizer SUMMONS (McKeown and Radev 1995), specialized in summaries in the terrorism domain applying complex techniques. In contrast, general-purpose systems are not dependent on information about domains, which usually results in a shallower approach to the analysis of the input documents. Nevertheless, some general-purpose systems are prepared to exploit domain-specific information. For example, the meta-summarizer developed at Columbia University (Barzilay et al. 1999; Barzilay et al. 2001; Hatzivassiloglou et al. 1999; Hatzivassiloglou et al. 2001; McKeown et al. 2002) applies different summarizers for different kinds of documents: MULTIGEN (Barzilay et al. 1999; McKeown et al. 1999) is specialized in simple events, DEMS (Schiffman et al. 2001) (with the bio configuration) deals with biographies, and for the rest of documents, DEMS has a default configuration that can be resorted to.

Specialization level A text may be broadly characterized as ordinary, specialized, or restricted, in relation to the presumed subject knowledge of the source text readers. This aspect can be considered the same as the domain aspect discussed above.

Restriction on the language The language of the input can be general language or restricted to a sublanguage within a domain, purpose or audience. It may be necessary to preserve the sublanguage in the summary.

Scale Different summarizing strategies have to be adopted to handle different text lengths. Indeed, the analysis of the input text can be performed at different granularities, for example, in determining meaning units. In the case of news articles, sentences or even clauses are usually considered the minimal meaning units, whereas for longer documents, like reports or books, paragraphs seem a more adequate unit of meaning. Also the techniques for segmenting the input text into these meaning units differ: for shorter texts, orthography and syntax, or even discourse boundaries (Marcu 1997a), indicate significant boundaries, whereas for longer texts topic segmentation (Kozima 1993; Hearst 1994) is more usual.

Media Although the main focus of summarization is textual summarization, summaries of non-textual documents, like videos, meeting records, images or tables, have also been undertaken in recent years. The complexity of multimedia summarization has prevented the development of wide-coverage systems, which means that most summarization systems that can handle multimedia information are limited to specific domains or textual genres (Hauptmann and Witbrock 1997; Maybury and Merlino 1997). However, research efforts also consider the integration of information from different media (Benitez and Chang 2002), which allows a wider coverage of multimedia summarization systems by exploiting different kinds of documental information collaboratively, like metadata associated with video records (Wactlar 2001).

Genre Some systems exploit typical genre-determined characteristics of texts, such as the pyramidal organization of newspaper articles, or the argumentative development of a scientific article. Some summarizers are independent of the type of document to be summarized, while others are specialized in some types of documents: health-care reports (Elhadad and McKeown 2001), medical articles (Kan 2003), agency news (McKeown and Radev 1995), broadcast fragments (Hauptmann and Witbrock 1997), meeting recordings (Zechner 2001), e-mails (Muresan et al. 2001; Alonso et al. 2003a), web pages (Radev et al. 2001), etc.

Unit The input to the summarization process can be a single document or multiple documents, either simple text or multimedia information such as imagery, audio or video (Sundaram 2002).

Language Systems can be language-independent, exploiting characteristics of documents that hold cross-linguistically (Radev et al. 2003; Pardo et al. 2003), or else their architecture can be determined by the features of a concrete language. This means that some adaptations must be carried out in the system to deal with different languages. As an additional improvement, some multi-document systems are able to deal simultaneously with documents in different languages (Chen 2002; Chen et al. 2003).

2.1.1.2 Purpose Aspects

Situation TS systems can perform general summarization or else they can be embedded in larger systems, as an intermediate step for another NLP task, like Machine Translation, Information Retrieval or Question Answering. As the field evolves, more and more efforts are devoted to task-driven summarization, to the detriment of a more general approach to TS. This is due to the fact that underspecification of the information needs poses a major problem for the design and evaluation of the systems. As will be discussed in Section 2.4.2.2.1, evaluation is a major problem in TS. Task-driven summarization presents the advantage that systems can be evaluated with respect to the improvement they introduce in the final task they are applied to.

Audience In case a user profile is accessible, summaries can be adapted to the needs of specific users, for example, to the user's prior knowledge of a given subject. Background summaries assume that the reader's prior knowledge is poor, and so extensive information is supplied, while just-the-news summaries convey only the newest information on an already known subject. Briefings are a particular case of the latter, since they collect representative information from a set of related documents.

Usage Summaries can be sensitive to particular uses: retrieving the source text (Kan et al. 2001), previewing a text (Leuski et al. 2003), refreshing the memory of an already read text, sorting...

2.1.1.3 Output Aspects

Content A summary may try to represent all relevant features of a source text or it may focus on some specific ones, which can be determined by queries, subjects, etc. Generic summaries are text-driven, while user-focused (or query-driven) ones rely on a specification of the user's information need, like a question or key words. Related to the kind of content that is to be extracted, different computational approaches are applied. The two basic approaches are top-down, using information extraction techniques, and bottom-up, more similar to information retrieval procedures. Top-down is used in query-driven summaries, when criteria of interest are encoded as a search specification, and this specification is used by the system to filter or analyze text portions. The strategies applied in this approach are similar to those of Question Answering. On the other hand, bottom-up is used in text-driven summaries, when generic importance metrics are encoded as strategies, which are then applied over a representation of the whole text.

Format The output of a summarization system can be plain text, or else it can be formatted. Formatting can be targeted to many purposes: conforming to a pre-determined style (tags, organization in fields), improving readability (division in sections, highlighting), etc.

Style A summary can be informative, if it covers the topics in the source text; indicative, if it provides a brief survey of the topics addressed in the original; aggregative, if it supplies information not present in the source text that completes some of its information or elicits some hidden information (Teufel and Moens 2002b); or critical, if it provides an additional evaluation of the summarized text.

Production Process The resulting summary text can be an extract, if it is composed of literal fragments of text, or an abstract, if it is generated. The type of summary output desired can be relatively polished, for example, if text is well-formed and connected, or else more fragmentary in nature (e.g., a list of key words). There are intermediate options, mostly concerning the nature of the fragments that compose extracts, which can range from topic-like passages, a paragraph or several paragraphs long, to clauses or even phrases. In addition, some approaches perform editing operations on the summary, overcoming the incoherence and redundancy often found in extracts, but at the same time avoiding the high cost of an NL generation system. Jing and McKeown (2000) apply six re-writing strategies to improve the general quality of an extract-based summary by editing operations like deletion, completion or substitution of clausal constituents.

Surrogation Summaries can stand in place of the source as a surrogate, or they can be linked to the source (Kan et al. 2001; Leuski et al. 2003), or even be presented in the context of the source, e.g., by highlighting source text (Lehmam and Bouvet 2001).

Length The targeted length of the summary crucially affects the informativeness of the final result. This length can be determined by a compression rate, that is to say, a ratio of the summary length with respect to the length of the original text. Traditionally, compression rates range from 1% to 30%, with 10% as a preferred rate for article summarization. In the case of multidocument summarization, though, length cannot be determined as a ratio to the original text(s), so the summary always conforms to a pre-determined length. Summary length can also be determined by the physical context where the summary is to be displayed. For example, in the case of delivery of news summaries to hand-helds (Boguraev et al. 2001; Buyukkokten et al. 2001; Corston-Oliver 2001), the size of the screen imposes severe restrictions on the length of the summary. Headline generation is another application where the length of summaries is clearly determined (Witbrock and Mittal 1999; Daumé III et al. 2002). In very short summaries, coherence is usually sacrificed for informativeness, so lists of words are considered acceptable (Kraaij et al. 2002; Zajic et al. 2002).

2.2 Evaluation of summaries: a wall to be climbed

In any NLP task, evaluation is crucial to assess the achievements of different approaches, and so to assess which lines of research are worth pursuing because they lead to improvements in performance. Indeed, in tasks like information retrieval or machine translation, evaluation is a crucial aspect of progress in the field. In text summarization, evaluation is still an unsolved problem, and so the whole area suffers from a lack of general assessment to direct research, since it is unclear which approaches provide good results. The first large-scale initiative for the evaluation of automatic summaries was carried out in 1998, with the SUMMAC contest. Different AS systems submitted monodocument summaries for the same set of texts. The summaries were evaluated extrinsically and intrinsically. While intrinsic evaluation focused on the summaries themselves, extrinsic evaluation assessed how well the summary could act as a surrogate for the original text, by assessing how well human judges could answer a set of relevant questions about a text, comparing performance with the original text, with a human summary and with an automatic summary.

The results from these two evaluations were confusing, but what became clear was that evaluation of summaries was a non-trivial task and that much effort had to be devoted to it. Extrinsic evaluation was soon discarded as a valid method for evaluation, because the bias introduced by the final application may obscure insights about the nature of summaries. Besides, extrinsic evaluation is very costly. For these reasons, in the large-scale effort for the evaluation of summarization systems within the framework of the DUC contest (http://nist.gov/duc), evaluation of summaries was carried out intrinsically, applying a standard gold-standard approach, where results from automatic systems were compared to summaries produced by humans using the SEE environment. But intrinsic evaluation is not free of problems. Indeed, it was found that there was much variability in the ratings provided by different judges for the same summary, regardless of whether it was automatic or human. It was seen that general-purpose summarization is by itself an ill-defined task, because the usual scenario is that there are many good summaries for a given text. So, the usual gold-standard approach is not applicable, because there is more than one possible correct summary to be compared with. Since this problem became clear, efforts have focused on capturing the goodness of automatic summaries taking into account many possible gold standards. Within the DUC contest, many summaries were produced for each text or set of texts to be summarized, and each summary was compared to one of them at random. The method of comparison is based on the SEE system (Summary Evaluation Environment, Lin 2001), and consists in dividing both human and automatic summaries into minimal units, establishing equivalences in meaning between units in the automatic and the human summary, and determining the coverage of the meaning in the human summary that is achieved by the automatic summary. It has to be noted that the units of human and automatic summaries are of different granularities: while the former are elementary discourse units (EDUs, Carlson, Marcu and Okurowski 2003), automatic summaries are split into sentences, which may contain more than one EDU. Even with this highly delimited evaluation process, Lin and Hovy (2002a) showed that there was great variability between the judgements produced by human judges about the same summary, which makes assessments of quality based on human judges unreliable. In contrast, Lin and Hovy (2003) also showed that string overlap correlated with human judgements when assessing similarity between human and automatic summaries, and created a tool to evaluate automatic summaries by n-gram overlap with human summaries, ROUGE (Lin 2004), based on the idea implemented in BLEU for the evaluation of automatic translations (Papineni, Roukos, Ward and Zhu 2001). Criticisms have been made of ROUGE because it is based on string overlap, which arguably fails to capture relations in meaning without a correspondence in form, like entailment, inclusion, synonymy, etc. Moreover, it is capable of distinguishing between clearly bad and clearly good summaries, but it cannot make finer-grained distinctions, so it often happens that the results of many systems are rated as of indistinguishable quality.
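For concreteness, the core idea behind this family of measures, recall-oriented n-gram overlap between a candidate summary and reference summaries, can be sketched as follows. This is a simplified illustration only, not the actual ROUGE implementation, which adds stemming, stopword handling and several scoring variants.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams (as tuples) over a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(candidate: str, references: List[str], n: int = 1) -> float:
    """Recall-oriented n-gram overlap of a candidate summary against one or
    more reference summaries: the ROUGE-N idea in its simplest form."""
    cand = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        # Clipped counts: a candidate n-gram cannot be credited more times
        # than it occurs in the reference.
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

print(ngram_recall("the legend was a copy of a story",
                   ["this urban legend was found to be a copy of a story"]))
```

Overlap scores of this kind separate clearly bad from clearly good summaries, but, as noted above, they struggle to make finer-grained distinctions.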
We tried to overcome this limitation in Alonso, Fuentes, Rodríguez and Massot (2004d), where we assessed the quality of automatic summaries for the DUC 2003 corpus by extrapolation from the scores that NIST assessors had manually assigned to the n most similar summaries for the same text. Similarity between automatic summaries and model summaries was based on unigram overlap, but arguably this method is able to capture more similarities in meaning than ROUGE for two main reasons: because the number of summaries that can be compared is bigger (we can compare with all the automatic summaries produced for DUC 2003) and because the summaries that are compared are all automatic, most of them extraction-based, and thus formed from parts of the source text. Since the source text provides a limited amount of source forms, the similarities in meaning of extractive summaries are expected to be more strongly correlated with similarities in form than those of human summaries, which can contain totally different forms to convey the same meaning. In the last years, two new approaches to the evaluation of summaries have been emerging, that proposed by van Halteren and Teufel (2003) and that of Nenkova and Passonneau (2004). Both are based on the comparison of content units that can be found in human summaries used as a gold standard, but there are two main differences with respect to the approach in SEE. First, information overlap is not determined as a proportion of the unit, but as a full match between the content of a unit in the automatic summary and in the human summaries; partial matches were the main source of disagreement between judges using SEE. Second, multiple human summaries are taken into account to build the inventory of units against which automatic summaries are compared. These recently proposed evaluation methods are promising in that they try to capture the goodness of automatic summaries by comparison not to one, but to many human summaries. van Halteren and Teufel (2003) and Nenkova and Passonneau (2004) try to identify basic units of content (factoids and summary content units, SCUs, respectively) by comparison of different human summaries for the same text. These basic content units are assigned a degree of relevance relative to their frequency of occurrence in the set of human summaries. Then, summaries are evaluated by identifying their units of content, mapping these units to the set of units established by comparison of human summaries, and then assigning them the corresponding relevance score. The overall relevance of a summary is the sum of the relevance of all its units. These methods present many advantages. First, they are content-based, a method of comparison arguably superior to string-based comparison. Then, they try to capture the vagueness inherent in summarization by taking into account not only one, but many summaries. The gold standard against which summaries are compared is formed by all the units of content that can be found in the human summaries that are taken into account. Moreover, the agreement between judges is much higher with these methods than for the evaluation based on SEE, probably because they establish one-to-one mappings between units in the gold standard and units in the summary to be evaluated, while in SEE the relation is a percentage.
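The weighting scheme just described can be sketched in a few lines. The following is only a schematic rendering of the idea shared by factoid- and SCU-based evaluation, with content units reduced to opaque identifiers; it is not the actual annotation or scoring procedure of either proposal, and the identifiers in the example are hypothetical.

```python
from collections import Counter
from typing import List, Set

def content_unit_weights(human_summaries: List[Set[str]]) -> Counter:
    """Weight each content unit by the number of human summaries it occurs in."""
    weights = Counter()
    for units in human_summaries:
        weights.update(units)
    return weights

def summary_score(candidate_units: Set[str], weights: Counter) -> int:
    """Overall relevance of a candidate summary: the sum of the weights of its
    content units; units never seen in the human summaries get weight 0."""
    return sum(weights[u] for u in candidate_units)

# Toy example with hand-labelled content units (hypothetical identifiers).
humans = [
    {"legend_is_copy", "ventosa_work"},
    {"legend_is_copy", "orleans_1969"},
    {"legend_is_copy"},
]
weights = content_unit_weights(humans)
print(summary_score({"legend_is_copy", "orleans_1969"}, weights))  # 3 + 1 = 4
```

Note how a unit absent from every human summary contributes nothing to the score, which is precisely the coverage problem discussed next.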
Despite these interesting properties, we see two main shortcomings in these methods: they are totally manual, and thus costly and prone to subjectivity, and they are fully based on summaries, totally neglecting the features of the source text. This last issue seems to be at the root of one of the main problems of these methods: the inventory of content units that can be induced from various summaries is not exhaustive, that is, previously unseen content units can be found in new summaries and, worryingly, in the summaries to be evaluated. van Halteren and Teufel (2004) state that the number of content units in the gold standard formed by the content units found in all the summaries for a given text grows as the number of summaries increases, and does not seem to reach a stable point even when as many as 10 or 20 summaries are taken into account. Thus, it is probable that a summary to be evaluated contains units to which no relevance score can be assigned.

An alternative to these methods was presented by Amigó et al. (2004). They present an empirical study of the similarity between “good” summaries, produced by humans, and “bad” summaries, produced automatically. They found that the best measure to distinguish these two kinds of summaries is indeed based on content overlap, instead of word overlap, but, in contrast with the methods presented so far, it envisages a method to obtain these concepts automatically, thus drastically reducing the cost of evaluation, and also reducing the possibility of an unbounded increase in the number of content units to be compared. For these reasons, this method is taken as a reference for evaluation in the next DUC contests. However, this method still has the drawback of relying entirely on human-produced abstracts, and therefore it cannot properly evaluate the goodness of previously unseen content units.

We argue that a basic discursive analysis of the source text, identifying minimal discourse units and some of their relations, could improve the assessment of the goodness of a given summary with respect to the source text. This additional source of information could contribute to minimizing the two major problems pointed out above.

In the first place, a basic discursive analysis that relies on shallow textual cues can be automated with a good degree of reliability, thus making the task of identifying content units and their relations lighter and arguably more systematic, although also more error-prone. Moreover, a basic discursive analysis is useful to identify more content units than those occurring in the summaries. This kind of analysis is also capable of associating very rough relevance scores with some of these units, if there is shallow evidence to do so. Interestingly, it is irrelevant units that are more often marked with shallow evidence, so this kind of analysis would provide precisely the evidence complementary to that provided by the human-summary-based gold standard: a set of content units associated with low relevance. Thus, this kind of analysis could be very useful to identify and characterize bad summaries, which, in a context where new techniques are tried out, is as useful as identifying good summaries.
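The frequency-weighted content-unit scoring described above can be sketched as follows. This is only an illustration of the idea behind factoids/SCUs: the content-unit labels, the matching of units to summaries, and the normalisation by an ideal summary are assumptions made for the example, since identifying and matching units is precisely the manual part of these methods.

    # Sketch of frequency-weighted content-unit scoring in the spirit of
    # factoids / SCUs: a unit's weight is the number of human summaries it
    # appears in; a summary's score sums the weights of the units it covers.
    from collections import Counter

    def unit_weights(human_summaries):
        """human_summaries: list of sets of content-unit labels."""
        weights = Counter()
        for units in human_summaries:
            weights.update(units)
        return weights

    def summary_score(summary_units, weights):
        """Sum of weights of covered units, normalised by the best score
        achievable by a summary covering the same number of units."""
        covered = sum(weights.get(u, 0) for u in summary_units)
        ideal = sum(w for _, w in weights.most_common(len(summary_units)))
        return covered / ideal if ideal else 0.0

    # Toy example with invented content-unit labels:
    humans = [{"quake-hit-city", "many-victims"},
              {"quake-hit-city", "aid-sent"},
              {"quake-hit-city", "many-victims", "aid-sent"}]
    w = unit_weights(humans)
    print(summary_score({"quake-hit-city", "aid-sent"}, w))   # 1.0: high-weight units
    print(summary_score({"aid-sent", "president-visit"}, w))  # 0.4: one unseen unit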

2.3 State of the art in automatic text summarization

In this section, we will present previous work in the field of automated text summarization, a very active field of research which experienced an important growth in the late 1990s and the beginning of the 2000s. As a result, many comparative studies can be found in the literature (Paice 1990; Zechner 1997; Sparck-Jones 1999a; Hovy and Marcu 1998; Tucker 1999; Radev 2000; Maybury and Mani 2001). The SUMMAC and DUC contests also provide a good overview of the most innovative working systems over the years.

We will only discuss here those works that made a qualitative contribution to the progress of the field. There are several ways in which one can characterize different approaches to text summarization. In Alonso et al. (2003c) we presented a comparison between three possible classifications of text summarization systems (Mani and Maybury 1999; Tucker 1999 and Alonso 2001), and classified a number of summarization systems according to each of them.

Here we will provide a brief overview of the classification presented in (Alonso 2001), updated with recent systems. This classification is based on the kind of information exploited to build the summary, and will be useful to illustrate our point that successful approaches to summarization are precisely those that exploit general linguistic mechanisms. Three main kinds of approaches are distinguished: those exploiting lexical aspects of texts, those working with structural information and those trying to achieve a deep understanding of texts. However, most current systems are not limited to a single feature of the text, but produce summaries by combining heterogeneous kinds of information, as discussed in Section 2.3.4.

2.3.1 Lexical information

These approaches exploit the information associated with the words in texts. Some of them are very shallow, relying on the frequency of words, but others apply lexical resources to obtain a deeper representation of texts. Beginning with the shallowest, the following main trends can be distinguished.

Word Frequency approaches assume that the most frequent words in a text are the most representative of its content, and consequently fragments of text containing them are more relevant. Most systems apply some kind of filter to leave out of consideration those words that are very frequent but not indicative, for example, by means of the tf*idf metric or by excluding the so-called stop words, words with grammatical function but little meaning content (a minimal sketch of this idea is given at the end of this list). The oldest known AS system exploited this technique (Luhn 1958b), but it is also the basis of many current commercial systems, like those integrated in widely used applications.

Domain Frequency tries to determine the relevance of words by first assigning the document to a particular domain. Domain-specific words have a prior relevance score, which serves as a comparison ground to adequately evaluate their frequency in a given text. This approach is not very widespread, since the process of determining topics is costly and involves important decisions on the design of the system, as shown by its best known example, SUMMARIST (Lin 1998).

Concept Frequency abstracts from mere word-counting to concept-counting. By use of an electronic thesaurus or WordNet, each word in the text is associated to a more general concept, and frequency is computed on concepts instead of particular words.

Cue words and phrases can be considered as indicators of the relative relevance or non-relevance of fragments of text with respect to the others. This approach has never been used as the sole basis of a system, but many systems use the information of cue words to improve the relevance score assigned to units, as Edmunson (1969) originally did.

Chains can be built from lexical items which are related by conceptual similarity according to a lexical resource such as a thesaurus or WordNet (lexical chains), or by identity, if they co-refer to the same entity (co-reference chains). The fragments of text crossed by the most chains, by the most important chains or by the most important parts of chains can be considered the most representative of the text, as shown in the classic approach of Barzilay (1997).

A common assumption of these approaches is that repeated information is a good indicator of importance.
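As a minimal sketch of this family of approaches (not any particular system), the following ranks sentences by the document frequency of their non-stop words; the stop-word list and tokenization are deliberately simplistic.

    # Luhn-style frequency scoring: sentences are ranked by the frequency of
    # their (non-stop) words in the whole document.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "was"}

    def frequency_summary(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]
        freq = Counter(words)

        def score(sentence):
            tokens = [w for w in re.findall(r"\w+", sentence.lower())
                      if w not in STOP_WORDS]
            return sum(freq[w] for w in tokens) / (len(tokens) or 1)

        ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
        # Restore original order so the extract reads like the source text.
        return [s for s in sentences if s in ranked]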

2.3.2 Structural information

A second direction in AS tries to exploit information from texts as structured entities. Since texts are structured in different dimensions (documental, discursive, conceptual), different kinds of structural information can be exploited. Beginning with the shallowest:

Documental Structure exploits the information that texts carry in their format, for example, headings, sections, etc. This approach has proved extremely useful for highly structured domains, like medical articles, as shown in Kan (2003), where a whole structure of topics can be induced based mostly on the information provided by section headings and lexical repetition in subdomains of medicine. Teufel and Moens (2002a) also exploit this information, among other kinds, to obtain a rhetorical analysis of scientific articles in the area of computational linguistics.

Textual Structure Some positions in a text systematically contain the most relevant information, for example, the beginning paragraph of news stories. These positions are usually genre- or domain-dependent, but automatic procedures can be exploited to identify them, as shown in (Lin and Hovy 1997).

Conceptual structure The chains mentioned in lexical approaches can be considered as a kind of conceptual structure.

Discursive Structure can be divided into two main lines: linear or narrative, and hierarchical or rhetorical. The first tries to account for satisfaction-precedence-like relations among pieces of text; the second explains texts as trees where fragments of text are related to each other by virtue of a set of rhetorical relations, mostly asymmetric, as used for text summarization by Ono, Sumita and Miike (1994a) and Marcu (1997b).

2.3.3 Deep understanding

Some approaches try to achieve understanding of the text in order to build a summary. Two main lines can be distinguished:

Top-down approaches try to recognize pre-defined knowledge structures in texts, for example, templates or frames. This is the case of the system SUMMONS (McKeown and Radev 1995), which tries to fill MUC-like templates, and of some multidocument summarization systems, which try to fill templates for pre-defined kinds of events, like MULTIGEN (Barzilay et al. 1999; McKeown et al. 1999), specialized in simple events, DEMS (Schiffman et al. 2001), specialized in biographies, or GLEANS (Marcu et al. 2002), which has different configurations for different kinds of events.

Bottom-up approaches try to represent texts as highly conceptual constructs, such as scenes. Others apply fragmentary knowledge structures to cue parts of the text (Kan and McKeown 1999), and then build a complete representation out of these small parts.

2.3.4 Approaches combining heterogeneous information

The approaches that seem to produce the best summaries according to DUC evaluations are those that combine heterogeneous sources of information to obtain a representation of the text, including virtually all of the techniques above that are within the NLP capabilities of each research group. In some cases, these approaches integrate deep representations of texts, obtained by performing knowledge-rich analyses, but in many other cases a combination of simple, knowledge-poor analyses (Lin and Hovy 2002b) seems to produce results of a quality comparable to that of knowledge-rich systems, thus presenting a rather interesting effort-to-results ratio.

This approach has many interesting properties. For example, most of the systems carry out a study of the contribution of different kinds of knowledge to the quality of the final summaries. Many also take advantage of machine learning techniques to (partially) determine the best weighting of the different features. Moreover, integrating many different aspects of documents, even if one of them is privileged above the others, guarantees a complete representation of the source text, in accordance with its intrinsic complexity.

The first approach to integrate heterogeneous information to obtain summaries is Edmunson (1969), who enhances the work of Luhn (1958a) by incorporating equally simple techniques. It naturally integrates the textual dimension with the documental one, and also handles different linguistic levels. He envisages a more complex treatment of the object (the text) to achieve an improved objective (the summary), but he does not increase the complexity of the strategy, as was usual at the time, by creating a single, more complex model, but by integrating different simple models in a natural manner. This is a particular case of a strategy that has proved successful in AS in general, because of its simplicity and flexibility and because it seems able to capture insightful properties of text. In several systems (Kupiec et al. 1995; Teufel and Moens 2002b; Hovy and Lin 1999; Mani and Bloedorn 1999) different summarization methods are combined to determine the relevance of units: title-based relevance scoring, cue phrases, position in the text, and word frequency. As the field progresses, summarization systems tend to use more and deeper knowledge, but the tendency to exploit heterogeneous information to determine the relevance of units collaboratively still remains. So, the tendency is that heterogeneous kinds of knowledge are merged into increasingly enriched representations of the source text(s).

These enriched representations allow for adaptability of the final summary to new summarization challenges, such as multidocument, multilingual and even multimedia summarization. In addition, such a rich representation of text is a step towards generation or, at least, pseudo-generation by combining fragments of the original text. Good examples of this are (McKeown et al. 2002; Lin and Hovy 2002c; Daumé III et al. 2002; Lal and Rueger 2002; Harabagiu and Lacatusu 2002), among others.
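A minimal sketch of this kind of linear combination of simple evidence is given below; the feature names and weights are invented for illustration, whereas real systems normalise each score and tune the weights, possibly by machine learning.

    # Edmunson-style combination of simple evidence: each sentence gets a
    # weighted sum of cue-word, title-overlap, position and frequency scores.
    def combined_score(sentence_features, weights=None):
        """sentence_features: dict with one (already normalised) score per
        evidence type; weights: relative importance of each evidence type."""
        if weights is None:
            weights = {"cue": 1.0, "title": 1.0, "position": 1.0, "frequency": 1.0}
        return sum(weights[k] * sentence_features.get(k, 0.0) for k in weights)

    # Toy usage: a first-paragraph sentence with a cue phrase and title words.
    print(combined_score({"cue": 1.0, "title": 0.5, "position": 1.0, "frequency": 0.3}))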

2.3.5 Critical overview of summarization techniques

No strong conclusions can be drawn from this overview because, as explained in the previous section, the standard evaluation methods up to the last large-scale evaluation effort, in DUC 2004, were incapable of making the fine-grained distinctions that are needed to analyze the contribution of different techniques to summary quality. We are hopeful that recent proposals for evaluation by content overlap will make it possible to evaluate systems in DUC 2005.

We will consider the usage of a summarization technique across systems as indicative of its utility for producing high-quality summaries. We can see that some techniques, even if they are very simple (word frequency, presence of cue words), have been exploited in a large number of systems, while others that looked very promising a priori (template filling, entity-driven information extraction) have been applied on very rare occasions.

The techniques that have been applied most often are usually very simple: the most widespread approaches are those that exploit shallow textual correlates of relevance, like the distribution of words in texts (frequencies, lexical chains, etc.), location in relevant positions (titles, beginning or end of texts, beginning of paragraphs) or the presence of certain cue words indicating relevance or lack of relevance. The main limitation of these approaches is that they cannot obtain a deep representation of the text, but only shallow indications of the relevance of particular textual units. Approaches exploiting shallow evidence as a mere indicator of relevance do not contribute to significant progress in the area because they do not increase our understanding of the problem and do not provide an insightful representation of source texts, even if they may produce improvements in particular summaries. However, as we discuss below, these limitations can be overcome by relating shallow evidence to a general theory of text organization, which makes it possible to obtain richer representations of text by accumulating various kinds of shallow evidence.

In contrast, many of the techniques that have not lived beyond the prototype stage aim to obtain a deep representation of the text, as is the case of approaches based on information extraction techniques to fill a pre-defined template. Information extraction approaches have the advantage that texts can be represented as a scene, where one can clearly identify relevant entities, their attributes and their relations. Their main disadvantages are that all knowledge has to be pre-defined, which represents a very important development effort and, more worryingly, a substantial lack of flexibility, since previously unseen cases cannot be treated. Thus, these approaches suffer from restricted coverage and a lack of robustness. In contrast with the limitations of shallow techniques, these limitations are rooted in the essence of the method, and cannot be overcome.

One last critical remark on previous work on text summarization concerns the identification of content units. As said before, the analysis of the source text is a key step to improve the selection of content for a summary. Most approaches represent texts as a set of content units associated with a relevance score. In some cases, relations between units are also identified, indicating relevance but also coherence relations between units.
Determining which units to identify is then a crucial aspect of summarization, but most approaches are focused on the assignment of relevance and neglect the question of determining what a unit might be.

Thus, we can conclude that a number of approaches have been applied to achieve relevant and readable text summaries, but that it is difficult to assess their contribution to the quality of the final summaries because the evaluation of summaries is an issue still to be solved. In any case, it seems clear that satisfactory approaches to AS exploit general properties of documents to guarantee the robustness of the strategy, while leaving room to gain deeper insight into the configuration of texts.

We have found that shallow approaches are interesting because they seem to provide results of a quality comparable to that of approaches exploiting deep knowledge techniques, while having wider coverage and being more flexible in treating unseen cases. Although it is true that shallow techniques can only provide shallow representations of the source text, it is also true that they are progressively gaining in depth, and it can be expected that they will provide much more insight into the structure of language in the future.

From our point of view, the progress of shallow techniques is crucially subject to the principledness with which they are worked out. If evidence of relevance is collected but not organized, the kind of information one can obtain from it will always be the same. In contrast, if we relate shallow cues to a theory of the organization of texts, the information they provide can be increasingly enriched, because it can be integrated with other evidence, so that they can collaboratively configure progressively richer representations of the source text. Lack of principledness also affects the characterization of content units in the representation of the source text.

As a conclusion, it seems clear that a sensible approach to summarization consists in exploiting shallow evidence in a principled way. In the following section we argue that shallow evidence related to a theory of discourse meets these requirements.

2.4 Discourse for text summarization

As follows from what we have discussed so far, satisfactory approaches to AS exploit general properties of texts. We argue that a discursive representation can be especially useful, for two main reasons: because the units of content for summarization seem to be best characterized at the discourse level, rather than at the clausal level, and because the relevance of content units depends on their configuration within the representation of the text as a stand-alone structure, if no other factor (such as a query) is provided to assess relevance. The aspects of a discursive representation of text that we consider useful for summarization are:

– identify content units at the discourse level. In the case of an extractive summarizer, they are the basic units of which summaries are built, and they may also be useful for summary evaluation, as discussed in Section 2.2.

– determine the relative relevance of each content unit with respect to its local context and, eventually, to the whole document. Relevance assessment contributes to determining which units contain the information that is to be conveyed in the summary.

– identify coherence relations between units. These relations are of use to direct or constrain the selection of units in order to obtain a coherent summary. As we will see in Section 2.4.2.2, in some contexts summaries are required to be strongly coherent.

Relevance and coherence are crucial aspects of a representation of text for AS. The diverse approaches to summarization rely on definitions of these concepts directly determined by the kind of information they can obtain from a text: lexical items, presence of Named Entities, presence of cue phrases, co-reference chains, position of elements in the document, etc. For example, Salton, Singhal, Mitra and Buckley (1997) define relevance as directly proportional to the number of nodes departing from a certain meaning unit in a graph where nodes are units and arcs are lexical relations between the lexical items contained in discourse units. A general definition of these concepts is provided by Mani (2001):

Salience (or relevance) is the weight attached to information in a document, reflecting both the document content as well as the relevance of the document information to the application. [...] Coherence [is] the way the parts of the text gather together to form an integrated whole.

(Mani 2001, pg. 11)

In a structural representation such as the one we propose in Chapter 3, the relevance of a unit is relative to its structural context of occurrence. For example, in a tree-like structure like that proposed by Rhetorical Structure Theory (RST, Mann and Thompson, described below), proximity to the root is indicative of relevance. Units that are highly related to other units are also more relevant, according to general principles of graph theory as applied to summarization by Salton, Singhal, Mitra and Buckley (1997). If the discourse structure has labelled relations, signalling the semantics of the relations between discourse constituents, the label of the relations can also be factored into the final relevance score for each segment.

As for coherence, and the closely related concept of cohesion, they have often been considered among the basic features that constitute text. Halliday and Hasan (1976), one of the basic references for the formal treatment of texts, define cohesion as the set of possibilities that exist in the language for making text hang together. Coherence has often been described by resorting to world knowledge and complex reasoning mechanisms, but also with respect to the textual mechanisms that realize it in the linguistic surface form. For AS, coherence relations enhance the process of content selection, providing for the felicity or discursive well-formedness of the resulting summary text. Given a selection of units based on relevance, coherence relations specify which units are required to be included in the summary content, regardless of their own relevance.
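The structural notion of relevance just described can be illustrated with a small sketch: in an RST-like tree, units reachable through nucleus links stay close to the root and score higher, while material reachable only through satellite links is demoted. The nested-tuple tree format is an assumption made for this example, not a format used by any of the systems discussed.

    # Depth-based salience in an RST-like tree: nuclei keep the depth of their
    # parent, satellites are demoted one level; closer to the root = higher score.
    def salience(node, depth=0, scores=None):
        """node is either a leaf string or (relation, nucleus_subtree, satellite_subtree)."""
        if scores is None:
            scores = {}
        if isinstance(node, str):                 # leaf discourse segment
            scores[node] = -depth
            return scores
        _relation, nucleus, satellite = node
        salience(nucleus, depth, scores)          # nucleus keeps the parent's depth
        salience(satellite, depth + 1, scores)    # satellite is demoted one level
        return scores

    tree = ("elaboration", ("cause", "S1", "S2"), "S3")
    print(salience(tree))   # {'S1': 0, 'S2': -1, 'S3': -1}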

2.4.1 Previous work in exploiting discourse for text summarization

There have been various approaches that exploit the discursive organization of text to improve the relevance and quality of final summaries. Many approaches exploit discursive properties in an unprincipled way, for example, by removing all subordinate clauses, by including the sentence immediately preceding a sentence introduced by a discourse marker, etc. As we have said, we feel that these approaches do not contribute to significant progress in the area because they do not increase our understanding of the problem and do not provide an insightful representation of source texts, even if they may produce improvements in particular summaries.

There have been quite a few approaches that base summaries on a representation of the discursive aspect of texts. Some of these approaches exploit deep understanding of texts, others are based on shallow evidence. We will especially focus on the latter, because, as we have said, shallow evidence seems more adequate to address the task of text summarization, and it is also closer to our own approach.

The most popular theory of text organization underlying summarization approaches has been RST. Ono, Sumita and Miike (1994b) are the first known researchers to have applied RST to analyze a text as a hierarchical tree, where the relevance of discourse units is relative to their proximity to the root of the tree. They propose that a summary of the text can be obtained by exploiting one of the properties of the nuclearity principle of rhetorical relations à la RST: the fact that, in an asymmetric relation between two discourse units, where one, the nucleus, is more important than the other, the satellite, the least important unit can be removed while preserving the main aim of the text. However, their proposal has not been pursued further.

Some approaches produce summaries exploiting rich representations of discourse that are based on deep analyses of text produced by the NLP tools owned by private corporations. This is the case of Corston-Oliver (1998) at Microsoft and Polanyi et al. (2004) at Xerox. Corston-Oliver (1998) applies an RST approach to represent the structure of text, and applies this representation to summarization. The rhetorical analysis of texts is based on the deep analysis provided by Microsoft language processing tools, which is said to reach the level of propositional analysis and even makes it possible to establish inferential relations between clauses. Polanyi et al. (2004) produce summaries by applying a set of heuristics to a representation of discourse based on the Linguistic Discourse Model (Polanyi 1988; Polanyi 1996). As in Corston-Oliver (1998), this analysis is based on a structure of discourse that builds directly upon sentential syntax and semantics, and contemplates different kinds of discourse structuring devices, such as basic hierarchical and linear structuring, genre-determined schemata and interactional frameworks. It relies on the language processing tools of Xerox PARC.

Another interesting approach to summarization based on deep analyses of the discursive structure of text is that of Hahn (1990). Text is represented as a hierarchy of thematic units obtained by analyzing it with a knowledge base of the domain. Then, summaries can be obtained by retrieving different granularities of this structure. This approach is clearly domain-dependent, since it strongly relies on the existence of a knowledge base to obtain the hierarchical representation of the thematic units of the text. A similar approach is proposed by Kan (2003), with the main difference that Kan also provides a methodology to induce the knowledge base from the texts themselves.

Some other approaches rely on shallow cues to obtain a discursive representation of text; these are closer to what we consider an approach useful for introducing progress in the field of text summarization and are also closer to our own approach, so we will discuss them in more depth.

Marcu (1997b) is the best known application of the RST-based approach to summarization, because he provides a thorough description of the procedures he applies and also of the resources he exploits. He implements an RST-based discourse parser for English that makes no use of world knowledge, but instead is fully based on shallow textual evidence, namely, discourse markers and word form co-occurrence. The aim of this parser is to obtain a discourse structure that conforms to the well-formedness requirements of RST (Mann and Thompson 1988). Marcu's parser is totally text-based: it does not depend on domain-dependent sources of knowledge, but exploits the general properties of discourse markers. This guarantees the robustness of the system, but, in contrast to other shallow approaches, partial analyses are not allowed in case a complete one cannot be provided. The architecture of the parser leaves room for incorporating heterogeneous information on discourse structure; for example, word co-occurrence is used to identify cohesion-based relations between discourse units.
However, the fact that the approach is relation-based makes it difficult to incorporate heterogeneous discourse information, because any new information must be expressed in terms of relations and the new relations must be comparable to the previous ones. This makes it difficult, for example, to incorporate such useful information for discourse processing as information structure, topics, etc. Also, the constraints imposed by RST itself have been criticised for lack of descriptive adequacy, as will be argued in the next chapter.

Soricut and Marcu (2003) follow the philosophy of Marcu (1997b) but apply a machine learning approach to build the discourse parser. A corpus with syntactic and RST-based rhetorical annotation (Carlson, Marcu and Okurowski 2001) serves as the basis for the learning process. The performance of the machine learning parser is better than that of the manually built one, probably because rules are based on the objective weighting of a number of examples, instead of the subjective impression of the analyst, and also because the machine learning parser is able to exploit much more information about the examples, more concretely, the lexico-syntactic structure of the training examples.

Schilder (2002) implements an underspecified version of SDRT (Lascarides and Asher 1993) to obtain a representation of discourse that he argues can be useful to summarize texts. A two-step strategy combining deep and shallow approaches is applied. First, rich structures are obtained with a hand-crafted, grammar-based analysis, but only for those parts of text where discursive clues are found. Then, less informative structures, but covering the whole text, are derived with a more robust strategy based on word forms. More concretely, discourse particles provide information to apply the rules of an underspecified version of SDRT that determine immediate dominance, dominance, precedence and equivalence relations between minimal discourse constituents, obtaining rhetorical schemata of local scope.

Higher-level discourse structure should be treated with more complex knowledge (intentions, beliefs, plans, genres, etc.). Since neither this knowledge nor the ability to deal with it is available, Schilder obtains higher-order relations between discourse units via a topicality measure, an adaptation of the tf*idf metric: the relations between the local schemata are constrained by this topicality measure in order to determine the relative relevance of each schema.

In essence, Schilder's approach exploits the same kind of information as Marcu (1997b); in short, rich structure is derived from available discourse markers, and word-based measures account for the relations between those units with no discourse markers. However, Schilder's and Marcu's parsers crucially differ in their ability to integrate heterogeneous discourse knowledge. The stepwise analysis of Schilder (2002) allows the progressive enrichment of the resulting structure, while keeping relative independence between the kinds of information to be taken into account.

Finally, we would like to mention an interesting approach to discourse analysis, although it has not yet been applied to summarization. Its interest lies in the fact that it establishes a very principled connection between shallow evidence and an insightful theoretical framework. DLTAG (Forbes, Miltsakaki, Prasad, Sarkar, Joshi and Webber 2003) incorporates the syntactic and semantic properties of discourse markers upon sentential analysis to go beyond the sentential level and reach what they call low-level discourse structure and discourse semantics.

The function of discourse markers in structure derivation is to establish relations between textual entities of various sizes (ranging from clauses to the whole previous text) and kinds (from syntax-based units to abstract objects (Asher 1993)). Some discourse markers, like although or because, take their arguments structurally, and some others, like however or in that case, take them anaphorically. For the second kind, it is problematic to determine the referential argument with precision. Within DLTAG, no specific machinery for the analysis of discourse is developed; rather, the existing mechanisms at the clausal level are adapted to the discursive level. This represents an important economy of development that allows high portability of the system. However, as in the case of the two parsers presented so far, the coverage of the system is critically limited to the number of discourse markers that have been accounted for.

In sum, many approaches have exploited an analysis of discourse in AS systems to improve the quality of the resulting summaries. Some of these approaches are based on shallow textual clues. Discourse markers are among the most used of these clues, because they are highly informative of the relations between discourse units. Nevertheless, other kinds of information, like topicality measures or lexico-syntactic structures, improve the accuracy of the analysis.

2.4.2 Assessing the utility of shallow discourse analysis for text summarization

In this section we will describe two AS systems that incorporate the discourse analysis that will be presented in the following chapters. The description of the performance of these systems will try to illustrate in detail how a certain representation of the discursive structure of texts, even if it is fully based on shallow evidence, can significantly improve the quality of the resulting summaries.

2.4.2.1 Integrating cohesion and coherence for text summarization

In Alonso and Fuentes (2002) and Alonso and Fuentes (2003) we carried out a series of experiments integrating cohesive properties of text with coherence relations, to obtain an adequate representation of text for automatic summarization. More concretely, a summarizer based on lexical chains (Fuentes and Rodríguez 2002) was enhanced with rhetorical and argumentative structure obtained via discourse markers.

2.4.2.1.1 Previous Work on Combining Cohesion and Coherence

Traditionally, two main components have been distinguished in the discursive structure of a text: cohesion and coherence. As defined by Halliday and Hasan (1976), cohesion tries to account for relationships among the elements of a text. Four broad categories of cohesion are identified: reference, ellipsis, conjunction, and lexical cohesion. On the other hand, coherence is represented in terms of relations between text segments, such as elaboration, cause or explanation. Mani (2001) argues that an integration of these two kinds of discursive information would yield significant improvements in the task of text summarization.

Corston-Oliver and Dolan (1999) showed that eliminating discursive satellites, as defined by Rhetorical Structure Theory (RST, Mann and Thompson 1988), yields an improvement in the task of Information Retrieval. Precision is improved because only words in discursively relevant text locations are taken into account as indexing terms, while traditional methods treat texts as unstructured bags of words.

Some analogous experiments have been carried out in the area of TS. Brunn et al. (2001) and Alonso and Fuentes (2002) claim that the performance of summarizers based on lexical chains can be improved by ignoring possible chain members if they occur in irrelevant locations such as subordinate clauses, and therefore considering chain candidates only in main clauses. However, syntactic subordination does not always map onto discursive relevance. For example, in clauses expressing finality or dominated by a verb of cognition, like Y said that X, the syntactically subordinate clause X is discursively nuclear, while the main clause is less relevant (Verhagen 2001). In Alonso and Fuentes (2002), we showed that identifying and removing discursively motivated satellites yields an improvement in the task of text summarization. Nevertheless, we will show that a more adequate representation of the source text can be obtained by ranking chain members according to their position in the discourse structure, instead of simply eliminating them.
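The alternative between removing satellite material and merely down-weighting it can be sketched as follows; the segment labels and the weighting factor are illustrative assumptions, not the actual implementation of the summarizer.

    # Chain candidates found in satellite segments are either discarded or
    # down-weighted, instead of being counted like nuclear occurrences.
    def weight_candidates(segments, demote=0.5, remove=False):
        """segments: list of (candidate_words, is_satellite) pairs.
        Returns {word: weight}, removing or down-weighting satellite occurrences."""
        weights = {}
        for words, is_satellite in segments:
            for w in words:
                if is_satellite and remove:
                    continue
                factor = demote if is_satellite else 1.0
                weights[w] = weights.get(w, 0.0) + factor
        return weights

    segments = [(["earthquake", "victims"], False),     # nuclear segment
                (["earthquake", "magnitude"], True)]    # satellite segment
    print(weight_candidates(segments))                  # satellites only down-weighted
    print(weight_candidates(segments, remove=True))     # satellites removed altogether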

2.4.2.1.2 Summarizing with Lexical Chains

The lexical chain summarizer follows the work of Morris and Hirst (1991) and Barzilay (1997). As can be seen in Figure 2.1 (left), the text is first segmented at different granularity levels (paragraph, sentence, clause), depending on the application. To detect chain candidates, the text is morphologically analysed, and the lemma and POS of each word are obtained. Then, Named Entities are identified and classified in a gazetteer. For Spanish, a simplified version of Palomar et al. (2001) extracts co-reference links for some types of pronouns, dropping the constraints and rules involving syntactic information.

Semantic tagging of common nouns is performed with is-a relations by attaching EuroWordNet (Vossen 1998) synsets to them. Named Entities are semantically tagged with instance relations by a set of trigger words, like former president, queen, etc., associated with each of them in a gazetteer. Semantic relations between common nouns and Named Entities can be established via the EWN synset of the trigger words associated with each entity.

Chain candidates are common nouns, Named Entities, definite noun phrases and pronouns, with no word sense disambiguation. For each chain candidate, three kinds of relations are considered, as defined by Barzilay (1997):

– Extra-strong between repetitions of a word.

– Strong between two words connected by a direct EuroWordNet relation.

– Medium-strong if the path length between the EuroWordNet synsets of the words is longer than one.

Being based on general resources and principles, the system is highly parametrisable. It is relatively language-independent, because it may obtain summaries for texts in any language for which there is a version of WordNet and tools for POS tagging and Named Entity recognition and classification. It can also be parametrised to obtain summaries of various lengths and at different granularity levels.

As for relevance assessment, some constraints can be set on chain building, like determining the maximum distance between WN synsets of chain candidates for building medium-strong chains, or the type of chain merging when using gazetteer information. Once lexical chains are built, they are scored according to a number of heuristics that consider characteristics such as their length, the kind of relation between their words and the point of the text where they start. Textual Units (TUs) are ranked according to the number and type of chains crossing them, and the TUs that are ranked highest are extracted as a summary. This ranking of TUs can be parametrised so that a TU is assigned a different relative score if it is crossed by a strong chain, by a Named Entity chain or by a co-reference chain. For a better adaptation to textual genres, heuristic schemata can be applied.

However, linguistic structure is not taken into account for scoring the relevance of lexical chains or TUs, since the relevance of chain elements is calculated irrespective of other discourse information. Consequently, the strength of lexical chains is exclusively based on lexical information. This partial representation can even be misleading when trying to discover the relevant elements of a text. For example, a Named Entity that is nominally conveying a piece of news in a document can present a very tight pattern of occurrence without being actually relevant to the aim of the text. The same applies to other linguistic structures, such as recurring parallelisms, examples or adjuncts. Nevertheless, the relative relevance of these elements is usually marked structurally, either by sentential or discursive syntax.
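A minimal sketch of this chain-based ranking of Textual Units is given below; the chain strengths, weights and compression rate are invented parameters, stand-ins for the configurable parameters of the actual system.

    # Each TU is scored by the number and strength of the chains crossing it,
    # and the top-ranked TUs are extracted in source-text order.
    CHAIN_WEIGHTS = {"extra-strong": 3.0, "strong": 2.0, "medium-strong": 1.0}

    def rank_textual_units(tus, chains, compression=0.3):
        """tus: list of TU identifiers; chains: list of (strength, set_of_tus_crossed)."""
        scores = {tu: 0.0 for tu in tus}
        for strength, crossed in chains:
            for tu in crossed:
                scores[tu] += CHAIN_WEIGHTS[strength]
        n = max(1, int(len(tus) * compression))
        selected = sorted(tus, key=lambda tu: scores[tu], reverse=True)[:n]
        return [tu for tu in tus if tu in selected]    # keep source order

    chains = [("strong", {"TU1", "TU3"}), ("extra-strong", {"TU3"}),
              ("medium-strong", {"TU2"})]
    print(rank_textual_units(["TU1", "TU2", "TU3", "TU4"], chains))   # ['TU3']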

2.4.2.1.3 Incorporating Rhetorical and Argumentative Relations

The lexical chain summarizer was enhanced with discourse structural information, as can be seen in Figure 2.1 (right). Following the approach of Marcu (1997b), a partial representation of discourse structure was obtained by means of the information associated to a discourse marker lexicon. Discourse markers are described along four dimensions:

– matter: three different kinds of subject-matter meaning are distinguished, namely causality, parallelism and context.

– argumentation: three argumentative moves are distinguished: continuation, elaboration and revision.

– structure: following the notion of right frontier (Webber 1988; Polanyi 1988), symmetric and asymmetric relations are distinguished.

– syntax: describes the relation of the discourse marker with the rest of the elements at the discourse level.

The information stored in this discourse marker lexicon was used for identifying inter- and intra-sentential discourse segments (Alonso and Castellón 2001) and the discursive relations holding between them. Discourse segments were taken as Textual Units by the lexical chain summarizer, thus allowing a finer granularity level than sentences.

Two combinations of discourse marker descriptive features were used, in order to account for the interaction of different structural information with the lexical information of lexical chains. On the one hand, nucleus-satellite relations were identified by the combination of the matter and structure dimensions of discourse markers. This rhetorical information yielded a hierarchical structure of text, so that satellites are subordinate to nuclei and are accordingly considered less relevant. On the other hand, the argumentative line of the text was traced via the argumentation and structure discourse marker dimensions, so that segments were tagged with their contribution to the continuation of the argumentation.

These two kinds of structural analyses are complementary. Rhetorical information is mainly effective at discovering local coherence structures, but it is unreliable when analyzing macro-structure. As Knott et al. (2001) argue, a different kind of analysis is needed to track coherence throughout a whole text; in their case the alternative information used is focus, while we have opted for argumentative orientation. Argumentative information accounts for a higher-level structure, although it does not provide much detail about it.
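The following sketch illustrates how such a four-dimensional marker lexicon can be consulted; the Spanish entries and the nuclearity rule are illustrative assumptions, not the content of the actual lexicon.

    # A toy discourse marker lexicon using the four descriptive dimensions
    # (matter, argumentation, structure, syntax) and a simple lookup that
    # derives the signalled relation and a rough nuclearity judgement.
    MARKER_LEXICON = {
        "porque":      {"matter": "causality",   "argumentation": "continuation",
                        "structure": "asymmetric", "syntax": "subordinating"},
        "sin embargo": {"matter": "context",     "argumentation": "revision",
                        "structure": "symmetric",  "syntax": "adverbial"},
        "por ejemplo": {"matter": "parallelism", "argumentation": "elaboration",
                        "structure": "asymmetric", "syntax": "adverbial"},
    }

    def relation_for(marker):
        """Return the relation signalled by a marker and whether the segment it
        introduces should be treated as a satellite (less relevant)."""
        entry = MARKER_LEXICON.get(marker.lower())
        if entry is None:
            return None
        is_satellite = (entry["structure"] == "asymmetric"
                        and entry["argumentation"] == "elaboration")
        return {"matter": entry["matter"], "move": entry["argumentation"],
                "satellite": is_satellite}

    print(relation_for("por ejemplo"))   # elaboration, treated as a satellite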

2.4.2.1.4 Experiments

A number of experiments were carried out in order to test whether taking into account the structural status of the textual unit where a chain member occurs can improve the relevance assessment of lexical chains (see Figure 2.2). Since the discourse marker lexicon and the evaluation corpus were available only for Spanish, the experiments were limited to that language. Linguistic pre-processing was performed with the CLiC-TALP system (Atserias et al. 1998a; Arévalo et al. 2002).

For the evaluation of the different experiments, the evaluation software MEADeval (MEAD) was used to compare the obtained summaries with a golden standard. From this package, the usual precision and recall measures were selected, as well as the simple cosine. Simple cosine (simply cosine from now on) was chosen because it provides a measure of similarity between the golden standard and the obtained extracts, overcoming the limitations of measures that depend on concrete textual units.

The corpus used for evaluation was created within the Hermes project1 to evaluate automatic summarizers for Spanish by comparison to human summarizers. It consists of 120 news agency stories2 of various topics, ranging from 2 to 28 sentences and from 28 to 734 words in length, with an average length of 275 words per story.

1 Information about this project is available at http://terral.ieec.uned.es/hermes/
2 For the experiments reported here, one-paragraph news stories were dropped, resulting in a final set of 111 news stories.
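A sketch of the simple cosine measure follows; it illustrates the idea of the metric, not the MEADeval implementation, and the example strings are invented.

    # Simple cosine between word-frequency vectors of two extracts: similar
    # content is rewarded even when the extracted units are not identical.
    import math
    from collections import Counter

    def simple_cosine(text_a, text_b):
        va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = (math.sqrt(sum(c * c for c in va.values()))
                * math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    print(simple_cosine("el terremoto causó numerosas víctimas",
                        "numerosas víctimas del terremoto"))   # about 0.67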

Figure 2.1: Integration of discursive information: lexical chains (left) and discourse structural (right)

Figure 2.2: Experiments to assess the impact of discourse structure on lexical chain members

To avoid the variability of human-generated abstracts, the human summarizers built an extract-based golden standard. Paragraphs were chosen as the basic textual unit because they are self-contained meaning units; in most cases, paragraphs contained a single sentence. Every paragraph in a story was ranked from 0 to 2 according to its relevance. 31 human judges summarized the corpus, so that at least 5 different evaluations were obtained for each story. Golden standards were obtained coming as close as possible to 10% of the length of the original text (19% average compression). The two main shortcomings of this corpus are its small size and the fact that it belongs to the journalistic genre. However, we know of no other corpus for summary evaluation in Spanish.

The performance of the lexical chain system with no discourse structural information was taken as the baseline to improve upon. Fuentes and Rodríguez (2002) report on a number of experiments to evaluate the effect of different parameters on the results of lexical chains. To keep comparability with the golden standard, and to adequately calculate precision and recall measures, paragraph-sized TUs were extracted at a 10% compression rate. Some parameters were left unaltered for the whole of the experiment set: only strong or extra-strong chains were built, no information from definite noun phrases or trigger words was used, and only short co-reference chains were built. Results are presented in Table 2.1. The first column in the table shows the main parameters governing each trial: simple lexical chains, lexical chains successively augmented with proper noun and co-reference chains, and finally giving a special weighting to the first TU because of the global document structure applicable to the journalistic genre. Two heuristic schemata were tested, as described below.

                                              Precision   Recall   Cosine
Lead                                              .95        .85      .90
SweSum                                            .90        .81      .87
HEURISTIC 1
  Lexical Chains                                  .82        .81      .85
  Lexical + PN Chains                             .85        .85      .88
  Lexical + PN + coRef Chains                     .83        .83      .87
  Lexical + PN + coRef Chains + 1st TU            .88        .88      .90
HEURISTIC 2
  Lexical Chains                                  .71        .72      .79
  Lexical + PN Chains                             .73        .74      .81
  Lexical + PN + coRef Chains                     .70        .71      .78
  Lexical + PN + coRef Chains + 1st TU            .82        .82      .86

Table 2.1: Performance of the lexical chain summarizer

Heuristic 1 ranks as most relevant the first TU crossed by a strong chain, while heuristic 2 ranks highest the TU crossed by the maximum number of strong chains. An evaluation of SweSum (SweSum 2002), a summarization system available for Spanish, is also provided as a comparison ground. Trials with SweSum were carried out with the default parameters of the system. In addition, the first paragraph of every text, the so-called lead summary, was taken as a dummy baseline.

As can be seen in Table 2.1, the lead achieves the best results, with almost the best possible score. This is due to the pyramidal organisation of the journalistic genre, which causes the most relevant information to be placed at the beginning of the text. Consequently, any heuristic assigning more relevance to the beginning of the text will achieve better results in this kind of genre. This is the case for the default parameters of SweSum and for heuristic 1. However, it must be noted that the lexical chain summarizer produces results with high cosine and low precision, while SweSum yields high precision and low cosine. This means that, while the textual units extracted by the summarizer are not identical to the ones in the golden standard, their content is not dissimilar. This seems to indicate that the summarizer successfully captures content-based relevance, which is genre-independent. Consequently, the lexical chain summarizer should be able to capture relevance when applied to non-journalistic texts. This seems to be supported by the fact that the margin by which cosine exceeds precision is about four points larger for heuristic 2 than for heuristic 1, which seems more genre-dependent.

Unexpectedly, co-reference chains cause a decrease in the performance of the system. This may be due to their limited length, and also to the fact that both full forms and pronouns are given the same score, which does not capture the difference in relevance signalled by the difference in form.

2.4.2.1.5 Results of the Integration of Heterogeneous Discursive Information

Structural discursive information was integrated with only those parameters of the lexical chain summarizer that exploited general discursive information. Heuristic 1 was not considered because it is too genre-dependent. No co-reference information was taken into account, since it does not seem to yield any improvement. The results of integrating lexical chains with discourse structural information can be seen in Table 2.2.

                                              Precision   Recall   Cosine
Sentence Compression + Lexical Chains
  + PN Chains                                     .74        .75      .70
  + PN Chains + 1st TU                            .86        .85      .76
Rhetorical Information + Lexical Chains
  + PN Chains                                     .74        .76      .82
  + PN Chains + 1st TU                            .83        .84      .86
Rhetorical + Argumentative + Lexical Chains
  + PN Chains                                     .79        .80      .84
  + PN Chains + 1st TU                            .84        .85      .87

Table 2.2: Results of the integration of lexical chains and discourse structural information

Following the design sketched in Figure 2.2, the performance of the lexical chain summarizer was first evaluated on a text where satellites had been removed. As stated by Brunn et al. (2001) and Alonso and Fuentes (2002), removing satellites slightly improves the relevance assessment of the lexical chainer (by one point).

Secondly, discourse coherence information was incorporated. Rhetorical and argumentative information were distinguished, since the former identifies mainly unimportant parts of the text, while the latter identifies both important and unimportant ones. Identifying satellites instead of removing them yields only a slight improvement in recall (from .75 to .76), but significantly improves cosine (from .70 to .82). When argumentative information is provided, an improvement of at least .05 is observed in all three metrics in comparison to removing satellites.

As can be expected, ranking the first TU higher results in better measures, because of the nature of the genre. When this parameter is set, removing satellites outperforms, in precision, the results obtained by taking into account discourse structural information. However, this may also be due to the fact that when the text is compressed, TUs are shorter, and a higher number of them can be extracted within the fixed compression rate. It must be noted, though, that recall does not drop for these summaries.

Lastly, intra-sentential and sentential satellites of the best summary obtained by lexical chains were removed, increasing the compression of the resulting summaries from an average of 18.84% for lexical chain summaries to 14.43% for sentence-compressed summaries. Moreover, since sentences were shortened, readability was increased, which can be considered as a further factor of compression. However, these summaries could not be evaluated with the MEADeval package because no golden standard was available for textual units smaller than paragraphs: precision and recall measures could not be calculated for summaries with removed satellites, because they could not be compared with the golden standard, which consists only of full sentences.

2.4.2.1.6 Discussion

The presented evaluation successfully shows the improvements brought about by integrating cohesion and coherence, but it has two weak points. The first is the small size of the corpus and the fact that it represents a single genre, which does not allow for safe generalisations. The second is the fact that the evaluation metrics fall short in assessing the improvements yielded by the combination of these two kinds of discursive information, since they cannot account for quantitative improvements at granularity levels different from the unit used in the golden standard, and therefore a full evaluation of summaries involving sentence compression is precluded. Moreover, qualitative improvements in general text coherence cannot be captured, nor their impact on summary readability. As stated by Goldstein et al. (1999), “one of the unresolved problems in summarization evaluation is how to penalize extraneous non-useful information contained in a summary”. We have tried to address this problem by identifying text segments which carry non-useful information, but the presented metrics do not capture this improvement.

We have shown that the collaborative integration of heterogeneous discursive information yields an improvement in the representation of the source text, as can be seen from the improvements in the resulting summaries. Although this enriched representation does not outperform a dummy baseline consisting of taking the first paragraph of the text, we have argued that the resulting representation of text is genre-independent and succeeds in capturing content relevance, as shown by the cosine measures.

Since the properties exploited by the presented system are text-bound and follow general principles of text organization, they can be considered to have language-wide validity. This means that the system is domain-independent, though it can be easily tuned to different genres. Moreover, the system is portable to a variety of languages, as long as the required knowledge sources are available: basically, shallow tools for morpho-syntactic analysis, a version of WordNet for building and ranking lexical chains, and a lexicon of discourse markers for obtaining a certain discourse structure.

2.4.2.2 Combining heterogeneous information for e-mail summarization

In Alonso et al. (2003a), Alonso et al. (2003b) and Alonso et al. (2004b) we presented Carpanta, an e-mail summarization system that applies a knowledge-intensive approach to obtain highly coherent summaries. An on-line demonstration can be found at http://www.lsi.upc.es/~bcasas/carpanta/demo.html. Robustness and portability are guaranteed by the use of general-purpose NLP, but the system also exploits language- and domain-dependent knowledge. It is evaluated against a corpus of human-judged summaries, reaching satisfactory levels of performance.

Carpanta is the e-mail summarization system within project Petra, funded by the Spanish Government (CICyT TIC-2000-0335). The global goal of the project is to develop an advanced and flexible system for unified message management, which enhances the mobility, usability and confidentiality levels of current systems, while keeping compatibility with the main current computer-phone integration platforms. Petra is related to the European project Majordome - Unified Messaging System (E!-2340), whose aim is to introduce a unified messaging system that allows users to access e-mail, voice mail, and faxes from a common “in-box”, and also to make all types of messages accessible directly and/or remotely via the Internet or mobile devices. The project includes three work lines:

1. Integration of phone, internet and fax services, including e-mail access through an HTML client (webmail), which will increase communication mobility and multimedia facilities. This line includes subgoals aiming at the development of a flexible modular architecture, which may present the same information in different forms through different devices (webmail, phone, fax).

2. Development of advanced oral interfaces based on speech recognition and understanding, and speaker verification.

3. Intelligent information management through the use of Natural Language Processing (NLP) techniques for text classification and summarization, as well as for information retrieval. This task includes the subgoals of advanced Named Entity recognition and coreference resolution, document filtering, categorization and retrieval, and text summarization, this last issue being especially relevant for oral interfaces to electronic mail systems.

The summarization module within Petra is Carpanta. It is currently working for Spanish, but portability to other languages is guaranteed by its modular architecture, with a language-independent core and separate modules exploiting language-dependent knowledge.

2.4.2.2.1 Problems of e-mail summarization Besides the problems common to the general task of summarization, e-mail summarization has its own idiosyncratic problems:

– noisy input (headers, tags, ...)
– linguistic well-formedness is far from guaranteed
– mixed properties of oral and written language
– multi-topic messages

Many scholars have studied relevant aspects of the e-mail register. They have mainly focused on the similarities and differences between oral language and texts (Yates and Orlikowski 1993; Ferrara et al. 1990), as well as on brand new intentionally-expressive devices, such as previous-message cohesion (Herring 1999), visual devices (Fais and Ogura 2001), simplified registers (Murray 2000) or internet-users' vocabulary. Nevertheless, in our opinion, those studies do not fully systematize the problem, as they disregard a factor that is important in the e-mail register: since users often write fast and not very reflectively, texts contain many non-intentional language mistakes. Other factors which are not considered either, probably because most studies have been carried out in monolingual English-speaking environments, are other kinds of intentional non-standard performance, such as a systematic lack of accents, or language shifts and interference. In a recent study, Climent et al. (2003) argue that, for their universe of study, more than 10% of the text in e-mails is made up of either non-intentional errors, intentional deviations from the written standard, or specific terminology. For Spanish, 3.1% of the words contain either performance or competence errors, another 3.3% are either language shifts or new forms of textual expressivity (such as orthographical innovations or, especially, systematic non-accentuation), and another 4.2% consist of specific terminology, thus words usually missing from many systems' lexicons. This supposes two important drawbacks for our purposes: (i) the task of the analyzing tools is severely endangered, as they have to deal with an extremely noisy input; and (ii) the eventual speech-synthesized output will also be poor or even partly hard to understand. In any case, such a bulk of unsystematic deviations from standard texts implies a barrier for high-quality, general-purpose NLP tools. As a consequence, very little work has been done on quality e-mail summarization. Tzoukermann et al. (2001) aim to capture the gist of e-mail messages by extracting salient noun phrases, using a combination of machine learning and shallow linguistic analysis.

2.4.2.2.2 Approach Considering the problems of e-mail summarization and the environment of the Petra project, summaries produced by Carpanta have the following properties:

oral output by telephone,
indicative summaries just give a hint of the content, to meet the severe restrictions of length imposed by the oral format,
coherent because the summary cannot be revised as easily as a written one (thus excluding a list-of-words approach),
extractive due to the limitations of general-purpose NLP tools,
knowledge-intensive combining analysis at different linguistic levels, IR techniques and IE strategies specific for e-mail, in order to build a robust system that is also capable of producing deep analyses.

2.4.2.2.3 Architecture of the System As can be seen in Figure 2.3, Carpanta is highly modular, which guarantees its portability to other languages. E-mail specific knowledge has a different status within the system, so that language-dependent modules can be updated and switched to address concrete needs (different languages, restricted domains), while language-independent strategies form part of the core processing stream. In addition to general-purpose NLP tools, the following e-mail specific resources were developed:

– a classification where each kind of e-mail is associated with its most adequate summary and summarization strategy (language-independent) (see Table 2.3)
– bags of words and expressions that signal different kinds of e-mail specific content (language-dependent):
  - greetings, farewells
  - reply, forward, attachment
  - bags of words signalling different kinds of relevance: personal involvement of the writer in the message, information exchange; also lack of relevance
– strategies to deal with these anchors and their associated content (language-independent)

The process by which e-mails are summarized is described in what follows, and exemplified in Figure 2.5.

Parse e-mail format. Messages undergo a pre-processing step to identify headers, greetings, visit cards, quoted text, and the body of text, which is further analyzed.

Linguistic analysis. First, the body of text is analyzed morphosyntactically (Atserias et al. 1998a) and chunks are identified (Atserias et al. 1998b). Then, discourse chunks, signalled by punctuation and discourse markers, are found (what we call segments). Finally, the salience of non-empty words is calculated according to the frequency of occurrence of their lemma.

Textual analysis. Three different kinds of textual relevance have been distinguished: lexic, structural and subjective. For each of these three aspects of e-mails, a global reliability score is obtained, taking into account how well each kind of information distinguishes relevant and non-relevant pieces of the e-mail. Then, relevance is also calculated with respect to meaning units, basically, discourse segments. Lexic relevance of a segment is directly proportional to the amount of frequent words in the segment and inversely proportional to the length of the segment. Structural relevance is assigned as a result of the interpretation of discursive relations between segments and between a segment and the whole text, by means of the information associated to discourse markers. Finally, subjective relevance is found when the segment contains any of a list of lexical expressions signalling subjectivity.

Documental analysis. Key words and expressions signalling information specific to e-mail (e.g., appointment, list, etc.) are detected by simple IE techniques, basically pattern-matching.
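To make the lexic relevance computation concrete, the sketch below scores a segment as directly proportional to the number of frequent lemmas it contains and inversely proportional to its length, as just described. It is only an illustrative reading of that description: the function names, the frequency threshold and the normalization are assumptions, not Carpanta's actual implementation.

from collections import Counter

def lemma_frequencies(segments):
    """Count how often each (non-empty) lemma occurs in the whole e-mail body."""
    counts = Counter()
    for segment in segments:
        counts.update(segment)          # each segment is a list of content lemmas
    return counts

def lexic_relevance(segment, counts, min_freq=2):
    """Score a segment: frequent lemmas increase it, segment length decreases it.

    Hypothetical formulation: proportion of lemmas in the segment whose frequency
    in the message reaches min_freq.
    """
    if not segment:
        return 0.0
    frequent = sum(1 for lemma in segment if counts[lemma] >= min_freq)
    return frequent / len(segment)

# Toy usage: three segments, already lemmatized and stripped of stop words.
segments = [["pasar", "aqui", "hablar", "rato"],
            ["encontrar", "dia", "hablar", "rato"],
            ["empezar", "buscar", "leer"]]
counts = lemma_frequencies(segments)
print([lexic_relevance(s, counts) for s in segments])   # [0.5, 0.5, 0.0]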

[Figure 2.3 appears here: a diagram of Carpanta's architecture, in which the incoming e-mail goes through pre-processing, analysis (linguistic, textual and documental), relevance assignment to meaning units, classification and summarization, supported by language-dependent bags of words (greetings, farewells, subjectivity, stop words, bonus words) and language-independent analyzers (morphological analyzer, phrasal chunker, discourse chunker), and yielding the summary.]

Figure 2.3: Architecture of Carpanta.

As a result of linguistic, textual and documental analysis, a set of meaning units is produced at different linguistic levels: words, chunks, segments and sentences, but also lines and paragraphs. Each unit is assigned a complex relevance score, with one value for each kind of information that is taken into account. Values for lexical, structural and subjective relevance are continuous, ranging from 0 to 1. Each unit is also assigned a binary relevance score for each kind of e-mail specific information: 1 if there is any clue signalling that kind of information in the unit, 0 if there is none.

Classification. The most adequate summarization strategy is determined by taking into account the characterizing features of each e-mail, as provided by the analysis module. The general schema followed by classification rules can be seen in Figure 2.4, and Table 2.3 shows the relation between e-mail features and summarization strategies.

Summarization. Then, the chosen summary is produced. The different kinds of summaries are described in Table 2.3.
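A minimal sketch of how such a complex relevance score might be represented is given below; the field names and the set of e-mail specific flags are assumptions made for illustration, not the data structures actually used in Carpanta.

from dataclasses import dataclass, field

@dataclass
class RelevanceScore:
    """Complex relevance score attached to one meaning unit (word, chunk, segment, ...)."""
    # Continuous scores in [0, 1], one per kind of linguistic information.
    lexical: float = 0.0
    structural: float = 0.0
    subjective: float = 0.0
    # Binary flags, one per kind of e-mail specific information.
    email_specific: dict = field(default_factory=lambda: {
        "appointment": 0, "attachment": 0, "forward": 0, "question": 0, "list": 0,
    })

segment_score = RelevanceScore(lexical=0.5, structural=0.75, subjective=0.5)
segment_score.email_specific["appointment"] = 1   # evidence of an appointment found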

summarization approach | summary | textual features | documental features
full mail | whole e-mail text | short (<30 words) | –
pyramidal | first compressed paragraph | – | –
lead | first compressed sentence | – | –
subject | subject of e-mail | lexical relevance | subject is relevant
appointment | seg. stating appointment | – | appointment
attachment | seg. describing attachment | – | attachment
forward | seg. describing forward | – | forward
question | seg. with question | – | question mark
list | seg. preceding the list | – | list
lexic | seg. with most relevant lexic | lexical relevance | –
structural | most structurally salient seg. | structural relevance | –
subjective | seg. with subjectivity | subjective relevance | –
textual | most relevant seg. summing textual evidence | – | –
textual + documental | most relevant seg. summing text and documental evidence | – | –

Table 2.3: Classification of summaries, characterizing features and summarization strategies.

2.4.2.2.4 Evaluation To tune and evaluate the performance of the system, the automatic summaries produced were compared with summaries produced by potential users of the system. 200 e-mails were summarized by 20 judges, so that each e-mail was summarized by at least 2 judges. To produce a summary, judges selected those words in the e-mail that they would consider a satisfactory summary of the message if they received it by telephone. The corpus of original and summarized e-mails is available at http://www.lsi.upc.es/~bcasas/carpanta/.

if strong genre evidence
    if strong linguistic evidence
        textual + documental
    else if evidence for a single genre
        specific strategy (list, question, attachment)
    else
        combination of genres
else if strong textual evidence
    textual
else
    lead

Figure 2.4: General schema followed by classification rules.
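As an illustration of the schema in Figure 2.4, the sketch below implements it as a simple rule cascade. The feature names and the way evidence is passed in are hypothetical; only the decision structure is taken from the figure.

def choose_strategy(features):
    """Pick a summarization strategy following the schema of Figure 2.4.

    `features` is a hypothetical dict of evidence gathered by the analysis modules.
    """
    if features.get("strong_genre_evidence"):
        if features.get("strong_linguistic_evidence"):
            return "textual + documental"
        genres = features.get("genres", [])      # e.g. ["list"] or ["question", "appointment"]
        if len(genres) == 1:
            return genres[0]                     # specific strategy (list, question, attachment, ...)
        return "combination of genres"
    if features.get("strong_textual_evidence"):
        return "textual"
    return "lead"

print(choose_strategy({"strong_genre_evidence": True, "genres": ["appointment"]}))
# -> "appointment"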

20% of the judged e-mails were left for evaluation, and the rest were used for tuning the system. The kappa measure (Landis and Koch 1977) was used to calculate pairwise agreement between judges. Kappa is a better measurement of agreement than raw percentage agreement because it factors out the level of agreement that would be reached by chance. When there is no agreement other than what would be expected by chance, k = 0; when agreement is perfect, k = 1. The obtained kappa values for agreement between judges ranged from 0.36 to 1, with a mean of 0.75 and a standard deviation of 0.17. Following (Carletta 1996), we can consider that kappa values above 0.7 indicate good stability and reproducibility of the results, so it can be said that it is possible to discriminate a good e-mail summary from a bad one, and that it is even possible to determine the best summary for a given e-mail. The goodness of automatic summaries was calculated as the agreement with the corresponding human summaries. As a global measure of the system's performance, we calculated the effect of considering the system as an additional human judge, with respect to average kappa agreement. Taking the 20% of the corpus left apart for evaluation, we obtained that the average kappa agreement between human judges was 0.74, and it decreased to 0.54 when the system was introduced as an additional judge. This indicates that the system does not perform as well as human judges, but still, a kappa value bigger than 0.4 indicates moderate agreement.
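For reference, the kappa statistic used here is standardly computed as

    k = (P(A) - P(E)) / (1 - P(E))

where P(A) is the observed proportion of agreement between two judges and P(E) the proportion of agreement expected by chance. This is the textbook formulation, stated here for convenience rather than taken from the cited works: k = 0 when observed agreement equals chance agreement (P(A) = P(E)), and k = 1 when agreement is perfect (P(A) = 1).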

2.4.2.2.5 Results and Discussion Figure 2.6 shows the results of comparing automatic summaries against the human-made summaries of the 40 e-mails reserved for evaluation. For each e-mail, automatic summaries were obtained using all of the applicable summarization strategies, for example: lexic, structural, appointment, attachment, etc. Then, kappa agreement was calculated between each automatic summary and every human summary

Original Message: Hola Laura, No te preocupes por el seminario de la semana que viene. Ya hemos encontrado un articulo y se lo hemos asignado al Oriol. Vamos a hacer una introducción a los HMM, que hay mucha gente que no sabe como funcionan y estan por todos lados. El articulo es muy introductorio, me parece, y si quieres te guardo una copia. Por lo del network de discurso: me parece fantastico. Si pasas por aquí podemos hablar un rato. Estoy en el despacho 300B y mi extensión es 2234. No se si pasas por aqui a veces. De todos modos hay más gente a que le va interessar el tema de discurso. En principio estoy aquí todas las mañanas y casi todas las tardes (menos los martes), pero por las tardes siempre hay alguna historia. Si quieres nos podemos encontrar un dia y hablar un rato. Yo no he hecho mucha cosa aún. Estoy empezando a buscar cosas y leer un poco, pero aún no tengo ninguna idea en concreto. Quizás me puedes recomendar algo que sea introductorio y que se lea rapido para entrar en la tematica. Hasta luego, Stefan

Relevant segments, with their relevance features:

– El artículo es muy introductorio, me parece, y si quieres te guardo una copia. (The paper is very introductory, it seems to me, and if you want I can save you a copy.)
  - structural relevance 0.5
  - subjectivity relevance 0.5 (parecer, “seem”; introductorio, “introductory”)
– Si pasas por aquí podemos hablar un rato. (If you pass by we can talk for a while.)
  - structural relevance 0.5
  - lexical relevance 0.8 (pasar, “pass”; aquí, “here”; hablar, “talk”; rato, “while”)
– Si quieres nos podemos encontrar un día y hablar un rato. (If you want we can meet some day and talk for a while.)
  - structural relevance 0.75
  - lexical relevance 0.5 (hablar, “talk”; rato, “while”)
  - subjectivity relevance 0.5 (quieres, “want”; podemos, “can”)
  - evidence of appointment (encontrar, “meet”)

Chosen Summary (11 words)

Si quieres nos podemos encontrar un día y hablar un rato.

Figure 2.5: Example of the summarization process by Carpanta.

Figure 2.6: Main features of the performance of different summarization strategies: average length with respect to the original e-mail text (compression), coverage over the test collection, and kappa agreement and precision by comparison to human summaries.

provided for that e-mail. To tune and evaluate the performance of the system, the automatic summaries produced were compared with summaries produced for 200 e-mails by 20 potential users of the system, with a minimum of 2 different human summaries for each e-mail. Agreement between judges ranged from κ = −.37 to κ = 1, with a mean of κ = .47, which indicates that agreement is far beyond chance, but also that the task of e-mail summarization is somewhat fuzzy for users. The goodness of automatic summaries was calculated by comparison with the corresponding human summaries; results can be seen in Figure 2.6. For each e-mail, automatic summaries were obtained using all of the applicable summarization strategies, based on linguistic information (lexical, structural, etc.), on e-mail specific information (appointment, attachment, etc.), on both (textual and documental), or applying baseline strategies, like taking the first line or paragraph as the summary. Human and automatic summaries were compared by κ agreement and by precision at discourse unit level. Agreement between human and automatic summaries was very low in terms of κ (average κ = .02), but evaluation metrics more usual for summarization, like precision with respect to human summaries, reached a 60% average, which is the state of the art for automatic text summarization. Results show that simple methods, like taking the first line of the e-mail (lead), offer very good results, but, in general, summaries exploiting e-mail specific knowledge (list, appointment) can improve on this baseline. However, these kinds of e-mail present very low coverage. The strategy combining general linguistic and e-mail specific knowledge (textual and documental) yields a good balance between coverage and precision. Finally, results concerning the chosen summary show that there is still room for improvement within the classification module, since most of the alternative summaries present higher precision rates than the chosen one.

2.4.2.2.6 Conclusions and Future Work We have presented Carpanta, an e-mail summarization system that applies a knowledge-intensive approach to obtain highly coherent summaries, targeted at guaranteeing understandability in delivery by phone. The performance of the system has been evaluated with a corpus of human-made e-mail summaries, reaching a level of agreement with users close to the agreement between human judges. However, results indicate that the classification module has to be improved, which will be done by manually extending the rules and by applying machine learning techniques. Given the highly modular architecture of Carpanta, adaptation to other languages has a very low development cost, provided the required NLP tools are available. Indeed, enhancements for Catalan and English are under development. Future work on this system should include modules that enable automatic normalization and correction of input texts. Climent et al. (2003) suggest that there is a special need for modules for: (a) punctuation recovery, (b) accent recovery, (c) mistake correction, and (d) terminological tuning according to users' profiles.

In this section we have argued that a certain representation of the discursive structure of texts can contribute to improving the performance of AS systems. Moreover, we have argued that such a representation can be obtained by exploiting shallow textual cues. We have supported this position by a review of previous work and also by analyzing in detail the performance of two systems that exploit discursive information, showing how this kind of information effectively contributes to the quality of the resulting summaries.

2.5 Discourse for text summarization via shallow NLP techniques

In this section we describe our approach to obtaining a representation of the discursive organization of texts that is useful for text summarization and can be obtained by shallow NLP techniques. The aim of our analysis is to identify content units at discourse level and the relations between them that allow us to determine their relevance within the text and the requirements for the final summary to be coherent, in terms of the configuration of discourse units that are needed. A very important aspect of our approach is that we want to obtain summaries not only for English, but also for other languages. We have focused on Catalan and Spanish, but the approach we present is arguably valid for languages with limited NLP resources in general. Thus, discursive analysis is based on the capabilities of the NLP tools available for Catalan and Spanish. As has been argued in the previous section, this supposes a limitation, but also an interesting perspective on the analysis of texts. Indeed, the obtained representations are data-driven and, as a natural consequence, the theory to which this evidence may be related has a strong empirical basis.

2.5.1 Delimiting shallow NLP Within AI and NLP, discourse has mainly been approached from knowledge-rich perspectives, resulting in theoretical proposals (Hobbs 1985; Polanyi 1988; Polanyi 1996), very restricted domains (Schank and Abelson 1977) or very restricted aspects of discourse (Hirschberg and Litman 1993). However, following the general interest in robust NLP systems, the interest in robust discourse parsing has also changed perspectives in discourse processing. More recent systems have shifted the emphasis from completeness to coverage: they provide shallow analyses but are capable of tackling a wider spectrum of cases, as has been explained in Section 2.4. From what has been presented in the previous section, it seems clear that the representation of the discursive aspect of texts can significantly improve the quality of resulting summaries, even if this representation is obtained by shallow NLP techniques. We have also argued that a shallow approach guarantees the robustness of the approach, and that it can be very informative of the organization of texts provided it is related to a theory of textual organization in a principled way. A shallow approach linked to a theory guarantees that the analysis procedure will fail only exceptionally, and that some interpretation of the text to be analyzed will be provided in most cases, even if this analysis is not very informative. This is due to the fact that the most basic principles of discursive structuring can be found in every textual document. This makes it possible to hypothesize a default representation of discourse structure to explain any document. This default can be improved or enriched if clues are found that allow us to draw more informative conclusions on the discursive organization of a given document, and these clues are related to a theory of textual organization in a principled way. The analysis of these clues can benefit from knowledge-rich, deep reasoning techniques, but it is also possible to carry it out with shallow techniques, as we have shown with the applications in Sections 2.4.2.1 and 2.4.2.2. To our knowledge, the current capabilities of the NLP tools available for Catalan and Spanish are: tokenization, morphological analysis and disambiguation, and chunking of simple phrases. All of them can be carried out with the FreeLing NLP toolset (FreeLing), an open source language analysis tool suite that provides this basic linguistic analysis for Catalan, Spanish and English. Deeper analyses, like full syntactic parsing, word sense labelling or propositional representation, cannot be obtained with a reliable degree of certainty for Catalan or Spanish, so we have taken the output of FreeLing as the basis for discursive analysis. There is a wealth of linguistic information that has been recognized as influencing the configuration of discourse, like information packaging, argumental schemata, syntactic functions or semantic content, including lexical semantics but also the resolution of referential expressions, presuppositions or implicatures. However, given the current limitations of the NLP tools and resources for Catalan and Spanish, we cannot exploit information that requires more than a partial syntactic analysis. Nevertheless, we resort to this kind of information to fully characterize the discursive phenomena from a theoretical point of view in Chapters 3 and 4.

Since discourse units and their relations are identified and characterized by simple techniques, the information provided by shallow textual cues is crucial. With the NLP capabilities described above, we can exploit: punctuation, discourse particles, some syntactic structures, especially conjunctive ones, patterns of lexical and referential expressions, and, in some languages, also word order. But the most useful of all the shallow clues available in a text are discourse markers, because they are highly informative of semantically rich relations between discourse units, and because they can be satisfactorily associated with a finite set of processing instructions that explicitly signal their contribution to the structure of discourse. This limited amount of information constrains the kind of representation that can be obtained, as is detailed in the specification of our targeted representation, described in Chapter 3. Although we aim to connect shallow evidence with theoretical insights on the organization of texts, we are not committed to a specific theory. Some of the discourse parsers discussed in Section 2.4 rely on well-established theories of discourse or some adaptation of them. For example, Corston-Oliver (1998), Marcu (2000) and Soricut and Marcu (2003) implement RST, and Schilder (2002) makes an adaptation of SDRT. In contrast, the DLTAG approach (Forbes, Miltsakaki, Prasad, Sarkar, Joshi and Webber 2003) does not rely upon a well-established theory, but builds upon empirical findings systematized in different studies. We are also eclectic as to our underlying theoretical model, which will be explained in the following two chapters.
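The idea that a discourse marker can be associated with a finite set of processing instructions can be sketched as a simple lexicon, as below. The entries, the English glosses and the field names are illustrative assumptions (the relation labels are borrowed from the example later shown in Figure 3.1); this is not the lexicon actually used in this thesis.

# Minimal sketch of a discourse-marker lexicon: each marker maps to the discursive
# meaning it signals and to which neighbouring segment it attaches. Hypothetical entries.
DISCOURSE_MARKERS = {
    "pero":     {"english": "but",     "meaning": "revision",     "attaches": "left"},
    "porque":   {"english": "because", "meaning": "cause",        "attaches": "left"},
    "asi que":  {"english": "so",      "meaning": "cause",        "attaches": "left"},
    "entonces": {"english": "then",    "meaning": "continuation", "attaches": "left"},
}

def lookup_marker(token):
    """Return the processing instructions for a marker, or None if the token is not one."""
    return DISCOURSE_MARKERS.get(token.lower())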

2.5.2 General approach to obtain a representation of discourse In our analysis, we prioritize precision over coverage and informativity, so that it can be guaranteed that every analysis is true with a reliable degree of certainty. As for coverage, we provide a default representation for all cases, so that none is left uncovered. But this representation is only as informative as the available evidence allows. Therefore, informativity is the varying parameter in our analysis scheme: the more evidence, the more informative the analysis will be; the less evidence, the less informative. As will be explained in the following chapter, this causes a conflict with the requirement that the representation of discourse be homogeneous, as proposed by some theoretical approaches. The basic procedure of analysis at discursive level follows Marcu (2000), as follows:

1. identify the textual evidence of structure that applies at the current level of analysis, and determine its function in case it is ambiguous,

2. determine the relative confidence of the textual evidence found, so that decisions can be made in case different pieces of evidence lead to conflicting analyses,

3. determine the discourse units at that level, probably building upon the result of the analysis at the previous level,

4. determine the relations between discourse units.
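The toy sketch below is one schematic, single-level instantiation of these four steps, using the example sentence later discussed in Figure 3.1. The segmentation on commas, the marker list and the conflict-resolution rule are simplifying assumptions introduced only to make the procedure concrete; they are not the actual analysis rules of this thesis.

import re

def find_evidence(segments, markers):
    """Step 1: locate discourse markers (the textual evidence) in each segment."""
    return [[m for m in markers if re.search(r"\b" + re.escape(m) + r"\b", s)]
            for s in segments]

def analyze(text, markers=("but", "because", "so", "then")):
    """Toy instance of the four-step procedure at a single, intra-sentential level."""
    # Step 3 (simplified): units are comma-delimited chunks of the sentence.
    segments = [s.strip() for s in re.split(r",", text) if s.strip()]
    evidence = find_evidence(segments, markers)
    # Step 2 (simplified): if several markers occur in a segment, trust the first one found.
    best = [e[0] if e else None for e in evidence]
    # Step 4: relate each segment to the preceding one, by the marker if present,
    # by a default relation otherwise.
    return [(i - 1, i, best[i] or "default") for i in range(1, len(segments))]

print(analyze("John loves Barolo, so he ordered three cases, but he had to cancel the order"))
# -> [(0, 1, 'so'), (1, 2, 'but')]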

In the next chapter we present a specification of the representation of discourse that we are aiming for. We present the starting assumptions of our work, and describe the basic

concepts of the representation. As in Polanyi (1996), we distinguish three basic elements to be taken into account in a model of discourse (in our case, low-level discourse): the structure of discourse, or theory of the organization of units; the discourse units that compose it; and a special kind of discourse unit that provides information about the relations of units within the structure, the so-called discourse markers.

2.6 Discussion

In this chapter we have analyzed the field of automatic text summarization (AS), highlighting open issues like evaluation and the properties that characterize an insightful, scalable approach to summarization. Concerning the evaluation of summaries, we have discussed that well-established methods do not provide a good assessment of summary quality, because they are based on the comparison with a single summary considered as a gold standard, even if it has been widely acknowledged that a unique gold standard cannot capture the variability that is intrinsic to the summarization task. We describe some recent methods that try to address this problem, and we argue that they could be enhanced by an automated analysis of the source text like the one we have proposed in this chapter. Then, we give a brief overview of previous work in the area of AS. We find that satisfactory approaches to AS exploit general properties of documents to guarantee the robustness of the strategy, while leaving room to gain deeper insight into the configuration of texts if richer information is available. We have found that shallow approaches are interesting because they seem to provide results of a quality comparable to that of approaches exploiting deep knowledge techniques, while having wider coverage, and they are more flexible in treating unseen cases. We have argued that a representation of the organization of texts at discourse level is useful for text summarization, and we have provided various examples of it, including our own work on the integration of discursive analysis in two AS systems. We have determined that such a representation should identify content units at discourse level, their relative relevance in the text and the coherence relations they establish with other units. Then, we have sketched our approach to obtaining a representation of discourse that relies on shallow NLP techniques, more concretely, those that provide a reliable analysis for Catalan and Spanish. This approach is discussed in the following two chapters: Chapter 3 describes the formal structure by which we represent discourse, the nature of discourse units and how they can be characterized via shallow NLP techniques. Chapter 4 deals with the meaning of discourse relations, delimiting what can be obtained by shallow NLP, proposing a method to induce meanings from textual evidence at surface level and establishing an inventory of discourse meanings that is useful to identify and characterize relevance and coherence relations between content units in a text.

CHAPTER 3

Specifying a representation of discourse

From what has been presented in the previous chapter, it seems clear that a certain representation of discourse may improve automatic text summarization. This representation should identify functional units at discourse level and also their role in the text, that is, how they are related to other units. Since we want to exploit this information for text summarization, we will focus on coherence and relevance relations, leaving any other kind of relation aside. Representations of discourse have typically been obtained by resorting to rich analyses of language, involving deep syntactic and semantic analysis and even some kind of reasoning. But all this is well beyond the scope of our current NLP capabilities. Given our framework, we obtain this representation of discourse by exploiting shallow cues via shallow techniques. As will be discussed in the following chapter, the information about discursive structure provided by this kind of evidence is reliable only at a short scope; therefore, only local representations of discourse can be reliably obtained. This constraint has determined that we represent discourse as a sequence of local structures and that different dimensions of meaning are described independently. In this chapter we characterize a computational structure to represent relevance and coherence relations in discourse, focusing on those relations that can be reliably obtained by shallow NLP techniques. The linguistic aspects of discourse relations are discussed in the following chapter. There, we present the shallow textual clues that we exploit to identify discourse relations in text, we delimit the kind of relations that we can reliably obtain from them and we describe the scope and meaning of these relations. The structure of the chapter is as follows. First of all, we state our assumptions on the organization of discourse. Then, we delimit the structure by which we represent discourse in Section 3.2. After that, discourse units are characterized, trying to capture theoretical insights in our computational framework. We describe discourse segments in Section 3.3 and discourse markers in Section 3.4.


3.1 Assumptions about the organization of discourse

First of all, it is a fundamental assumption in the area of computational discourse processing that discourse is coherent. But what does it mean that discourse is coherent? Following the cooperation principle (Grice 1969), in coherent discourse the speaker tries to convey relevant information in linguistic form so that the hearer can retrieve the relevant information therefrom. In order to guarantee that this communicative process is successful, the message has to conform to certain conventions on the encoding of information that are shared by both participants. It is by exploiting these conventions, reflected in linguistic form, that we can obtain a representation of the meaning shared between the participants. Just like the conversational maxims that follow from the cooperation principle, there are some principles of textuality that must hold for discourse to perform its communicative function properly. We are not explicitly adopting a concrete set of principles, but we will have this concept, and subsets of it, available as an axiom to explain why coherence holds or should hold at given spots in texts. Even if we do not go into detail about the principles of textuality, it must be noted that there are various well-established accounts of them in the literature. For NLP, a reference work on the principles of textuality is Halliday and Hasan (1976), who define cohesion as the set of possibilities that exist in the language for making text hang together (Halliday and Hasan 1976), and distinguish five such possibilities: reference, substitution, ellipsis, conjunction and lexical cohesion. Any or all of these mechanisms may be exploited to configure the discursive structure of a text. A usual assumption in discourse processing is that discourse is segmented, so that distinct units can be found in a discourse. It is a property of coherent discourse that every discourse unit is related to at least one other discourse unit or construct, like intentions, plans, etc. Discourse processing can be modelled as an incremental task, where higher-level representations are built compositionally upon lower-level ones. The form of linguistic utterances is systematically associated with some meaning. When a form has been associated with a meaning, it creates a new form that is available for the next level of processing. Therefore, distinct levels can be distinguished in discourse. Discourse levels tend to be multidimensional, that is to say, there are heterogeneous organization mechanisms that each account for part of the organization of discourse within a single level. These mechanisms usually interact, but an autonomous account of each of them can be pursued, just like the phonological, morphological, syntactic and semantic levels of linguistic structure: with relative autonomy but also relative dependency on each other. Every discourse unit has a relative salience within discourse. Psychological experiments (Kintsch 1988; Sanders and Spooren 2001) have shown that salient discourse units are available for further discourse, while non-salient units are not. The relative salience of each unit should be equivalent to some function of its discourse features.

3.2 A structure to represent discourse

3.2.1 Delimiting a representation of discourse Discourse representation has often been addressed as a knowledge-rich task. As a consequence, most approaches to representing discourse are based on a series of very strong underlying claims, which are well beyond the capabilities of shallow NLP. For example, it has often been claimed that discourse relations hold even if they are not overtly marked by any linguistic mechanism. However, with shallow NLP techniques we cannot identify a discourse relation unless there is a textual clue that overtly signals it. Therefore, we will establish content-rich relations when explicit textual clues signalling them can be found, and default relations elsewhere. Default relations are defined precisely as those relations that hold whenever no evidence can be found to support the existence of any other relation. Therefore, they are the least marked case in the range of possible analyses for a given case, and so they constitute the basis for an organization of discursive meaning in a range of markedness where defeasible inference can be naturally applied, as has been proposed both in formal discourse analysis and in shallow NLP. From the point of view of application, the fact that a default representation is always available guarantees the robustness of a system, because every text can be analyzed, even if the analysis provides only trivial information about the organization of the text. It is also a common claim that a text should be described by a single structure encompassing the whole of it. Some approaches even claim that this structure must be homogeneous, that is, that the structure must be constituted by relations of the same kind and, more importantly in our case, equally informative. For instance, if a text is represented by a hierarchical structure, the same set of relations must be able to explain relations between discourse units at different levels in this structure. In the case of shallow NLP, it is possible to identify content-rich relations between discourse units, but the kind of evidence whereupon we rely to identify these relations does not occur homogeneously throughout the text and it is only reliable at a short scope. If a text is to be fully covered by a homogeneous structure, content-rich relations cannot be represented, precisely because they are not homogeneous. If the requirement of homogeneity is discarded, a text can be fully covered by semantically heterogeneous relations, in our case, semantically rich relations vs. default relations. Many structures have been proposed to represent the organization of discourse. A classical view, directly rooted in the field of knowledge representation, proposes to represent texts as instances of pre-defined frames (Schank and Abelson 1977; McKeown and Radev 1995; Kan 2003; Teufel and Moens 2002a), where each unit is defined by its role in the frame (introduction, initiator, etc.). Another approach to the organization of discourse that has been exploited by many researchers is that proposed by Grosz and Sidner (1986). They represent discourse as a dynamic structure, more concretely, as a stack where discourse units are popped or pushed depending on the topical progress of the text, so that the topic of the discourse at

moment m is the topmost element in the stack at moment m. A number of researchers have proposed that discourse should be described as a tree-like structure where discourse units are represented as nodes and discourse relations are represented as edges linking these nodes (Hobbs 1978; Hobbs 1985; Mann and Thompson 1988; Polanyi 1988; Webber 1988). In this kind of structure, proximity to the root implies some kind of relevance or centrality to the topic of the discourse. But it has been argued that a tree-like structure fails to capture configurations of relations like those presented in Figure 3.1, and that a structure more expressive than a tree would be needed to represent these configurations, like a directed acyclic graph. Our proposal is that discourse is represented as a kind of highly restricted directed acyclic graph. We say that it is restricted because it is formed by trees, which represent local structures and which are linked with each other by non-hierarchical relations. Thus, the DAG we need to represent discourse is, properly speaking, a list of trees. Trees represent local structures constituted by hierarchical, content-rich relations between discourse units, obtained in the presence of explicit textual clues. Thus, these local structures are delimited by the absence of such clues. We consider that different dimensions of discursive meaning can be distinguished. These dimensions are represented by different trees covering the same set of terminals. The whole range of meanings in a dimension may hold between segments within a local structure, but structures are linked to each other only by default relations in each dimension. Default relations are the least marked case in the range of meanings in a dimension. It follows that in our approach the whole text is covered by a single structure, but this structure may be considered non-homogeneous from the point of view of the informativity of the relations. Indeed, default relations are qualitatively different from content-rich ones, since they are less informative. Thus, this representation is less informative than the traditional trees covering a whole text and leading to a single root, but it seems to capture a certain aspect of the organization of texts, namely, that texts can be regarded as a sequence of topic-like slots occupied by units that present a content-rich internal organization, as argued by Knott et al. (2001). Moreover, being a restricted DAG, its processing cost is rather low. Representing heterogeneous meaning in different dimensions makes it possible to explain configurations of discourse relations like that presented in Figure 3.1, which cannot be adequately captured by a strict tree-like structure. As presented in 3.1-(1), the relations d → b and d → c can only be captured by crossing branches. To explain the relations established in the text of Figure 3.1 without crossing branches, Webber et al. (2003) have proposed treating the relations introduced by anaphoric discourse markers like then as qualitatively different relations, resulting in the structure displayed in 3.1-(2). We propose that heterogeneous meanings are represented independently, by different trees, as in 3.1-(3). Besides overcoming the problem of crossing branches, such a multidimensional representation seems very adequate to capture the idiosyncrasies of discursive meaning, as we have already pointed out and will develop in the following chapter.
Although in our theoretical model we will always assume that heterogeneous dimensions of meaning can be represented by different trees, implementations for shallow NLP do not

a. John loves Barolo.
b. So he ordered three cases of the ’97.
c. But he had to cancel the order
d. because then he discovered he was broke.

[Diagrams: (1) a single tree over the segments a–d with the relations contrast [but], sequence [then], consequence [so] and explanation [because], which requires crossing branches; (2) the same relations with the link introduced by the anaphoric marker then treated as a qualitatively different relation; (3) separate trees, one per dimension of meaning, with structural relations (continuation [–], [then]) and semantic relations (revision [but], cause [so], [because]).]

Figure 3.1: Discursive relations that have to be expressed by crossing branches in a strict tree-like representation, and alternative approaches proposed to explain this configuration of discourse relations without crossing branches: as anaphoric relations (2) or by distinguishing different dimensions of meaning (3).

need to resort to a multidimensional tree. Indeed, the information provided by shallow evidence does not lead to a configuration of relations that needs to be represented by crossing branches in a strict tree-like structure. As said before, crossing branches are characteristic of discourse markers with anaphoric properties. These properties cannot be properly treated by shallow NLP techniques, because they require knowledge-rich processing and because they typically deal with long-scope relations, which we have left aside. We therefore represent discourse as a list of unidimensional trees, but represent relations between nodes in a tree so that they can be easily translated to multidimensional trees.

3.2.2 Specification of the structure As we have said, we represent the structure of discourse as a sequence of trees, without any restrictions as to their dimensionality. The properties of the proposed structure are the following (the terms node and discourse unit are used interchangeably, as are the terms relation and edge):

uniqueness each discourse unit is related to only one discourse unit in each of the possible dimensions of meaning. The set of discourse units is the same for each dimension of meaning.

full coverage each discourse unit is related to another discourse unit in each dimension of meaning, even if it is related only by the default relation¹. The only exception is the root of the tree (in our case, the root of the first tree in the list of trees), which is attached to itself, and to which subsequent nodes attach. It follows that the structure of discourse covers the whole text.

labelled edges are labelled, even if they are labelled by a default meaning.

binary relations are always established between pairs of discourse units conveying propositional content (discourse segments). A discourse unit that conveys only procedural content (a discourse marker) functions as an edge and label between a pair of discourse units other than itself. We argue that there is no need for n-ary relations for descriptive adequacy, because the so-called multinuclear relations, where the same relation is said to hold for an arbitrary number of units (for example, lists), can be well described as a concatenation of binary relations, where each unit is related to the unit immediately preceding it.

directionality and asymmetry the edge relating a pair of nodes is directed, departing from the discourse unit that is marked for a given relation. For example, if two discourse units are related, the one that is marked for a given relation is the one dominated by or containing the textual clue that signals this relation (discourse marker,

¹ For a detailed description of the default relations that account for the content conveyed in each dimension, see Section 4.2.3.

[Diagrams: on the left, a strict tree over the nodes A, B, C and D; on the right, the DAG obtained after inference, in which B, C and D each relate directly to A.]

Figure 3.2: Applying the regular inference processes associated with the default relation of continuation, a strict tree-like representation (left) can be transformed into a DAG (right).

syntactical structure, position in the sequence of text², etc.). This means that a marked discourse unit can only bear a relation with one other unit in a given dimension of meaning, but many units can establish a relation with an unmarked unit, as can be seen in the following example, where the unmarked unit, the main clause “Father Thadeus Nguyen Van Ly was arrested”, establishes a many-to-one relation with the marked units (b, c, d), which in turn establish only one relation each, with the main clause. Note that the directionality of relations is not necessarily mapped to the topographical configuration of the structure of discourse.

(4) [a Father Thadeus Nguyen Van Ly [b , a Roman Catholic priest in Viet Nam, b] was arrested [c in May 1983 c] [d after trying to organize an unauthorized pilgrimage d]. a]

There are some properties of discourse that make it possible to enrich a structure like the one proposed here by regular procedures, so that the computational cost of obtaining the structure is that of a tree, but the final structure can be as semantically rich as a DAG. What makes this possible is the fact that some discursive relations imply others, so that the implied relations can be obtained from the explicit ones. An illustration of one such process can be seen in Figure 3.2. Here we can see that the default structural meaning, continuation, is transparent and allows richer meanings to percolate. Thus, if a discourse unit B establishes an elaborative relation with a discourse unit A, C establishes a continuation relation with B and D also establishes a continuation relation with C, the continuation relations allow the elaboration meaning to percolate, so that it can be regularly inferred that C and D establish the same relation as B with A.
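A minimal sketch of this inference is given below: continuation edges are treated as transparent, so a unit inherits the richer relation of its nearest non-continuation ancestor. The representation (a dictionary of parent links) and the function name are assumptions made for illustration only, not the structures used in the actual implementation.

# Each unit points to (parent, relation); the root points to itself.
# Toy structure matching the example: B elaborates A, C and D continue their predecessors.
STRUCTURE = {
    "A": ("A", "root"),
    "B": ("A", "elaboration"),
    "C": ("B", "continuation"),
    "D": ("C", "continuation"),
}

def percolate(unit, structure):
    """Follow transparent continuation edges upwards and return the inferred relation."""
    parent, relation = structure[unit]
    while relation == "continuation":
        parent, relation = structure[parent]
    return parent, relation

for u in ("B", "C", "D"):
    print(u, percolate(u, STRUCTURE))
# B ('A', 'elaboration')
# C ('A', 'elaboration')
# D ('A', 'elaboration')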

3.3 Discourse segments

A discourse segment is a chunk of text that the speaker uses to do something

Moser, Moore and Glendening (1996)

² If there is no overt mechanism to relate the pair, we stipulate that the marked segment is the one coming later in the sequence of text.

In our representation of discourse, discourse segments are the subset of discourse units that contribute propositional content. We are working with minimal discourse units, thus not following a very important area of work that focuses on the segmentation of discourse at higher levels, like topics or intentions. The structure of this section is as follows. First, some previous work on discursive segmentation is presented in Section 3.3.1, highlighting the achievements and the questions that remain open, such as the relation between syntactic form, logical form and discourse units. Then, in Section 3.3.2 a definition of the concept is provided, in accordance with the framework described in the previous chapter.

3.3.1 Previous work on computational discourse segmentation There have been a number of approaches to the concept of discourse segment. Many approaches to discourse coherence focus mainly on the relations between discourse units, and pay little (if any) attention to the nature of the discourse units themselves. It is very common among these theories to establish a one-to-one mapping between discourse segments and the units that come by default from the linguistic level immediately underlying their discursive level of analysis. This is why minimal discourse units have usually been taken to be clauses (Grimes 1975; Givón 1983; Longacre 1983; Mann and Thompson 1988; Martin 1992), sentences (Halliday and Hasan 1976) or turns of talk (Carletta et al. 1996). Nevertheless, there are a number of proposals to define discourse segments, resorting both to their intrinsic and extrinsic features. Among the intrinsic features of discourse segments, the most exploited are their content, either in terms of propositional representation or informationally, and their linguistic form, whether orthographic, syntactic or prosodic. We will expand on these below; first, let us briefly discuss the use of extrinsic features to identify discourse units. We consider that segments are characterized extrinsically when they are defined in relation to phenomena that are beyond the scope of the segment itself. For example, Kan (2003) defines discourse units with respect to a prototypical structure of documents, in his case, medicine articles. Carletta et al. (1996) consider discourse units in dialogues as steps within an encompassing plan to achieve a goal, either of one of the participants in the dialogue or of both. It can be argued that discourse units are always defined with respect to something, since they are defined as belonging to a discourse. However, plain containment relations are less constrained than the relation that is established between a content-rich structure and the elements it contains. Therefore, containment relations are less informative, but also more flexible, that is, adaptable to different representation needs and capabilities. It is not clear that all the information conveyed by a content-rich structure will be useful for all discourse representation needs. In contrast to extrinsic properties, intrinsic features characterize discourse units with respect to properties that can be found in the unit itself. We are going to discuss two different approaches to the characterization of discourse segments by their intrinsic properties: one that defines them by their relation with boundaries at discourse level, and another that

defines them as units that convey a piece of meaning that is atomic at discursive level.

3.3.1.1 Segments delimited by their boundaries One of the most remarkable properties of discourse units is that they are found between discontinuities in the flow of discourse (Boguraev and Neff 2000). Various kinds of discontinuities have been found to be indicative of boundaries at discourse level:

orthographic: as said before, the default representation of minimal discourse units is typically defined by overt formal, structural boundaries like strong punctuation in written text or turns of talk and strong pauses in oral speech (Halliday and Hasan 1976; Hobbs 1985).

syntactic or intonational: a more accurate version of the previous approach consists in considering the boundaries of syntactic or intonational constituents as boundaries of minimal discourse segments. Many approaches consider clauses as minimal discourse segments (Grimes 1975; Givón 1983; Longacre 1983; Mann and Thompson 1988), but boundaries belonging to constituents smaller than clauses have also been considered as marking possible discourse segments (Schauer and Hahn 2000). It has been noted that rhetorical relationships³ can be found even within subclausal constituents; in example (5), the relation of alternative, typically expressed between two or more clauses, is found within the scope of a noun phrase:

(5) Let’s have no more of your neither-here nor-there observations. original example from Knott (1996)

In contrast with sentence and clause boundaries, subclausal constituent boundaries cannot be considered as discourse segment boundaries just by themselves; further conditions have to be fulfilled, usually related to the markedness status of the boundary or to the propositional or informational status of the content conveyed by the constituent. These other kinds of segment properties will be discussed below. Based on the view of discourse segments as clauses, Carlson and Marcu (2001) and Mimouni (2003) have developed systematic guidelines to increase the reliability and stability of human annotation of discourse. Their main contribution is a clear delimitation of the fuzzy concept of “clause” for the purpose of discourse segmentation. However, neither of these proposals consistently adheres to the conception of discourse segments as clauses. For example, Carlson and Marcu (2001) state that clausal subjects or objects are not to be considered as discourse segments, but some non-clausal constituents can be assigned the status of discourse segments, as long as they are marked by a discourse marker.

³ It can be considered that rhetorical relationships take discourse segments as arguments; therefore, they can be taken as good indicators of the presence of at least one discourse segment.

There are also a number of guidelines for discourse segmentation specifically targeted at annotating dialogue (Discourse Resource Initiative 1997; Cooper et al. 1999). These approaches rely basically on turns of talk as boundaries for discourse segments. However, they characterize discourse segments mostly by their content, relying on the notion of the intention of the speaker, which will be discussed below. Hirschberg and Litman (1993) define a discourse segment as a span of speech corresponding to an intonational phrase; therefore, whenever there is a change in tone, it can be considered as a discontinuity that marks a segment boundary.

referential and intentional: Grosz and Sidner (1986)'s model of the structure of discourse has inspired many approaches to the organization of discourse, including segmentation. Grosz and Sidner propose a three-dimensional organization of discourse, where each dimension accounts for a different aspect of discourse: linguistic form and propositional representation (linguistic level), topics dealt with in the discourse (attentional level) and plans of the speakers (intentional level). Discourse units are defined differently at each of these levels, and so the sets of units need not (and most probably will not) correspond across levels. Discourse units at the linguistic level are quite comparable to the default sentence-like discourse units, both in size and in form. Attentionally and intentionally, however, discourse units are very different. For one thing, they are much better defined by their content than by their relation with boundaries. They are also much bigger than the units we have presented until now. For these reasons, we believe that these discourse units belong to a higher level of discourse, and we are reluctant to call them minimal discourse segments. Nevertheless, attentionally and intentionally defined discourse units have been very influential in work on discourse segmentation, and the notions of topic and intention are useful to define relations between minimal discourse segments, so we are going to briefly discuss topic- and intention-based approaches to discourse segmentation. Most knowledge-intensive studies on discourse segmentation exploit the idea that segments correspond to speakers' or writers' intentions, while data-driven approaches exploit the distribution of topics in text to characterize segments. Passonneau and Litman (1997a) had human judges identify intention-based discourse segments in an oral narrative. Potential boundaries were boundaries of intonational phrases. They found that judges agreed significantly on which of the set of potential boundaries were actual segment boundaries. Most interestingly, they found an important correlation between marks like pauses and cue phrases and the likelihood of a boundary being a segment boundary. Passonneau and Litman established a gold standard segmentation of the corpus, and characterized segments with a number of features, including the presence of cue phrases, the nature of referential links between segments (if any), and their informational status. Then, this information was used to induce a decision tree to

determine whether each potential boundary was an actual segment boundary, using C4.5 (Quinlan 1993). Passonneau and Litman's best-performing decision tree obtains a recall of 53% and a precision of 95%. Interestingly, Hearst (1994) reports that a very simple, fully automatic topic segmentation algorithm, based on the repetition of words as the only similarity measure to determine topic segment boundaries, obtains 59% recall and 71% precision when compared to the manual topic-based segmentation of a text. Therefore, it is arguable that shallow approaches may yield satisfactory results for topic-based discourse segmentation. It remains to be seen whether such satisfactory results can also be obtained for intention-based segmentation and, more crucially for our purposes, for finer-grained segmentation.

Kozima (1993) and Hearst (1994) developed methods to characterize topic-like discourse segments automatically. Orthographic boundaries in the text (sentence or paragraph breaks) are taken as potential topic boundaries. Then, the words in each chunk of text delimited by potential topic boundaries are compared to the words of its neighbouring potential segments. If the vocabulary of neighbouring segments differs beyond a certain similarity threshold, a segment boundary is identified between the two segments, and they are said to belong to different topic segments. Similarities between words can be calculated in different ways: based on their similarity in form, including simple repetition, because they belong to the same semantic field, as encoded in a thesaurus, because they are near in an ontology like WordNet (Miller et al. 1991), etc. Yaari (1997) enhanced the basic method to find a hierarchical structure of topics in a text, that is, to identify progressively smaller topic segments within the ones already identified. Even so, topic-based segments are far coarser-grained than is needed to describe the local structures that are of interest to us.
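To make the kind of procedure described above concrete, the following sketch implements a very reduced, TextTiling-like segmenter that uses word repetition as its only similarity measure; the function names, the cosine measure and the threshold value are illustrative choices, not the exact settings of Hearst (1994) or Kozima (1993).

    # Minimal sketch of repetition-based topic segmentation (in the spirit of
    # Hearst 1994). Paragraph breaks are the potential topic boundaries; a
    # boundary is kept when the vocabulary of neighbouring blocks is too
    # dissimilar. All names and the threshold value are illustrative.
    import re
    from collections import Counter
    from math import sqrt

    def bag_of_words(text):
        """Lowercased word counts, the only evidence used for similarity."""
        return Counter(re.findall(r"[a-z]+", text.lower()))

    def cosine(a, b):
        shared = set(a) & set(b)
        num = sum(a[w] * b[w] for w in shared)
        den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def topic_boundaries(paragraphs, threshold=0.15):
        """Return indices i such that a topic boundary is posited between
        paragraphs[i] and paragraphs[i + 1]."""
        bags = [bag_of_words(p) for p in paragraphs]
        return [i for i in range(len(bags) - 1)
                if cosine(bags[i], bags[i + 1]) < threshold]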

It seems clear that discursive units at higher levels (topics, intentions) can be successfully characterized by their boundaries, even exploiting shallow analyses of text. We have also seen that, at lower levels, the boundaries identifying minimal discourse units are the same ones that characterize the biggest constituent at the linguistic level immediately preceding discourse4, mainly sentences, clauses or turns of talk. However, we have also seen that these units fall short of providing an adequate representation of discourse, as can be seen in the fact that guidelines for manual segmentation of text go far beyond the concept of clause. In short, we can say that the approach of defining discourse segments by their boundaries seems to provide a satisfactory representation of discourse. However, when a better understanding of the phenomenon is needed, for example, to provide guidelines to human judges for the task of segmenting a text, boundaries fall short of defining discourse segments. We can conclude that boundaries can be a very adequate approach to characterize discourse segments in cases where only a shallow analysis of text is available, for example, for

4assuming a layered analysis

identifying them automatically with shallow NLP techniques. However, a more insightful definition of the concept is needed if we want to obtain a representation of discourse aimed at understanding the text. Although this is not feasible with state-of-the-art NLP tools, especially for languages other than English, we should bear this target in mind to guarantee consistency with future developments in NLP technology.

3.3.1.2 Segments as atomic units of discursive meaning

In contrast to shallow analyses, knowledge-intensive approaches to discourse analysis define discourse segments by their content, rather than as chunks of text between discourse boundaries. As will be developed below, concerning their content, we consider that discourse segments are “linguistic units conveying a piece of meaning that works atomically at discourse level”. Depending on their affinity with our approach, we distinguish two kinds of approaches to the definition of discourse segments by their content.

A very extended approach defines discourse segments as conveying a piece of content that can be treated as a unit, irrespective of the atomicity of this content. Normally, these approaches do not pay much attention to minimal discourse units, but are more concerned with defining the content that characterizes them. However, it can be assumed that they consider minimal discourse units to be those in which no subunits can be distinguished. A good exponent of this line of work is Grosz and Sidner (1986), who define discourse segments as linguistic units conveying a topic or intention, either simple or composed of sub-topics or sub-intentions. This line has been very widely applied to manual discourse segmentation (Passonneau and Litman 1997a; Carletta et al. 1996; Discourse Resource Initiative 1997; Cooper et al. 1999).

The other approach that we distinguish is the one in which we place ourselves. In this approach it is assumed that a minimal discourse unit should convey content equivalent to a proposition. The classical formulation of this knowledge-intensive definition of discourse segment is provided by Polanyi (1996) within the Linguistic Discourse Model (LDM). Discourse segments are defined as a “contextually indexed representation of information conveyed by a semiotic gesture, asserting a single state of affairs or partial state of affairs in a discourse world”, or, as put more simply in (Polanyi et al. 2004), a unit that “communicates information about not more than one “event”, “event-type” or state of affairs in a “possible world” of some type”.

van Halteren and Teufel (2003) present a very interesting approach to determining atomic units of meaning at textual level, especially oriented to information condensation tasks. They define factoids as units of meaning that have to be distinguished in order to compare different summaries of the same text. The definition of the unit is very vague, because it is totally dependent on the summaries to be compared: factoids correspond to expressions in a FOPL-style semantics, compositionally interpreted. However, if a certain set of potential factoids always occurs together, this set of factoids is treated as one factoid. This is an important shortcoming for generalization, because it makes it difficult to determine the set of factoids that represent the content of a given text. Indeed, van Halteren and Teufel (2003) show that the number of factoids increases as new summaries

are introduced to the set of summaries for the same text, and this increase does not seem to reach a stable point as more judges are added to the initial inventory. The approach by van Halteren and Teufel (2003) clearly points out the most worrying limitation of an approach to discourse segments based on their meaning, namely, that meaning is a continuum that can be split in many different places by different judges. Therefore, formal indications of boundaries can be very useful as landmarks to assess discourse segmentation tasks. Indeed, we have seen in the previous section how discontinuities are indicative of segment boundaries. Units5 of linguistic form can also be considered indicative of units of discursive meaning. It is a frequent assumption that an atomic meaning has to be expressed by an atomic linguistic form, as recently synthesized by Polanyi et al. (2004), who define discourse segments as “syntactic constructions that encode a minimum unit of meaning and/or discourse function interpretable relative to a set of contexts”. However, it is not clear which “syntactic constructions” convey a minimum unit of meaning at discourse level. Polanyi et al. provide an extensive list of syntactic constructions that present different discursive behaviours; Carlson and Marcu (2001) and Mimouni (2003) also devote lengthy sections to describing the discursive semantics of different syntactic constructions.

We can conclude that discourse segments are usually defined by their content in well-informed theoretical models of discourse. Defining them by their content seems to provide a criterion for distinguishing them where form falls short. For example, many of the possible indicators of discourse segment boundaries are highly ambiguous, but an indicator can be disambiguated if the content of the potential discourse segments it separates fulfills what is required for a span of text to be a segment6. In example (6) we can see that a comma marks a segment boundary in (a) but only an enumeration in (b), and it is the fact that the constituents separated by the comma in (a) convey content equivalent to a full-fledged proposition that allows us to identify this comma as an effective boundary.

(6) a. [ Torture of political detainees in Mauritania has been routine since 1986 ] , [ but it has never before been used on such a scale ] .
b. [ Thirty-six other lawsuits against 35 National Police officers ] , [ three Treasury Police officers ] , [ 10 civilians ] and [ three judges for a wide range of abuses against street children are reportedly pending in Guatemalan courts ] .

However, content alone is also insufficient to provide a reproducible definition of discourse segment, as shown by van Halteren and Teufel (2003). In conclusion, both propositional content and formal indications of boundaries have to be taken into account to achieve a descriptively adequate definition of discourse segment that is useful for implementation in a robust discourse parser.

5The difference between units and boundaries is that units are endocentric, that is, they can be seen as a hierarchical structure with a visible head, while boundaries are properties of strings, regardless of any head. 6In our case, that it conveys information about a single event and that this information has a linguistic form that makes it autonomous of other information about the same event that may be conveyed in the same text.

3.3.2 Definition of discourse segment

In what follows we present our definition of a common concept of minimal discourse segment, targeted at representing relevance and coherence to improve text summarization via shallow NLP. If various dimensions of meaning are distinguished, discourse segments could be defined in each dimension exclusively on the basis of the features that are relevant in that dimension, as is the case in Grosz and Sidner (1986). This would probably yield non-comparable segments in each dimension, which makes it more difficult to obtain a unified representation of discourse to account for general coherence and relevance. For one thing, interpretation heuristics work collaboratively across all dimensions, which makes it convenient to work with a common set of units. On the other hand, there are reasons to believe that minimal discourse segments can be thought of as a general linguistic entity, rather than as a mere theoretical concept that is descriptively adequate only within a concrete model of discourse, that is, the same kind of unit as phrases or clauses. We try to support this claim with human judgements in Section 5.2. We take clauses as the default minimal discourse segment7. The one-to-one mapping between clauses and discourse segments is an oversimplification, since most discourse segments are smaller. As we showed in Alonso and Fuentes (2002), representing text in units smaller than full sentences improves the quality of automatic summaries. In short, we consider that discourse segments are:

– which may be discontinuous, but in all cases self-contained,
– between discontinuities at discourse level,
– not breaking any simple syntactic or intonational constituent,
– not breaking apart the constituents of any argumental nucleus or prosodic curve,
– conveying a single event,
– contained by a single unit of meaning in higher levels of analysis
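As an informal illustration of how units with the properties listed above might be encoded, the sketch below represents a (possibly discontinuous) discourse segment as a list of token spans within a sentence; all names and values are invented for illustration and do not correspond to the actual data structures used in Chapter 5.

    # Illustrative representation of a (possibly discontinuous) discourse
    # segment as a list of token spans within one sentence. Names and example
    # values are invented; this is not the thesis' actual data model.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class DiscourseSegment:
        sentence_id: int
        spans: List[Tuple[int, int]] = field(default_factory=list)  # (start, end) token offsets

        def is_discontinuous(self) -> bool:
            """True when the segment is made up of more than one span."""
            return len(self.spans) > 1

    # A matrix clause wrapped around a parenthetical relative clause would be
    # one segment with two spans (illustrative offsets only).
    matrix = DiscourseSegment(sentence_id=13, spans=[(0, 5), (14, 25)])
    assert matrix.is_discontinuous()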

Of course, the exploitation of the features listed here as characteristic of discourse segments for automatic analysis is subject to the availability of automatic analyzers that can recognize argumental structure, prosodic curves, etc. We will go deeper into the characterizing features of this definition in what follows. Since we are not particularly concerned with oral discourse, the intonational properties of discourse segments will not be discussed further. Following previous work, our definition of discourse segment presents two different aspects: external, sufficient conditions for the identification of segments, and internal, necessary conditions for a segment to be established. We identify shallow textual correlates for both of them, which are exploited in Section 5.4.

7If clauses cannot be identified within sentences, then sentences are the default discourse segment

3.3.2.1 Discourse segments are identified by their boundaries

The external evidence for identifying discourse segments consists of the discontinuities in the flow of discourse (Boguraev and Neff 2000), identified by the following shallow textual correlates:

punctuation: as in the following example, where commas help to detect phrases that can function as discourse segments.

(7) The show, presented by Caterer Hotelkeeper, is at the Wembley Conference Centre, London, on 6 and 7 November.
a. The show is at the Wembley Conference Centre
b. presented by Caterer Hotelkeeper
c. London
d. on 6 and 7 November

Two kinds of punctuation can be distinguished, depending on their likelihood of marking discourse segment boundaries: strong punctuation, like full stops, is unambiguous evidence of boundaries, while weaker punctuation, like commas, is not reliable enough to be taken as full-fledged evidence of discursive boundaries and needs to be supported by other evidence. In the previous example, commas are supported by: a hypotactic syntactical construction in (b), a location phrasal head in (c) and a time phrasal head in (d).

syntactical structures: the boundaries of syntactical constituents are potential discourse segment boundaries. Some kinds of boundaries are more likely to constitute discourse boundaries than others: constructions headed by a verbal form, inflected or non-inflected, paratactic or hypotactic, are the most likely to constitute a discourse segment boundary; good proof of this is that they are normally accompanied by other evidence of boundaries, like punctuation or discourse markers. In example (8), constructions (c) and (d), headed by participles, constitute discourse segments. The boundary of segment (c) is supported by a potential discourse marker, “after”.

(8) George Mtafu, Malawi’s only neurosurgeon, was arrested after refusing to apologise for challenging public criticisms of northern Malawians made by Life-President Banda.
a. George Mtafu was arrested
b. Malawi’s only neurosurgeon
c. after refusing to apologise for challenging public criticisms of northern Malawians
d. made by Life-President Banda.

The boundaries of constructions headed by verbs are normally very marked structurally, but this is also the case for some kinds of phrases, like those conveying a large amount of information, or those that are very peripheral to the clause, as can be seen in example (9).

(9) Over 150 men, women and children were killed by Mali’s security forces in March, after a wave of pro-democracy demonstrations and riots.
extracted from BNC
a. Over 150 men, women and children were killed by Mali’s security forces in March
b. , after a wave of pro-democracy demonstrations and riots.

discourse markers: discourse markers can be considered discontinuities because they do not behave as content words or as clausal function words, as will be discussed in the next section. An example of a discourse segment boundary marked by a discourse marker can be seen in example (10).

(10) I like to call it [natural gas] hemispheric in nature because it is a product that we can find in our neighborhoods.
George Bush, declaration, Austin, Texas, December 20, 2000
a. I like to call it [natural gas] hemispheric in nature
b. because it is a product that we can find in our neighborhoods.

As we have seen so far, different shallow cues have different degrees of reliability as signals of discourse segment boundaries. We have found that taking these different reliabilities into account improves the accuracy of the automatic discourse segmentation algorithm presented in Section 5.4.
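The following sketch illustrates how such differences in reliability might be operationalized: each candidate boundary is scored by the cues that support it, so that strong punctuation suffices on its own, while a comma only counts when supported by further evidence. The cue inventory, weights and threshold are assumptions made for the sake of illustration; they are not the actual parameters of the segmenter of Section 5.4.

    # Illustrative scoring of candidate segment boundaries by shallow cues of
    # different reliability. Strong punctuation is sufficient on its own; weak
    # punctuation needs support from a discourse marker or a verbal/hypotactic
    # construction. Weights and threshold are assumptions for illustration only.
    CUE_WEIGHTS = {
        "strong_punctuation": 1.0,   # full stop, colon, semicolon
        "weak_punctuation":   0.4,   # comma
        "discourse_marker":   0.5,   # item from the discourse marker lexicon
        "verbal_head":        0.5,   # candidate span headed by a (non-)finite verb
    }

    def is_segment_boundary(cues, threshold=0.8):
        """cues: set of cue names observed at a candidate boundary position."""
        score = sum(CUE_WEIGHTS.get(c, 0.0) for c in cues)
        return score >= threshold

    # A comma alone is rejected, but a comma followed by a discourse marker is accepted:
    assert not is_segment_boundary({"weak_punctuation"})
    assert is_segment_boundary({"weak_punctuation", "discourse_marker"})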

3.3.2.2 Atomic meaning in discourse segments

As we have said before, in addition to boundaries, the definition of segments has to be completed by a definition of their content. In a shallow NLP approach, boundaries can be identified by simple pattern-matching operations or by exploiting shallow analyses, while systematizing content usually requires large knowledge bases and complex reasoning mechanisms. As an illustration, let us consider the unmarked content of a discourse segment. The argumental core of a clause can be considered the unmarked content of a discourse segment. However, deep NLP tools are necessary to determine the argumental core of a clause. Shallow NLP falls short in this respect, because shallow correlates of predicate–argument (vs. predicate–adjunct) relations can be very ambiguous, as can be seen in example (11), where the argument of the verb (c) is signalled by the same mark, for, that in example (12) signals the non-subcategorized adjunct (d), liable to be a discourse segment.

(11) Amnesty International is a worldwide human rights movement which works impartially for the release of prisoners of conscience.
a. Amnesty International is a worldwide human rights movement
b. which works impartially
c. for the release of prisoners of conscience

(12) The five [prisoners of conscience held in Swaziland] were previously imprisoned from June until October 1990 for allegedly organizing a political party [...]
a. The five were previously imprisoned
b. from June
c. until October 1990
d. for allegedly organizing a political party

Therefore, the requirements on the content conveyed by discourse segments that we provide here have to be understood more as an ideal target than as something that can be exploited in an effective system that identifies discourse segments automatically. Systematization of content is also useful to guide human judges in segmentation tasks. We expect that the development of NLP tools will make it possible to exploit these features for the automatic identification of discourse segments in the near future. Very intuitively, discourse segments can be defined as linguistic units conveying a piece of meaning that works atomically at discourse level. But then: what do we understand by linguistic units? What is an atomic piece of discursive meaning? As said before, the default discourse segment is the clause. However, other linguistic units may also be considered; the only requirements with respect to their linguistic form or structure are:

they cannot overlap with other discourse segments

they may be discontinuous, as in the following example, where a discourse segment is constituted by the two elements found on each side of a parenthetical segment, in this case a relative clause, that is inserted in between.

(13) The Expert on Equatorial Guinea, a country which receives assistance under the UN Advisory Services Program, has been requested to study the human rights situation there.

they have to be self-contained, that is, they have to be a full linguistic unit (phrase or clause): they contain exactly those elements that are necessary to constitute a complete linguistic structure. In the following example, the matrix clause cannot be considered an autonomous discourse segment, because the two instances of “so” are incomplete and require supplementation from another constituent to constitute a complete linguistic structure, in this case, a clause.

(14) Ingredients are so fresh and so quickly prepared that food poisoning is not considered a threat.

We understand that a piece of meaning works atomically at discourse level when none of the parts into which it could be decomposed can work autonomously in its context without the resulting text being infelicitous (Alonso and Castellón 2001) or experiencing a change in the meaning it conveys, either propositionally or implicationally. In example (15), it can be argued that (15-a) is a minimal discourse segment because it conveys a minimal unit of content at discursive level. As can be seen in (15-b), it can be decomposed into various units that globally convey the same propositional meaning as (15-a) and are still grammatical and felicitous. However, the second phrasing conveys a different discursive meaning, mainly due to its different informational structure.

(15) a. Zog dropped the stone.
b. The stone fell. It hit the ground. Zog did it.

original example from Knott (1996)

To sum up, we believe that discourse segments are closer to fragments (Merchant 2003) than to clauses. Their configuration is determined by communication mechanisms, and not by the rules of clausal syntax or semantics, although it is consistent with them. The necessity for a discourse-based functional unit is discussed in Polanyi (1996), who argues that other linguistic units are not adequate for modelling discourse. In conclusion, both external and internal features are necessary to achieve an adequate definition of discourse segment, both to guide human judgements and to shape an algorithm for the automatic identification of discourse segments, even with a shallow NLP approach. See example (16) for an illustration. All the sentences in this example convey the same propositional information, namely:

∃BertrandRussell ∧ ∃book ∧ write(BertrandRussell, book) ∧ ¬(read(we, book))

However, their discursive segmentation is different. In (a), no segment can be identified because there is no constituent that works autonomously at the level of linguistic form8. In contrast, segments can be identified in (b) to (e), because there are sentence constituents or even sentences that can work autonomously (signalled by square brackets).

(16) a. [We didn’t read Bertrand Russell’s book.]
b. [We didn’t read the book] [by Bertrand Russell.]
c. [We didn’t read the book] [written by Bertrand Russell.]
d. [We didn’t read the book] [that was written by Bertrand Russell.]
e. [We didn’t read the book.] [It was written by Bertrand Russell.]

The clearest case is (e), where there are two distinct sentences, conveying two different events, so that two different discourse segments can be clearly distinguished. The rest of the cases are more controversial. In (d), the relative clause has an inflected verb, so it is fairly objective to say that it conveys information about a different event and should therefore be considered a discourse segment. In (c), there is no inflected verb, but the participle can be thought of as conveying an eventuality at the same level as a relative clause. (b) is the most controversial case. Both in form and in meaning, (b) is closer to (a) than to (e): the prepositional phrase does not contain any linguistic form that could serve as the propositional basis for an event about which something is said. Yet we are willing to say that the prepositional phrase in (b) is a discourse segment, while the adjective phrase in (a) is not. There are various reasons to assert this. First, this phrase does not belong to the argumental core of the clause to which it is attached, and it is not linearly situated within it, but on the right periphery of the clause. Intonationally, this phrase has a high probability of being distinguishable within the prosodic contour of the whole sentence.

8Note that van Halteren and Teufel would have no problem identifying a segment in (a), since they rely exclusively on the content conveyed by utterances, and not on their form.

3.4 Discourse markers

In our representation of discourse, discourse markers are discourse units that contribute virtually no propositional content to the structure of discourse, but that provide information on properties of the structure itself. Discourse markers can determine the boundaries of discourse units conveying propositional content, identify relations between them and specify the structural and semantic meanings of these relations. These functions can be carried out by many linguistic devices, but, as will be explained in Section 4.1, discourse markers are particularly well suited to our framework and objective, because they are highly informative of discourse structure, while treatable with shallow NLP techniques. The adequacy of discourse markers as highly informative clues for formal discourse analysis is lessened by a lack of consensus about what is and what is not a discourse marker, neither from a theoretical standpoint nor for applied purposes. Therefore, we will provide our own definition of the concept, a definition that is coherent with the representation of discourse we have described so far, that is sufficient for us to exploit the information provided by discourse markers for our needs and that is adapted to our capabilities. This section is organized as follows. First, we give a brief overview of previous work on discourse markers from a computational perspective. Section 3.4.2 presents an indicative definition of the concept, which is implemented in a lexicon in Section 3.4.3.

3.4.1 Previous computational approaches to discourse markers

Here we present a very brief overview of the computational study of discourse markers. Since we focus on formal perspectives, we leave aside the extensive work on discourse markers that has been carried out in the fields of discourse analysis and descriptive grammar.

As we have already pointed out, there is no consensus about what should and should not be considered a discourse marker. To begin with, different terms have been used to refer to linguistic elements (and sometimes also extra-linguistic ones: gestures, movements) with a discursive function: many different items have been named “discourse markers”, and what we call discourse markers have been named by different terms, like discourse particles, cue phrases, clue words, connectors, conjunctives, connectives or even textual particles. Here we adopt the name discourse markers because this seems to be the standard term in the area of automatic text summarization after the work of Marcu (1997a), rooted in Schiffrin (1987). Even if we leave aside differences in terminology, the scope of the concept varies greatly across approaches9. Linguistic devices conveying discursive effects are not as well studied as elements whose scope can be delimited to phonological, morphological, phrasal or even clausal levels. Therefore, the distinctions between different elements at discourse level are fuzzy, and it is not uncommon that the same term (discourse marker or another) is used to refer to very different sets of linguistic elements, covering virtually any element whose effects require more than a propositional representation to be adequately described, ranging from words eliciting coherence relations to items that signal a relation between the participants of the communicative act.

In NLP, discourse markers have been exploited for a variety of purposes, in the hope that they would help systems to capture the intentions of the user (in the case of analysis) and to achieve more natural and comprehensible output (in the case of generation). They have played an important role in NL Generation systems (Hovy 1988; Elhadad and McKeown 1988; Elhadad and McKeown 1990; Di Eugenio et al. 1997) and, more recently, they have gained importance as evidence to obtain the discursive structure of text (Marcu 2000; Schilder 2002). However, most of these approaches have not bothered to provide a solid definition of the concept of discourse marker, but have just employed a set of items that have typically been considered discourse markers (then, because, however) or have used sets defined by someone else. In some cases, discourse markers are defined as those devices that signal the kind of discourse relation that is of interest for a given application. This approach is not useful to us, as we use discourse markers precisely to determine the set of relations to represent discourse. In other cases it has been said that discourse markers are those items that can be found at the boundaries of discourse segments. Again, we exploit discourse markers to identify discourse segments. It follows that discourse relations and discourse segments cannot be used as evidence to define discourse markers, because it is precisely discourse markers that constitute the evidence that defines them. Theoretical or descriptive linguistics does not provide much help with the definition of discourse marker. There are many definitions of discourse markers or similar concepts, but they are either too informal for computational use, or else they do not cover the kind of linguistic items that are useful for NLP applications.

For example, the reference work on discourse markers for Spanish in descriptive grammar, Martín Zorraquino and Portolés (1999), considers as discourse markers only those items that have scope over a whole inflected sentence, like however or on the other hand, leaving aside items like because, but, so, etc., which provide very useful information for NLP applications (namely, coherence relations between spans of text). The work rooted in conversation analysis defines discourse markers as characteristic of oral language, conveying information about the relations between (models of the) participants in the communicative act, thus leaving aside not only items like because, but also items like however. In this context, the work of Knott (1996) is especially important because it provides a method to determine what is a discourse marker for computational purposes, shown in Figure 3.3. In short, Knott proposes that any word or phrase that is infelicitous when it is left alone with its host clause, without any further context, must be considered a cue phrase10. The core idea of the test is that cue phrases have a supraclausal function, which is clearly seen when the clause is isolated. Despite the importance of this work, one main criticism has to be made of Knott's approach: the fact that only inter-clausal connectives qualify as discourse markers leaves aside a whole range of prepositions whose meaning is comparable to that of inter-clausal connectives, as discussed by Stede and Umbach (1998) and Schauer and Hahn (2000). Examples of the equivalence of clausal and phrasal connectives are shown below.

9For an extensive review of different perspectives on discourse markers, including a comparative exposition of different terms and their scope, see (Bordería 1998)

(17) (irregular(count(votes))) ∧ (win(Bush, election)) ∧ ((irregular(count(votes)) ⇒ ¬(valid(election))) ∧ ((¬(valid(election))) → ¬(∀(x), win(x, election))))
a. Although votes were counted with most irregular procedures, Bush won the election.
b. Despite irregularities in the counting of votes, Bush won the election.

(18) ∃(release, before(P))
P = ((arrive(x, Swaziland)) ∧ (want(x, (talk(x, government)))) ∧ (represent(x, AI)) ∧ (two(x)))
a. The releases occurred shortly before two AI representatives arrived in Swaziland for talks with the government.
b. The releases occurred shortly before the arrival in Swaziland of two AI representatives for talks with the government.

10Knott calls cue phrases what we call discourse markers.

1. Isolate the phrase and its host clause. The host clause is the clause with which the phrase is immediately associated syntactically; for instance, if the passage of text to be examined is

(4.1) ... John and Bill were squabbling: John was angry because Bill owed him money. That was how it all started ...

then the isolated phrase and clause would be

(4.2) because Bill owed him money.

2. Substitute any anaphoric or cataphoric terms in the resulting text with their antecedents, and include any elided items. For the above clause, this would result in

(4.3) because Bill owed John money.

Propositional anaphora within the candidate phrase itself should not be substituted, however. Thus if the candidate phrase is because of this, the propositional anaphor this should remain.

3. If the candidate phrase is indeed a relational phrase, the resulting text should appear incomplete. An incomplete text is one where one or more extra clauses are needed in order for a coherent message to be framed. The phrase because Bill owed John money is incomplete in this sense: it requires at least one other clause in order to make a self-contained discourse. Even the fact that it could appear by itself on a scrap of paper (say as an answer to a question) does not make it complete; the question is essential context if it is to be understood. Note that it is only additional clausal material which is to be removed in the test. Any additional contextual information necessary for the comprehension of the clause (for instance, knowledge of the referents of definite referring expressions like John and Bill) can be assumed to be present.

4. Any phrases which refer directly to the text in which they are situated (such as in the next section, as already mentioned) are to be excluded from the class of relational phrases. Such phrases pass the test, but only because their referents have been expressly removed through the operation of the test itself.

5. Phrases which pass the test only because they include comparatives (for instance more worryingly, most surprisingly) are also to be excluded from the class of relational phrases. Stripped of the comparatives, such phrases do not pass the test. Comparatives like more and most introduce a very wide range of adverbials, bringing the compositional resources of the language quite strongly into play. Since we are more interested in stock words and phrases that have evolved to meet specific needs, phrases involving comparatives will not be considered as relational phrases.

6. Sometimes, more than one cue phrase can be found in the isolated clause (eg and so, yet because). In such cases, both phrases should pass the test when considered individually in the same context. In other words, the host clause should appear incomplete with either phrase.

Figure 3.3: Test proposed by Knott (1996, pg. 64) to determine whether a certain textual clue is marking a discourse relation between two clauses.

3.4.2 Definition of discourse marker

In this section we define the concept of discourse marker within our framework, shallow NLP, and with respect to our objective, obtaining a representation of discourse that is useful for relevance and coherence assessment oriented to text summarization. In contrast with the definition of discourse segments, here we will not be exhaustive with regard to the properties of discourse markers. The purpose of this definition is to sketch the main properties of discourse markers, based on Knott's test but trying to incorporate items that do not have clausal scope but nevertheless signal the same kind of information. In order to complete this definition, we will provide an operative delimitation of the concept by extension. In the following section we discuss how a set of prototypical discourse markers can be systematized in a lexicon. Then, in Section 5.5, we show how discourse markers can be automatically identified by some features that are not strictly defining, in the sense that they may not necessarily be applicable to all discourse markers or else may be applicable to linguistic items that we do not consider discourse markers. In short, we consider that discourse markers are:

– occurring in written text of standard language, which can also occur in informal, oral text, but are not exclusive to this register
– lexical items, atomic or composed of more than one word
– items that elicit a relation between spans of text that are discursive units, so that isolating the discourse marker and its host clause yields an infelicitous text
– items that have scope over a clause or another linguistic unit that can be represented as a stand-alone proposition

In this definition we explicitly leave aside all discourse markers that are specific to oral language, because our work is focused on written text. We thus also leave aside the whole area of study of discourse markers that is centred on their role in dialogue. As said before, discourse markers of oral language (better known as discourse particles) tend to convey a qualitatively different kind of discursive information, usually eliciting relations between participants in the communicative act (for example, between beliefs, models of common ground, etc.). We only consider lexical items, excluding punctuation and/or intonation, because the latter do not provide the kind of content-rich information that can be obtained from lexical discourse markers. Unlike Knott, we do not only consider discourse markers that dominate a clause, but also discourse markers with comparable discursive effects, like those displayed in examples (17-b) and (18-b). We consider that discourse effects are comparable when the linguistic constituent dominated by a discourse marker can be represented as a full proposition, as is the case in (17-b) and (18-b), where the underlined spans, dominated by discourse markers, are not clauses (because their highest syntactic projection is not an inflected

verb) but are propositionally equivalent to the underlined segments in (17-a) and (18-a), which are full-fledged clauses. This propositionability test is coherent with the fact that discourse markers elicit a relation between spans of text with the status of discourse units. As Polanyi (1996) states, a discourse unit is defined as a “contextually indexed representation of information conveyed by a semiotic gesture, asserting a single state of affairs or partial state of affairs in a discourse world”. The propositionability test precisely checks that the textual spans related by a discourse marker are indeed indexed with respect to some world, that is, that they are propositions having a truth value in some possible world. The requirement that a discourse marker relates discourse units captures the basic insight of Knott's test. Indeed, if a lexical item has a relating function, it is necessarily infelicitous when it is left alone with its host clause, without any further context. In this definition we have not included features that have often been considered characteristic of discourse markers: invariability, part of speech, discursive meaning... As said at the beginning of this section, this definition does not aim to be exhaustive, but is only part of a delimitation of the concept that will be completed in the following section by a description of the most characterizing features of a set of prototypical discourse markers.

3.4.3 Representing discourse markers in a lexicon

In this section we attempt to complete the definition of discourse marker by extension. We describe a set of prototypical discourse markers, parallel in three languages: Catalan, Spanish and English. They have been systematized to constitute a lexicon that will be used for computational purposes, more concretely, for the analysis of natural language via shallow techniques to identify coherence and relevance relations useful for text summarization. This section is structured as follows. First, we briefly review the state of the art in discourse marker lexica for computational purposes, showing that there are very few resources, especially few that are oriented to analysis and have broad coverage, and none of this kind for languages other than English. Then, we describe our own lexicon, constituted by a small set of prototypical discourse markers parallel in Catalan, Spanish and English. Finally, we conclude by presenting the features that we have found most characterizing of this set of prototypical discourse markers, as represented in this computational lexicon.

3.4.3.1 Existing discourse marker lexica

Two main approaches can be distinguished in the efforts to systematize discourse markers in a lexicon: the classical lexicographic approach and works that aim to build a resource for computational use. The former are useful to us in that they provide insightful perspectives on what constitutes an adequate description of discourse markers, and in that they aim at very broad coverage. However, they provide a rather informal account of the properties of discourse markers, and the set of items that they include does not necessarily correspond to our interests.

The computational approach is interesting in that the features of discourse markers are formalized and systematized, and it is rather immediate to incorporate the information in a computational resource into a computational application. However, most of the existing lexica are oriented to natural language generation, and so they describe aspects of discourse markers that are not of interest to us. Moreover, since they are to be used in systems typically working in restricted domains, their coverage is rather limited. Furthermore, no resource of this kind exists for Catalan or Spanish, even though there are some important lexicographic resources for discourse markers (Santos Río 2003). To our knowledge, there is only one broad-coverage, analysis-oriented computational discourse marker lexicon, the one created by Marcu (1997b) and developed in Marcu (2000). Marcu (1997b) enhances the initial set of Knott (1996) and provides an account of more than 450 cue phrases, based on the study of 7600 text fragments. Based on an extensive corpus study, each instance of a given cue phrase is described with information that will be generalized and exploited by a constraint-based algorithm to determine the discourse relation expressed by discourse markers:

marker: the discourse marker under scrutiny, all the punctuation marks that may precede or follow it and all adjacent markers, if any.
usage: whether the discourse marker has a sentential, discourse-semantic or discourse-pragmatic role.
position: of the discourse marker under scrutiny in the textual unit to which it belongs: at the beginning, in the middle or at the end.
right boundary: of the minimal unit in which the discourse marker is found.
where to link: describes whether the textual unit that contains the discourse marker under scrutiny is related to a textual unit found before or after it.
rhetorical relation: the rhetorical relation signalled, one of a set of RST-like relations11.
rhetorical statuses: of the related spans (nucleus or satellite, following RST).
textual types: of the related spans (clauses, sentences, paragraphs, multi-paragraphs).
clause distance: between the spans related by the discourse marker, in clause-like units.
sentence distance: between the spans related by the discourse marker (sentences).
distance to salient unit: to the textual unit that is the most salient unit of the span that is rhetorically related to a unit before or after the one under scrutiny (clauses).
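Purely as an illustration of how instance-level descriptions of this kind can be encoded, the record below paraphrases the features listed above; the field names and the example values are invented and do not reproduce Marcu's actual data.

    # Hypothetical encoding of one corpus instance of a cue phrase, with fields
    # paraphrasing the features listed above (Marcu 1997b). Values are invented.
    from dataclasses import dataclass

    @dataclass
    class CuePhraseInstance:
        marker: str                  # marker plus adjacent punctuation/markers
        usage: str                   # "sentential" | "discourse-semantic" | "discourse-pragmatic"
        position: str                # "beginning" | "middle" | "end" of its textual unit
        right_boundary: str          # right boundary of the minimal unit containing the marker
        where_to_link: str           # "before" | "after"
        rhetorical_relation: str     # RST-like relation signalled
        rhetorical_statuses: tuple   # statuses of the related spans (nucleus/satellite)
        textual_types: tuple         # e.g. ("clause", "sentence")
        clause_distance: int
        sentence_distance: int
        distance_to_salient_unit: int

    example = CuePhraseInstance(
        marker=", because", usage="discourse-semantic", position="beginning",
        right_boundary=".", where_to_link="before", rhetorical_relation="cause",
        rhetorical_statuses=("nucleus", "satellite"), textual_types=("clause", "clause"),
        clause_distance=0, sentence_distance=0, distance_to_salient_unit=0,
    )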

Stede and Umbach (1998) discuss how discourse markers should be represented in a lexicon for computational purposes, both generation and analysis, although they do not describe the implementation of their proposal. They argue that discourse markers have properties of both open-class and closed-class words, such as near-synonymy and hyponymy relations, and that both should be reflected in their representation. The actual work of developing the discourse marker lexicon is oriented to German, but the discussion of the features aims to be language independent. The properties that they find relevant for computational usage are:

11In the case of polysemic discourse markers, two different entries are created.

relatedness in meaning: they claim that, like typical open-class words, discourse markers can establish meaning-based relations among themselves, like synonymy, plesionymy, antonymy, hyponymy, etc.
part of speech
position in the constituent (clause or phrase)
linear order in case of co-occurrence with other discourse markers
negation: special behaviour towards negative polarity
semantic relation they express: cause, adversative, conditional, etc.
effects on polarity: whether they affect the polarity of the units under their scope
commentability: whether they can be taken as propositional content
intention associated to the discourse marker
associated presuppositions
associated illocutions
stylistic constraints
discourse relation (different from semantic relation)

It can be seen that, even though these two approaches are both computational and, in principle, both oriented to natural language analysis, they differ significantly with respect to the features by which discourse markers are described. This difference seems to be rooted in the fact that Stede and Umbach seem to rely on a deep analysis of text, including an account of intentions and presuppositions, while Marcu is explicitly based on a syntactic analysis only. Note also that, while Stede and Umbach aim to describe types of discourse markers, Marcu accounts for tokens, paying little attention to the process of generalizing information from particular instances. Despite these differences, both seem to share a common core: the semantics of the discourse relation signalled by the discourse marker. This seems to be the most crucial aspect of such a lexicon, and we will devote the next chapter to it. Other comparable features are also related to the meaning conveyed by discourse markers. For example, the position in the constituent seems to be useful to specify the meaning of some kinds of ambiguous discourse markers. This feature is strongly related to co-occurrence with certain patterns of punctuation. Co-occurrence with other discourse markers also seems to be useful to determine specific meanings. In both approaches, discourse markers are also characterized by a set of features that specify how they should be treated algorithmically in order to obtain an adequate analysis. We believe that this kind of information does not belong in a static lexicon, but has to be integrated in the corresponding procedures. Most of the features described by Marcu (1997b) are algorithmic: they are oriented to determining the boundaries of the segment introduced by a given discourse marker (right boundary, textual spans) or the attachment point of this segment in the structure of discourse.

The two approaches mainly differ in the rest of the information that is provided. Stede and Umbach (1998) aim to provide a detailed description of the information state of the participants in the communicative act, which is useful for natural language generation but not necessarily for analysis. They describe many features that rely on a deep semantic representation of texts, like effects on polarity. They also consider contextual constraints, like style, which may be useful for an analyzer that treats a very broad range of texts. It is evident that most, if not all, of these features are well beyond the scope of shallow NLP techniques. In contrast, Marcu (1997b) characterizes discourse markers by surface features of their context of occurrence, which are treatable by shallow techniques. These features do not concern the semantics of the discourse marker, but rather its effects on the configuration of the structure of discourse. Features like distance to... contribute to determining the attachment point, and the right boundary feature determines the boundaries of the discourse segment in which a given discourse marker is embedded. We believe that this kind of information is not proper to discourse markers, but to the discursive structure and the discourse segments, respectively, and that it should be treated in the corresponding modules. In our implementation, we have treated these two kinds of information differently.
On the one hand, a discourse marker lexicon encodes all the information intrinsic to discourse markers as isolated lexical items, that is, their semantics and part of speech. On the other hand, the effects of discourse markers on the configuration of the structure of discourse or of discourse segments are exploited by specific procedures (see Appendix B and Section 5.4 for specifications of these procedures). These procedures rely on discourse markers as a source of evidence that is external to their processing core. It is true that discourse markers may significantly improve the performance of these procedures, as shown in Section 5.4 and also by Reitter (2003b), but the backbone of each procedure is purely heuristic. One last and evident conclusion that we draw from this very brief review is that discourse marker lexica are a scarce resource, although they are crucial for discourse-based NLP. One of the reasons for this scarcity may be that they are very costly to build and to port to different applications. As a direct consequence, most natural languages do not have such a resource available. In the following section we present a lexicon of discourse markers in Catalan, Spanish and English. It has been created by an economical method that significantly reduces the development effort, mainly by characterizing discourse markers by purely semantic features, leaving all algorithmic characterization to other processing modules.
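As a minimal sketch of this division of labour, an entry in such a lexicon might record nothing beyond the form of the marker, its morphosyntactic class and its value (possibly left unspecified) in each of the two dimensions of discursive meaning, leaving all procedural knowledge to the segmentation and attachment modules. The entries and field layout below are illustrative and do not reproduce the actual encoding of Appendix A.

    # Sketch of a discourse marker lexicon recording only intrinsic properties:
    # morphosyntactic class (conjunctive, adverbial or phrasal) and the value in
    # the two dimensions of discursive meaning (structural, semantic). None
    # stands for an ambiguous or underspecified dimension. All entries and the
    # values assigned to them are illustrative.
    LEXICON = {
        # form: (morphosyntactic class, structural meaning, semantic meaning)
        "because":           ("phrasal",     "continuation", "cause"),
        "so":                ("conjunctive", "continuation", "cause"),
        "however":           ("adverbial",   "continuation", "revision"),
        "that is":           ("adverbial",   "elaboration",  "equality"),
        "on the other hand": ("adverbial",   None,           "revision"),
    }

    def lookup(form):
        """Return the intrinsic description of a discourse marker, if any."""
        return LEXICON.get(form.lower())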

3.4.3.2 A trilingual lexicon of prototypical discourse markers

In this section we present the lexicon that is displayed in Appendix A. There, the prototypical discourse markers that constitute this lexicon are characterized by the semantics of the discourse relations they elicit, described in the following chapter. Interesting particularities of individual discourse markers are also discussed there. We have constituted this set of prototypical markers by selecting those discourse markers from existing computational lexica (Knott 1996; Marcu 1997b) for which a near-equivalent could be found in Catalan, Spanish and English, and which were above a certain degree

                revision  cause  equality  context  total
elaboration            4      9        10       22     41
continuation           9      9         6        4     28
underspecified         1      –        10        4     15
total                 14     18        26       32     84

Table 3.1: Distribution of the number of discourse markers across the different meanings. Some discourse markers have been assigned more than one meaning or no meaning in a dimension, because they are ambiguous or underspecified, respectively.

of grammaticalization (Section 4.2.2.2 describes how the degree of grammaticalization is obtained). These two constraints guaranteed that these were discourse markers conveying very basic discourse meanings. Moreover, we chose a set of a manageable size, in order to study each discourse marker in depth. Since we wanted the set to be representative, some discourse markers that met the above constraints were discarded if the meaning they conveyed was already sufficiently represented. All in all, the lexicon is formed by 84 discourse markers, representing different discursive meanings, as can be seen in Table 3.1.

As explained in the previous section, discourse markers are characterized only by their intrinsic properties as isolated lexical items: the meaning of the discourse relation they may convey and their part of speech. With respect to semantics, each discourse marker has been characterized by one of the possible values in each of the two dimensions of the meaning of discourse relations, as presented in Section 4.2.3. It has also been stated whether a particular discourse marker is ambiguous between two possible values in any of the dimensions, or whether it is underspecified with respect to the meaning described by a dimension. In practice, ambiguity and underspecification are treated the same: no information is provided for that dimension.

With respect to the part of speech, we only make a distinction between three main morphosyntactic classes of discourse markers: conjunctive, adverbial and phrasal, the latter grouping together prepositions and subordinating conjunctions. This classification was determined empirically, based on experiments on clustering discourse markers (Alonso et al. 2002a; Alonso et al. 2002b). In these experiments, instances of a preliminary discourse marker lexicon containing 577 Spanish discourse markers (including coordinating conjunctions and some syntactical structures, like relative clauses) were described by a set of 19 features obtained by shallow analysis of the context of occurrence of discourse markers in text. Then, these instances were clustered by their similarity according to these features. Figure 3.4 displays the organization of the features characterizing discourse markers, organized hierarchically by their discriminating power. It can be seen that the syntactical properties of discourse markers were the most powerful for clustering together homogeneous discourse markers. The main distinction was made between extra-sentential (adverbial) and intra-sentential discourse markers. The latter were subdivided into rightwards directed and bi-directional; this last group was formed by coordinating conjunctions, which are not included in our final lexicon because they are highly ambiguous. Three groups were found within rightwards directed: syntactical structures (like relative clauses),

Figure 3.4: Features characterizing discourse markers, organized hierarchically by their discriminating power in clustering.

prepositional (phrasal) and subordinating (conjunctive) discourse markers. The two main groups of discourse markers are distinguished by their distribution in the sentence: phrasal and conjunctive discourse markers occur at the beginning of the linguistic constituent they dominate (phrase, clause, or fragment), with strong restrictions as to the elements that can precede them (typically, only adverbs, as in (19)). In contrast, the occurrence of adverbial discourse markers is less constrained (20).

(19) “A picture is worth a thousand words” precisely because it is a part of a complex, usually unconscious, web of references.

(20) a. However , other prominent prisoners of conscience remain behind bars.
b. Most of the reported deaths, however , were torture in both military barracks and police stations.
c. They don’t feed when the water is cold however .

Conjunctives are the discourse markers with the strongest distributional restrictions. Just like phrasals, they always occur at the beginning of the linguistic constituent they dominate, but they cannot be preceded by any linguistic item (21-a), and they must necessarily be preceded by the constituent to which they are related (21-b).

(21) i They are held because they opposed the government’s policy towards the Shi’a community.
ii They opposed the government’s policy towards the Shi’a community so they are held.

a. i They are held precisely because they opposed the government’s policy towards the Shi’a community.
ii # They opposed the government’s policy towards the Shi’a community precisely so they are held.
b. i Because they opposed the government’s policy towards the Shi’a community, they are held.
ii # So they are held, they opposed the government’s policy towards the Shi’a community
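The distributional constraints illustrated in (19) to (21) lend themselves to a simple shallow check, sketched below: phrasal and conjunctive markers must open the constituent they dominate, phrasal markers tolerating at most a preceding adverb, while conjunctive markers additionally require preceding material to attach to. The tokenization, the adverb list and the example classes are simplifying assumptions, not the identification procedure of Section 5.5.

    # Illustrative check of the distributional constraints just described, over
    # a candidate constituent (the span the marker is taken to dominate) and the
    # material preceding it in the sentence. The adverb list, tokenization and
    # example classes are simplifying assumptions.
    ADVERBS = {"precisely", "only", "just", "even"}

    def distribution_ok(before_constituent, constituent, marker, marker_class):
        """All arguments except `marker_class` are lists of tokens; `marker`
        is expected to open `constituent` (possibly after an adverb)."""
        idx = constituent.index(marker)
        preceding_in_constituent = [t.lower() for t in constituent[:idx]]
        if marker_class == "conjunctive":
            # nothing may precede the marker inside its constituent, and the
            # related constituent must already have appeared
            return not preceding_in_constituent and bool(before_constituent)
        if marker_class == "phrasal":
            # at most an adverb may precede the marker inside its constituent
            return all(t in ADVERBS for t in preceding_in_constituent)
        return True  # adverbial markers are not positionally constrained here

    # (21-a-i): "precisely because ..." is fine for a phrasal marker,
    # (21-a-ii): "precisely so ..." is out for a conjunctive one.
    assert distribution_ok("They are held".split(),
                           "precisely because they opposed the policy".split(),
                           "because", "phrasal")
    assert not distribution_ok("They opposed the policy".split(),
                               "precisely so they are held".split(),
                               "so", "conjunctive")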

Adverbial and phrasal discourse markers do not differ only with respect to their distribution; it has often been argued that they also differ with respect to their semantic properties. More concretely, Cresswell et al. (2002) and Webber et al. (2003), among others, have claimed that adverbial discourse markers have anaphoric properties. However, as we have already explained, anaphoric properties cannot be properly exploited by shallow techniques, and we therefore disregard these properties in our account. The lexicon presented here is a resource that can be used in many NLP applications. It is an important module of the discourse segmenter and interpreter presented in Chapter 5, which have been exploited for text summarization in Alonso et al. (2004b). The information it contains is immediately usable by shallow approaches, but it is also useful for deeper approaches. Since this lexicon is a crucial resource for these applications, any improvement in the resource will likely improve the performance of the applications. In Section 5.4 we assess the improvement in recall that can be obtained in the automatic identification of discourse markers by enhancing the starting lexicon, adding those discourse markers that can be found in the text. Other improvements that could be of much use for computational applications include:

– associating discourse markers with an index of reliability of their discursive (vs. sentential) function, which would make it possible to increase the accuracy in the disambiguation of their actual function in a given text.
– associating discourse markers with different heuristics to disambiguate their semantics.
– determining the semantic and functional value of co-occurrences of different kinds of discourse markers, and, in general, a more extensive encoding of the interaction between the context and the semantics of discourse markers (as in Marcu 1997b).

In sum, in this section we have presented a delimitation of the concept of discourse marker that is suited to our representation purposes and the capabilities of our framework. We have discussed how the concept has been treated in other computational approaches and have proposed a working definition of the concept that has been applied to a set of prototypical discourse markers.

We have argued that for computational applications, discourse markers are best represented in an autonomous data structure that can be used as a module by different applications, that is, a lexicon. We have argued that, in a computational lexicon, discourse markers should be characterized by intrinsic properties (their semantics and morphosyntactic class), leaving procedural aspects for the various algorithms that will exploit them. Following this criterion, we have created a lexicon of prototypical discourse markers, parallel in three languages: Catalan, Spanish and English. This lexicon is the main source of discursive knowledge for our discourse analysis implementations (Sections 2.4.2.1, 2.4.2.2 and 5.4). It has also been the basis of a bootstrapping approach to create a bigger lexicon of discourse markers applying lexical acquisition techniques (Section 5.5).
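To make the shape of such a lexicon entry concrete, the following is a minimal sketch, assuming Python as the representation language; the field names, the class name and the example values are illustrative assumptions, not the actual format of the lexicon described above.

```python
# Minimal sketch of a discourse marker lexicon entry (hypothetical format).
# The dimension values are meant to correspond to those of Section 4.2.3;
# None stands for an ambiguous or underspecified dimension, since both are
# treated alike (no information is provided for that dimension).
from dataclasses import dataclass
from typing import Optional

POS_CLASSES = {"conjunctive", "adverbial", "phrasal"}

@dataclass
class MarkerEntry:
    form: str                  # surface form, e.g. "sin embargo"
    language: str              # "ca", "es" or "en"
    pos: str                   # one of POS_CLASSES
    structural: Optional[str]  # e.g. "continuation" or "elaboration", or None
    semantic: Optional[str]    # e.g. "cause", "revision", ..., or None

# Two illustrative entries; the values are assumptions, not the published lexicon.
LEXICON = [
    MarkerEntry("perquè", "ca", "conjunctive", None, "cause"),
    MarkerEntry("sin embargo", "es", "adverbial", "continuation", "revision"),
]
```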

3.5 Discussion

In this chapter we have specified a computational representation of discourse tailored to the capabilities of shallow NLP techniques and the needs of automatic text summarization. We have discussed the limitations of our targeted representation of discourse, concluding that we cannot achieve the encompassing, homogeneous structures proposed in foundational works about computational discourse analysis. We have argued that, given our NLP capabilities, representing discourse as a sequence of local structures is reliable and makes it possible to capture basic discourse relations on which judges show an important degree of agreement when summarizing texts (see Section 5.2 for details on the empirical support for this claim). We have also proposed to represent different kinds of meaning independently, in order to implement underspecification. These proposals have been formalized in a multidimensional representation of discourse, where each dimension is captured by a strict tree-like structure. This representation seems to be able to capture some configurations of discourse relations that could only be represented by crossing branches in simple tree-like representations of discourse. After formalizing the discourse structure to implement our representation, we have distinguished two kinds of discourse units: those that provide mainly propositional content (discourse segments) and those that provide procedural content about the boundaries of discourse segments and/or about the relations holding between them (discourse markers). We have discussed theoretical and applied work concerning these two concepts, and have proposed working definitions for both. In the case of discourse markers, we have also developed a resource, a computational lexicon of prototypical discourse markers, parallel in Catalan, Spanish and English, where we have implemented the definition of the concept. However formal it may or may not be, what we have proposed in this chapter is no more than a convenient stipulation to achieve the targeted representation of discourse based on our NLP capabilities. In Chapter 5 we attempt to support these theoretical claims with two kinds of empirical support: the interpretation of naive judges about the structure of discourse and the automatic analysis of some aspects of the proposed representation of discourse.

CHAPTER 4 Meaning in discourse relations

What can be said at all can be said clearly, and what we cannot talk about we must pass over in silence

Ludwig Wittgenstein, introduction to the Tractatus Logico-Philosophicus (1922)

In the previous chapter we have specified a computational structure to represent relevance and coherence relations by shallow NLP. In this chapter we describe the meaning that is conveyed by the discourse relations between segments, focusing on relations that can be obtained by shallow NLP. We claim that, despite the limitations imposed by shallow NLP, this representation is descriptively adequate, theoretically sound and empirically motivated, because it relies on linguistic phenomena that can be formally recognized and systematized. First, in Section 4.1, we describe the shallow textual clues that we can exploit to identify discourse relations and their semantics, treatable by shallow NLP and useful for text summarization. These clues are discourse markers, partial syntactical structures and punctuation; a set of phenomena that are immediately beyond clausal scope, whose meaning and scope can be reliably determined by shallow techniques, and which elicit coherence and relevance relations between discourse units. The scope and meaning of the discourse relations that we will work with is delimited by way of these clues. We discuss their reliability to obtain a representation of discourse, and we find that, with shallow NLP techniques, these clues only provide reliable information about the local configuration of discourse, that is, about intra-sentential and inter-sentential relations. Therefore, we will represent the discursive organization of a text as a concatenation of such local structures. Concerning meaning, we argue that the meaning of discourse relations is best described compositionally. In Section 4.2.1 we discuss how a compositional approach allows us to deal with the ambiguity of these clues via underspecification. Then, we describe our method to determine an inventory of basic discourse meanings in Section 4.2.2. This method takes advantage of our compositional approach to determine the inventory of meanings combining a priori representation needs and empirical data.


For our purposes (identifying relevance and coherence relations) two main kinds (or dimensions) of discursive meaning have been distinguished a priori: structural and semantic. The meaning of these dimensions has been discretized into two and five distinct discursive meanings, respectively, motivated by the evidence provided by discourse markers and structural economy. In Section 4.2.3 we describe each of these meanings, relating them with discursive meanings that have been widely used in previous work.

4.1 Linguistic phenomena with discursive meaning

In this section we describe the linguistic phenomena that we will exploit to characterize a representation of discourse that is useful for text summarization. There are quite a number of linguistic phenomena that contribute to configure the discursive level of texts. Some of the linguistic phenomena that are known to yield effects at discursive levels are: some syntactic structures (Scott and de Souza 1990), the patterns of anaphoric expressions (Grosz and Sidner 1986), all kinds of co-reference (including those that require complex reasoning processes, like bridging anaphora), patterns of tense and aspect (Lascarides and Asher 1993), information structure, lexical relations based on semantic fields (synonymy, antonymy, part-of) (Morris and Hirst 1991), the organization of genres (narratives, argumentation, technical discourse) (Teufel and Moens 1998), etc. This list is non-exhaustive; the configuration of the discursive level of texts is still far from well known. What is more, many of the discursive phenomena that are known are described intuitively, and of those phenomena whose effects have been systematized, only some can be exploited with shallow NLP techniques with a certain degree of reliability. And even more: discourse relations can also hold without the presence of any of the above-mentioned phenomena. In example (1) a reader can identify a causal relation between the two sentences in all the cases, irrespective of the textual clues that mark it, if any. In (1-d) we cannot identify any textual clue that provides evidence of the relation, but still a causal interpretation is possible. In this case, a reader can establish the causal relation between the two sentences if the adequate context is provided, relying on world knowledge and reasoning abilities1. For our purposes, we will only study those relations that are realized by textual clues that are formally identifiable and systematizable, as is the case of (1-a) to (1-c), and we will restrict ourselves to those that can be treated by shallow NLP.

(1) a. Eva can speak Dutch because she has been studying it for 3 years.
b. Eva can speak Dutch. She has been studying it for 3 years.
c. Eva can speak Dutch. Eva has been studying Dutch for 3 years.
d. Eva can speak Dutch. Eva studies Dutch.

Some discursive phenomena, like reported speech or discourse markers, have been systematized for computational use by different approaches, but there is no consensus as to how they are best characterized.

1These abilities allow us to infer that if one can speak a language and one studies it, the language is not the native language of the person and the person can speak it because she has been studying it.

Most of these linguistic phenomena are beyond the scope of shallow NLP, at least for Catalan and Spanish: for example, a good account of information structure requires full parsing and a representation of the functional layers of clauses, together with word order; none of these is yet satisfactorily solved for Spanish or Catalan. As for the thematic structure of texts, that is, their organization in topics, subtopics and comments on these topics, various recent approaches have successfully represented it using shallow approaches, but with the underlying assumption that the meaning of words is not ambiguous. However, words are indeed ambiguous, and disambiguating them is a problem still to be solved, currently far beyond the capabilities of shallow NLP. Therefore, we have to consider that approaches like the structuring of text in topics or by lexical chains are essentially not within the capabilities of shallow NLP. In contrast, there are some linguistic phenomena whose meaning and scope can be reliably determined by current shallow techniques, and which elicit coherence and relevance relations between discourse units. These are linguistic phenomena immediately beyond clausal scope: discourse markers, partial syntactical structures and punctuation. We will describe their properties for automated discourse analysis in what follows.

4.1.1 Shallow linguistic phenomena with discursive meaning

The set of discursive phenomena that are treatable by shallow NLP techniques has changed with the evolution of the field. For example, significant improvements have been achieved in recognizing the thematic development of texts (Kozima 1993; Hearst 1994). Of all linguistic phenomena with discursive effects, discourse markers are especially useful to obtain the kind of representation we are aiming for. On the one hand, they convey rich discursive meanings but are treatable by very simple techniques, like pattern-matching. On the other hand, the kind of meanings they convey are very useful for text summarization, because they serve to identify coherence and relevance relations in text. As a disadvantage, discourse markers are very ambiguous with respect to their function or the meaning they convey. The following examples illustrate this. In (2), so can be easily identified by pattern-matching techniques, but it is ambiguous with respect to its function: in (2-a) it has a sentential function, modifying an adjective, while in (2-b) it has a discursive function, linking two clauses with a causal relation. In contrast, in (3), the discourse marker but is unambiguously performing a discursive function, but it is ambiguous with respect to the relation it signals: in its first occurrence it signals a contrast, while in the second it serves to introduce a new topic in the text. The ambiguity of por eso (for that reason) in (4) is of yet another kind: in (4-a), por eso signals an intra-sentential relation, while in (4-b) it relates the clause where it is found and the first clause in the text.

(2) a. Nowadays, impartiality need not be expressed quite so crudely.
b. [He] was then transferred to Marrakech where his family lives, and so could visit him regularly and provide the food necessary for his diabetic diet.

(3) It will be seen, therefore, that whatever the state of the pound Mrs Thatcher is in no danger from the constituencies. But the menace of the IRA is everywhere. Mrs Thatcher’s car is armour-plated, the platform party leaves in a tank-like bus, the town crawls with policemen, and all this vigilance costs 1.1m pound. The ratepayers are sick of paying 49 of this bill. But let me get to the various leaders’ speeches.

(4) a. En España nadie escucha a nadie, por eso se produce ese inmenso griterío.
b. El ciudadano vasco Jesús María Pedrosa Urquiza jamás podrá inscribirse en el manicomial censo electoral de libre adhesión que Euskal Herritarrok propugna crear en la futura Euskal Herria independiente. Pese a sus apellidos, Pedrosa Urquiza no debía de ser un ”buen vasco” y seguro que no combatía los matrimonios con ”gentes de extrañas razas”, que es como los sabinianos nacionalistas definían a los emigrantes, a quienes ya hace cien años propugnaban ”aislar y tratarles como a extranjeros, hasta en la amistad y el trato” en cuanto Euskadi lograra la independencia. Pero hete aquí que Pedrosa Urquiza iba a convertirse dentro de un mes en consuegro del también ciudadano vasco Juan María Bizkarraga, que no oculta su condición de simpatizante del PNV. Es más, el ciudadano Urquiza era un viejo afiliado de ELA-STV, el sindicato autóctono que el primigenio delirio sabiniano pretendía convertir en una organización que aglutinara exclusivamente a los ”obreros euskerianos excluyendo a los maketos”. No satisfecho con todo ello, el ciudadano vasco Urquiza desoía a su partido y cada día se tomaba sus potes y vermuts en el ”batzoki” del PNV de Durango. Tamaño ejercicio de ciudadanía no podía quedar impune. Quizás por eso le han matado; porque el ciudadano Urquiza era además del PP.

Considering that discourse markers are the most informative clues at a shallow level, we have to tackle these three kinds of ambiguity to exploit their discursive information. In the rest of this section we will deal with the ambiguity with respect to their discursive or sentential function. In Section 4.1.2 we deal with the reliability of shallow phenomena to obtain a representation of discourse, including the question of the ambiguous scope of discourse markers. The ambiguity with respect to the kind of meaning conveyed by discourse markers is handled in Section 4.2. The ambiguity of discourse markers with respect to their discursive or sentential function has been addressed by manually created classification algorithms (Hirschberg and Litman 1993) and by machine learning techniques (Litman 1996). Studies on discourse segmentation (Passonneau and Litman 1997b) also try to determine when a certain clue is actually signalling a discursive boundary. Different kinds of evidence have been exploited to determine the sentential or discursive usage of these clues, with a special focus on prosody (in oral texts) or punctuation (in written texts). In our approach, we will work with written text, where punctuation is a major source of evidence. Punctuation can perform a discursive function by itself, mostly to determine segment boundaries, but it is highly ambiguous in that function. In contrast, it is very useful to disambiguate the discursive or sentential function of discourse markers, as will be discussed in Section 5.4.
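As a rough illustration of how punctuation can support this kind of disambiguation, the sketch below applies a simple rule of thumb (a marker occurrence counts as discursive if it is sentence-initial or adjacent to punctuation); this heuristic is an assumption made for illustration, not the procedure of Section 5.4.

```python
# Illustrative heuristic (not the actual Section 5.4 procedure): an occurrence
# of a candidate discourse marker is taken as discursive if it starts the
# sentence or is immediately preceded or followed by punctuation.
PUNCT = {".", ",", ";", ":", "(", ")"}

def likely_discursive(tokens, i):
    before = tokens[i - 1] if i > 0 else None
    after = tokens[i + 1] if i + 1 < len(tokens) else None
    return before is None or before in PUNCT or after in PUNCT

tokens = "Most of the reported deaths , however , were ...".split()
print(likely_discursive(tokens, tokens.index("however")))  # True: set off by commas
```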

Moreover, punctuation and partial syntactic structures (chunks) can be sources of evidence for discourse structure also by themselves, as in example (5), where the span a 43-year-old area sugar company manager can be identified as a segment because it matches the pattern beginning-sentence noun-phrase comma noun-phrase comma verbal-phrase. The kinds of syntactical information that we exploit are untensed verbs in absolute constructions, subordinating structures (relatives, subordinating conjunctions) and some patterns of chunks and punctuation (such as beginning-sentence noun-phrase comma noun-phrase comma verbal-phrase).

(5) Thoza Khonje, a 43-year-old area sugar company manager, was arrested on 28 February 1989.
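To make this kind of pattern concrete, the following is a minimal sketch of matching the beginning-sentence noun-phrase comma noun-phrase comma verbal-phrase pattern over a sequence of chunk labels; the tag names and the function are illustrative assumptions, not the segmenter described in Chapter 5.

```python
# Minimal sketch: match the sentence-initial pattern NP , NP , VP over a flat
# sequence of chunk/punctuation tags produced by a chunker. Tag names are
# illustrative assumptions.
from typing import List, Optional, Tuple

PATTERN = ["NP", "COMMA", "NP", "COMMA", "VP"]

def match_appositive_segment(tags: List[str]) -> Optional[Tuple[int, int]]:
    """Return the chunk indices of the second NP (the candidate segment) if the
    sentence starts with NP , NP , VP; otherwise return None."""
    if tags[: len(PATTERN)] == PATTERN:
        return (2, 3)  # the material between the two commas
    return None

# Chunk tags roughly corresponding to (5):
# "Thoza Khonje , a 43-year-old area sugar company manager , was arrested ..."
tags = ["NP", "COMMA", "NP", "COMMA", "VP", "PP"]
print(match_appositive_segment(tags))  # -> (2, 3)
```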

Interestingly, discourse mechanisms can interact with propositional structure, contributing to enrich an underspecified representation of propositional content: ambiguous attachment of constituents (VP, NP or clause), argument-adjunct distinctions (light verbs, time and space, manner, cause), etc. Therefore, the propositional (semantic and syntactic) interpretation of the sentence can be left underspecified, which is precisely the state of the art of Spanish and Catalan NLP, as well as of other languages with fewer resources. Moreover, it is clear that some properties of the propositional representation of clauses have to be solved at discourse level. This is the case for the reference of pronouns and definite descriptions, but it can also be the case for some kinds of adjunct-argument distinctions, for the optimal realization of arguments, depending on the constraints imposed by each language. Other kinds of linguistic phenomena that have also been found useful to obtain a representation of discourse are the so-called lexical chains. Lexical chains try to capture cohesion relations, assuming that they are represented by semantically related items in a text. Indeed, in Alonso and Fuentes (2002) and Alonso and Fuentes (2003) we showed that automatic summaries can be improved if they are based on a shallow approach to the representation of discourse that combines lexical chains with the above mentioned evidence (chunks, punctuation and discourse markers), in contrast with summaries obtained by either of these two approaches alone. In this thesis we will not integrate the kind of information provided by lexical chains to configure a representation of discourse, because they capture aspects of texts that are qualitatively different from the local structures that can be obtained with the rest of the evidence presented so far. Moreover, as explained above, even if lexical chains can be obtained by shallow techniques, their underlying assumptions require a knowledge-rich analysis of texts. Again, this is not the case for the evidence presented so far: usually, the kind of information provided by discourse markers or syntactical structures is self-contained, and richer or finer-grained representations are obtained compositionally from the interaction of the information provided by these devices with other kinds of linguistic information, regardless of the techniques used to obtain it. To sum up, the linguistic phenomena that we are going to exploit to obtain our targeted representation of discourse are partial syntactical structures, punctuation, and, above all, discourse markers, because these are the most adequate to our

NLP capabilities and they provide information on the coherence and relevance relations between discourse units. Moreover, we feel that the account of these phenomena that we are providing now will be compatible with enriched representations of discourse, obtained with more powerful NLP tools (for other languages, or in future developments). These phenomena may provide information about the structure of discourse at very different levels, but this information is only reliable at a local level, as will be discussed in the following section.

4.1.2 Reliability of shallow cues to obtain discourse structures

As illustrated by example (4), the shallow linguistic phenomena described above may provide information about the organization of discourse at different levels. This can be considered as an ambiguity in the scope of discourse markers. If discourse markers are considered as argument-taking operators (Forbes et al. 2002), some like however, for this reason or in contrast (the so-called adverbials by Forbes et al.) are ambiguous with respect to the syntactical and/or semantic type of arguments they may take. This ambiguity in scope is harder to solve for long-distance discourse relations than for local relations. In human annotation of discursive corpora, the disagreement among judges increases for long-distance relations, as reported by Forbes et al. (2002). We find comparable results in Section 5.3. It has also been argued (Knott et al. 2001) that the kind of relations that hold at the local level is qualitatively different from those at higher discursive levels. Therefore, we will only work with information about the local configurations of discourse, that is, about intra-sentential and inter-sentential relations. We will not obtain a single structure that covers all the text, but a set of unconnected structures that represent different spans of text. As we have just said, this representation seems to be descriptively adequate, and it is reliable enough to support robust NLP applications. To increase the reliability of the information we obtain from shallow cues while minimizing information loss, we distinguish different kinds of meanings that may be conveyed by the cues we exploit, as explained in the following section.

4.2 Discursive meaning inferrable from shallow linguistic phenomena

In this section we inspect the kind of discursive meaning that can be obtained by exploiting the shallow linguistic phenomena described in the previous section, focusing on their utility to identify relevance and coherence relations and on their reliability when treated by shallow NLP techniques. As explained above, shallow clues are highly ambiguous with respect to their meaning. In addition to the kind of discursive meaning that they convey, we want to determine which parts of this meaning can be obtained with shallow NLP techniques with a reliable degree of certainty. We attempt to increase the reliability of the meaning obtained from these clues in two main ways: we delimit the kind of meaning to meet the restrictions of shallow NLP, and we distinguish heterogeneous kinds of meaning. The advantages of distinguishing heterogeneous kinds of meaning are explained in Section 4.2.1. Partitioning meaning implies partitioning ambiguity, so that the ambiguity of a certain discursive clue can be reduced to only part of its meaning, while the rest of the meaning can be asserted with a reliable degree of certainty. Describing discursive phenomena and discourse relations as a conglomerate of distinct meanings allows us to capture certain phenomena that have posed problems for the description of discourse relations. In general, this compositional description is more transparent and flexible than other approaches. In Section 4.2.2 we propose a methodology to determine an inventory of discourse meanings that is adequate to describe coherence and relevance relations by shallow NLP. This is a basic question for all theories of discourse, regardless of how discursive meaning is described (compositionally or by atomic labels). In many cases, this inventory is totally shaped by application needs, as in most natural language generation systems. In other cases, researchers have tried to come up with a methodology to induce a set of relations from linguistic or psychological data, in order to avoid ad-hoc stipulations. We take advantage of our compositional approach to the semantics of discourse to combine the advantages of these two approaches: we determine the kind of meaning that is of interest to us a priori, and then we motivate the discretizations of these meanings based on empirical data. We discuss what qualifies as empirical evidence, and we come to the conclusion that basic discursive meanings can be inferred from highly grammaticalized discourse markers and their cross-linguistic patterns of meaning, as captured by semantic maps. Then, in Section 4.2.3 we apply this methodology to determine the inventory of discursive meanings to describe the semantics of discourse phenomena and, consequently, of the discourse relations by which we will represent discourse structure. First, we identify the two main kinds of discursive meaning that are useful for our representation purposes (identifying relevance and coherence relations in text): the structural and semantic dimensions of meaning. Then, we make further distinctions within these two dimensions, obtaining an inventory of meanings that are finally used as features for the compositional description of discourse relations.

4.2.1 Advantages of compositional discourse semantics

In this section we argue that describing discourse relations as a conglomerate of distinct discursive meanings is more economic and transparent than the traditional atomic labels, while expressivity is maintained or even increased. Moreover, distinguishing heterogeneous meanings can contribute to increasing the reliability of the meaning obtained from shallow textual clues, because ambiguity can be reduced to only one part of their semantics. A very common approach to describing the semantics of discourse relations consists in defining a set of atomic labels and associating each relation with one of them (Hobbs 1985; Mann and Thompson 1988). It has often been noted (Hovy and Maier 1995; Knott 1996)

that such a relation-based approach is problematic. The most common problems are that the inventory of relations is very large, that it is unsystematic and that it fails to provide a satisfactory description of some discourse relations. In contrast, describing discursive phenomena and discourse relations as a conglomerate of distinct meanings is more transparent, economic and flexible than relation-based approaches. Moreover, a compositional description of discourse relations makes it possible to capture certain phenomena that have posed problems for the description of discourse relations, as we explain in what follows.

4.2.1.1 Descriptive adequacy of a compositional description

There are two main reasons why a compositional description of discourse relations is more adequate than associating each relation with an atomic label. First, some discourse relations cannot be properly described by atomic labels. Second, the expressivity of a compositional description of discourse relations is equal, if not superior, to that of its equivalents in terms of atomic relations. Concerning the first aspect, it has often been argued that a multi-dimensional analysis is necessary to account properly for the structure of discourse. With the classical example in (6), Moore and Pollack (1992) argue that segments can establish two of the relations proposed by RST (Mann and Thompson 1988) at the same time, namely an Evidence relation and a Volitional Cause relation.

(6) a. George Bush supports big business.
b. He’s sure to veto House bill 1711. (Moore and Pollack 1992, pp. 539-540)

The kind of phenomena shown in the example of Figure 3.1 has also posed problems for representation by atomic labels in a tree-like representation of discourse, because, in some cases, the relations signalled by different discourse markers have to be expressed as crossing branches in a tree-like structure. As discussed in Section 3.2, some researchers have proposed that some relations are not tree-like, but anaphoric. Instead, we propose that a distinction between heterogeneous kinds of meaning in co-occurring dimensions of meaning also provides a satisfactory description of such phenomena. Concerning the expressivity of compositional descriptions, for any description of discourse relations that is based on atomic labels, a trivial compositional equivalent can be found, as exemplified in Table 4.1. Therefore, the expressivity of a compositional approach is at least equal to that of an approach based on atomic labels; in both cases expressivity is crucially determined by the inventory of meanings whereby relations are described. However, if the transformation of atomic labels is non-trivial, the expressivity of the description formalism can be increased without increasing the size of the inventory of meanings that are used to describe discourse relations. An example can be seen in Table 4.2, where a set of 8 basic meanings organized in 3 dimensions makes it possible to distinguish 12 relations. A more detailed description of this proposal to describe the semantics of discourse relations can be found in Alonso et al. (2003d). The organization of meanings in dimensions restricts

atomic label    consequence   purpose   reason
consequence     yes           –         –
purpose         –             yes       –
reason          –             –         yes

Table 4.1: Trivial transformation of the semantics of some atomic labels to a compositional description.

atomic label      dimension 1   dimension 2    dimension 3
consequence       cause         continuation   symmetric
purpose           cause         continuation   asymmetric
reason            cause         elaboration    asymmetric
sequence          parallel      continuation   symmetric
summary           parallel      continuation   asymmetric
example           parallel      elaboration    asymmetric
contrast          revision      continuation   symmetric
counterargument   revision      continuation   asymmetric
concession        revision      elaboration    asymmetric
narration         context       continuation   symmetric
conclusion        context       continuation   asymmetric
circumstance      context       elaboration    asymmetric

Table 4.2: Description of the semantics of atomic labels for discourse relations as a conglomerate of more basic discursive meanings, organized in three dimensions to restrict possible combinations.

their possible combinations, so that meanings belonging to the same dimension are never combined, as in Halliday and Hasan (1976). Even if the combinations of meanings are restricted (by dimensions or any other means), the increase in expressivity can provide more configurations of meaning than are really needed to account for naturally occurring phenomena. For example, in Table 4.2 not all possible combinations of meanings occur: there is no relation characterized by elaboration and symmetric at the same time. This excessive expressivity can affect the performance of natural language generation systems or it can impose an unnecessary burden on analysis. As will be explained in the next section, the analytical perspective provided by a compositional approach can be of much help to design an economic system of discursive meaning, where most of the possible configurations are exploited.
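To make the compositional encoding of Table 4.2 concrete, the sketch below stores each atomic label as a triple of basic meanings and shows how a partial, underspecified description subsumes several atomic labels; the dictionary format and the function name are illustrative assumptions.

```python
# Sketch of the compositional encoding of Table 4.2: each atomic label is a
# triple of basic meanings, one per dimension. None marks an underspecified
# dimension, so a partial description subsumes several atomic labels.
LABELS = {
    "consequence":     ("cause",    "continuation", "symmetric"),
    "purpose":         ("cause",    "continuation", "asymmetric"),
    "reason":          ("cause",    "elaboration",  "asymmetric"),
    "sequence":        ("parallel", "continuation", "symmetric"),
    "summary":         ("parallel", "continuation", "asymmetric"),
    "example":         ("parallel", "elaboration",  "asymmetric"),
    "contrast":        ("revision", "continuation", "symmetric"),
    "counterargument": ("revision", "continuation", "asymmetric"),
    "concession":      ("revision", "elaboration",  "asymmetric"),
    "narration":       ("context",  "continuation", "symmetric"),
    "conclusion":      ("context",  "continuation", "asymmetric"),
    "circumstance":    ("context",  "elaboration",  "asymmetric"),
}

def subsumed(partial):
    """Atomic labels compatible with a (possibly underspecified) description."""
    return [label for label, feats in LABELS.items()
            if all(p is None or p == f for p, f in zip(partial, feats))]

# A clue known only to be causal and asymmetric leaves two labels open:
print(subsumed(("cause", None, "asymmetric")))  # ['purpose', 'reason']
```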

4.2.1.2 Economy, transparency and flexibility in compositional descriptions

To begin with, a compositional approach to describe discourse relations is more economic than a relation-based one, because the inventory of meanings to describe the semantics

of the same set of discourse relations is smaller, as can be seen in Table 4.2. Moreover, a compositional approach allows us to consider discursive meaning from a structuralist point of view. As in any linguistic system, discourse meaning can be represented as a structure organized by analogy, where few (if any) holes can be found. This implies that all possible combinations of the distinct meanings should be possible and equally used in the language, or at least comparably used. For example, if we find that a certain meaning is only combined with a subset of all the available labels in the proposed descriptive system, we can suspect that this kind of meaning is not of the same kind as the rest. Applying this kind of analysis allows us to detect inconsistencies in a proposed configuration of meanings. In Figures 4.1 and 4.2 we can see the structural representations of different configurations of the meanings used in Table 4.2. Each configuration highlights different relations between meanings. When we focus on the meanings cause, parallel, revision and context (Figure 4.1), we can see that there are some holes in the system, because some combinations of features never occur. When we focus on the continuation – elaboration and symmetric – asymmetric meanings, we can find some systematicity to these holes: the meanings of symmetric and elaboration almost never co-occur. This indicates a clear lack of economy in the system, which should be avoided. Another interesting property of a compositional approach is the fact that the semantics of discourse relations is more transparent than in relation-based approaches. In both cases one has to resort to an inventory of pre-defined meanings to describe semantics. The main difference lies in the fact that, in a compositional approach, meanings tend to be more basic, more primitive than in relation-based approaches. They can be intuitively understood, and they are less prone to controversy than relations. Furthermore, the fact that the semantics of discourse relations is partitioned into different meanings allows partial, underspecified descriptions, which are more feasible for shallow NLP techniques. We discuss this in what follows.

4.2.1.3 Partitioning meaning to partition ambiguity

We can consider that a compositional approach to describe discourse relations partitions the meaning conveyed by these relations. Partitioning meaning implies partitioning ambiguity, so that the ambiguity of a certain discursive clue can be reduced to only part of its meaning, while the rest of the meaning can be asserted with a reliable degree of certainty. We distinguish two kinds of ambiguity in the semantics of discourse relations: that which results from the inability of (shallow) NLP techniques to disambiguate the discourse semantics of textual clues (as in example (7)), and the inherent ambiguity of some discourse relations, which can very often present a vague, underspecified meaning (as in example (8)). In example (7), the discourse marker perquè in Catalan is ambiguous between reason (because, in case) and purpose readings (so that). These two readings can sometimes be distinguished by the mood of the verb dominated by perquè: in purpose readings, the dominated verb must be subjunctive, while in reason readings it tends to be indicative.

Figure 4.1: Structural representation of configurations of the meanings in Table 4.2, focusing on cause, parallel, revision and context.

Figure 4.2: Structural representation of configurations of the meanings in Table 4.2, focusing on the continuation – elaboration and symmetric – asymmetric meanings.

However, the semantics of the verbs in the main and subordinate clauses can override this constraint, as in (7-b), where the verb declarin is in the subjunctive but the reading is not a purpose reading. The same pattern of moods, tenses and aspects of (7-b) can be seen in (7-c), but this example has a purpose reading, which we can infer from the relation between the predicates demonstrate and declare impune.

(7) Avui sento por perquè han declarat impunes tots els caps d’Estat.

a. Avui sento por perquè han declarat impunes tots els caps d’Estat.
Today I feel frightened because all heads of State have been declared impune.
b. Avui sento por perquè declarin impunes tots els caps d’Estat.
Today I feel frightened in case all heads of State are declared impune.
c. Avui em manifesto perquè declarin impunes tots els caps d’Estat.
Today I demonstrate so that all heads of State are declared impune.

Fully disambiguating the discursive meaning of perquè is beyond the current capabilities of NLP for Catalan. Within a relation-based approach, the only way to describe the semantics of this clue would be to associate it with a set of pre-defined labels; in this case, the relevant candidates would be reason and purpose. If we want to achieve a reliable degree of certainty in the resulting analysis, we will not be able to associate any of these labels with the clue, and we will not obtain any information from it. In contrast, if we are able to distinguish different meanings conveyed by this discourse marker, we may be able to determine some of the meanings with a reliable degree of certainty, and leave only some of them underspecified. If we follow the compositional meanings described in Table 4.2, we will be able to associate perquè with a cause and asymmetric meaning, while we will not determine its meaning in the continuation – elaboration dimension. Besides, discourse relations present different densities of meaning. In many cases, their semantics cannot be described with all the richness allowed by the descriptive machinery, simply because they do not convey such rich meaning. For example, it is not clear whether the discourse relation between the two sentences in (8) is a causal, concessive, sequential or any other relation. A speaker could possibly interpret it as any of these relations, but it is also arguable that it can be interpreted as a very vague relation, which could be described with the meanings of symmetric and continuation of Table 4.2, with no specification of the cause – parallel – revision – context meaning.

(8) The weather was nice. I went to the gym.

Relation-based approaches can handle these cases of ambiguity in a very simple way: by adding to the set of relations the label that expresses the underspecified meaning (asymmetric cause, contiguity). We see two main problems in this approach: that meanings can be added to the inventory in an unprincipled way and that there is no explicit relation between closely related meanings. We will address methodological issues to determine a set of discursive meanings in the next section. Concerning related meanings, a compositional approach clearly elicits semantic relations, since related meanings share one or more of the

meanings that compose their semantics. The fact that the semantics of discourse relations is explicitly related makes it possible to deal with ordered types of relations. As follows from what has been exposed, the expression of underspecified meanings comes naturally in a compositional approach. This is very convenient for applied purposes, as it allows different granularities in the analysis, providing higher compatibility and portability of the resources based on such an approach to other frameworks. In this section we have argued that describing discourse relations as a conglomerate of distinct discursive meanings is more economic and transparent than the traditional atomic labels, and that it captures certain discursive relations that escape well-established, relation-based approaches. We have discussed the need to distinguish different dimensions of discursive meaning, wherein submeanings can be distinguished. We have also explained how partitioning the semantics of discourse relations can contribute to increasing the reliability of the meaning obtained from shallow textual clues, because ambiguity can be reduced to only one part of their semantics. For these reasons, we believe that a compositional approach to the semantics of discourse relations is especially adequate for our framework. Regardless of the approach, a basic question remains: how many and which are the meanings that we will use to describe discourse relations? We want to determine the inventory of meanings systematically, avoiding ad-hoc solutions and extensive use of stipulation. Ideally, we would like to infer such an inventory from empirical data, from psychological experiments (Sanders, Spooren and Noordman 1992), from introspection (Knott 1996) or from any other kind of linguistic evidence. We would like to systematize these meanings in an elegant, economic system, without any holes. And we would expect this inventory to provide satisfactory descriptions for the full range of discourse relations that we want to describe. Unfortunately, this ideal system is far from what we can achieve in real systems. We will have to sacrifice descriptive adequacy to the capabilities of shallow NLP techniques, and then find a compromise between the structural elegance of the descriptive machinery and the reduced descriptive adequacy. In the next section, we address the methodological question of how to identify, characterize and systematize these relations.

4.2.2 Determining an inventory of discursive meanings

Once we have decided that we will describe the semantics of discourse relations as a conglomerate of meanings, we have to address the question: which inventory of meanings do we use to describe relations? The inventory of discursive meanings is a basic question for all theories of discourse relations. In many cases, this inventory is totally shaped by application needs, as in most natural language generation systems. In other cases, researchers have tried to come up with a methodology to induce a set of relations from linguistic or psychological data, in order to avoid ad hoc stipulations.

Hovy and Maier (1995) synthesize various inventories of meanings to describe discourse relations, and note that the adequate number of meanings is an unsolved question, with proposals ranging from two basic relations (Grosz and Sidner 1986) to more than 100 (Oates 2001; Carlson, Marcu and Okurowski 2003). For our purposes, an overly coarse analysis would fall short of assessing the targeted relevance and coherence relations between discourse entities or segments. However, an overly detailed analysis would produce undesired ambiguity. The question of how many meanings should be distinguished can be considered as a side-effect of the question of which discursive meanings should be distinguished, and on what grounds. In most existing inventories, meanings are distinguished in an unprincipled way, which can make the inventory proliferate beyond what is reasonable and manageable, both automatically and by humans. Knott (1996) presents a sound criticism of such approaches:

The extra claim in RST – that text is coherent by virtue of the relations between its intentions – is virtually unfalsifiable without a method for specifying what is to count as a relation in the first place. Even incoherent texts can be analysed according to the relations between the intentions in their spans. For instance, the text in (2.13) seems incoherent at first sight:

(2.13) John broke his leg. I like plums.

Yet we could still define a relation which holds between the intentions underlying the spans in this text: perhaps we could call the relation inform-accident-and-mention-fruit. [...] Clearly, we do not want to include these sorts of relations in any principled set of coherence relations. Knott (1996) (original numbering)

However, the general picture of inventories of discursive meanings is far from chaotic. As can be seen in the overview provided by Hovy and Maier (1995), most inventories share a common core of basic coherence relations, and it is only in the detail that they differ. Concepts like causality, negative polarity or elaboration are present in almost any theory of discourse relations. We believe that a certain degree of stipulation cannot be avoided in any theory about the organization of discourse, because any representation of discourse has a target, which biases decisions with respect to what kind, which or how many discourse relations will be distinguished. Given that bias cannot be avoided, it is convenient to make it as explicit as possible. It is also convenient to restrict bias to the minimum. In our case, the dimensions of discursive meaning that we work with have been fully stipulated, but further divisions within these dimensions have been induced empirically. Our main sources of bias are the representation needs of text summarization and the restrictions of shallow NLP. Since we are working with written text, we disregard the whole range of interpersonal and affective meanings that are mostly conveyed by discourse particles in oral discourse. The fact that we do not make a deep analysis of the text implies

that we do not model intentions, and that we only have shallow indicators of the thematic structure of the text. Finally, the most important source of bias is the fact that we focus on coherence and relevance, thus disregarding discourse relations that do not have any effect with respect to them. Our a priori biases have determined the kind of discursive meaning that we are going to represent: they have determined that we are going to distinguish two dimensions of the meaning of discourse relations, intuitively corresponding to relevance and coherence assessment. The kind of meaning covered by each of these dimensions is detailed in the next section. In what follows we will describe how we have distinguished divisions of meanings within each dimension.

4.2.2.1 Establishing divisions of meaning within dimensions

We have tried to avoid subjectivity in establishing divisions of meaning within dimensions. We have applied data-driven methods to induce relevant distinctions in the meaning of discourse relations, based on the evidence provided by discourse markers. In contrast with other researchers that have exploited this method, we have tried to constrain what qualifies as a textual cue to elicit a basic discursive meaning. We have relied on grammaticalization and cross-linguistic patterns of behaviour. This has led us to a coarse-grained inventory of meanings that nevertheless seems suited to meet our representation needs. Ballard, Conrad and Longacre (1971) and Longacre (1983) claim that the existence of a discourse marker in a language serves as evidence of the existence of a particular type of interclausal relation. More cautiously, Martin (1992) says that discourse relations are markable by surface discourse markers. Knott (1996) determines a whole set of meanings to describe discourse relations based on evidence obtained from the inter-substitutability of discourse markers (or cue phrases, in his own terms). Many other authors make use of discourse markers to support or describe a set of discourse relations: they associate each of their proposed discourse relations with one or more discourse markers that prototypically signal it, and claim that a relation between discourse units holds whenever any of its associated discourse markers can be used to relate them (Hobbs 1985; Mann and Thompson 1988; Scott and de Souza 1990; Sanders, Spooren and Noordman 1992). For example, following this method it can be argued that the cause relation exists because there is at least one discourse marker, because, that signals it, whereas there is no discourse marker to signal inform-accident-and-mention-fruit. Moreover, this signalling has to be distinctive: a certain relation (or feature of the meaning of relations) can only be said to exist when one can find a minimal pair of discourse markers that can only be distinguished by the proposed meaning. Substitution tests have been successfully used (e.g. Knott 1996; Portolés 1998) to determine whether a given pair of discourse markers constitutes a minimal pair: if they can be mutually substituted for each other in all contexts, they are in free distribution and do not constitute a minimal pair, but are signalling a single discursive meaning. If there is at least one context in which one of them cannot substitute the other, they present a (partially) complementary distribution, which means that they are a minimal pair, and provide evidence to distinguish different meanings. This method has a weak point: it crucially relies on the concept of discourse marker, whose definition is rather controversial. Knott and Dale (1996) propose a test to determine when some lexical item is a discourse marker (described in Figure 3.3), but at least two criticisms can be made of it: that it only considers inter-clausal connectives and that it does not take into account grammaticalization or cross-linguistically recurrent polysemy. These last two aspects seem to be very valuable to identify basic discursive meanings. Indeed, there are many discourse markers and other kinds of evidence that signal discursive relations, but we believe that only those that are highly grammaticalized signal basic meanings.
For example, Knott claims that a feature of the semantics of causal discourse relations is the fact that they rely on a semantic or pragmatic source of coherence. However, the discourse markers that make a distinction between these two kinds of causes (seen in example (9)) are much less grammaticalized than those that make the distinction between, say, negative or positive polarity relations (seen in example (10)).

(9) a. The footprints are deep and well-defined. { It follows that / So / # As a result, } the thief was a heavy man. (pragmatic)
b. I had a puncture on the M25 on my way back from work. { As a result, / So / # It follows that } I missed most of the first half. (semantic)

(10) a. Jim had just washed his car, { so / and / # but } he wasn’t keen on lending it to us. (positive)
b. It was odd. Bob shouted very loudly, { but / and / # so } nobody heard him. (negative)

Under Knott’s test, virtually any set of words that relates two propositions seems to qualify as a discourse marker. For example, expressions like In considering this or even Taking carefully into account all this, which she had learnt during her long years at that dreadful school, could also be considered discourse markers. Indeed, these expressions seem to signal relations between discourse units, as in the example below, but they do not seem to be indicative of basic discursive meanings.

(11) Further support for this concept has recently come from the work of Iftikhar who showed higher concentrations of bile in oesophageal aspirates from patients with Barrett’s columnar lined lower oesophagus. In considering this, Stoker and Williams conclude that when gastric and duodenal secretions mix there may be a toxic synergism between the two that leads to mucosal disruption and intracellular damage to oesophageal cells.

In practice, Knott’s test does not contribute to overcoming the methodological inadequacy of an open-bounded inventory of relations that has been criticized in purely stipulative approaches. If discourse markers are to be exploited to induce a set of basic discursive meanings, it seems necessary to distinguish those discourse markers that are signalling precisely a basic meaning, among the set of discourse markers that signal any discourse

relation. This does not imply that discourse markers that fall out of this more restrictive set of discourse markers do not perform a discursive function, but just that they constitute only weak evidence to support the existence of a basic discursive meaning. If this is not done, it becomes far too easy to find a set of words that serves as evidence to support a pre-determined relation, like semantic or pragmatic source of coherence. In addition, ad hoc exceptions have to be made for cases that pass the test but that he does not want to consider as discourse markers, like phrases that refer to the text where they are situated (in the next section) or phrases containing comparatives (most surprisingly).

4.2.2.2 Grammaticalization and cross-linguistic polysemy

We have tried to overcome the inconsistencies of Knott’s test in two main ways. First, we have restricted the kind of meaning that we want to work with a priori, at the moment when the whole approach is conceived. As said before, we believe that a certain degree of stipulation is unavoidable in an application-oriented representation of text, but we try to make explicit and isolate the stipulative aspects of our approach. In this way, we avoid the ad hoc distinctions that Knott seems to make to sort out unwanted meanings, as seems to be the case with in the next section. On the other hand, we have taken into account grammaticalization and cross-linguistic patterns of behaviour to restrict the set of discourse markers that qualify as sources of evidence to distinguish basic meanings, to sort out discourse markers like most surprisingly, but also like it follows that. Grammaticalization is one of the main mechanisms to increase the economy of the linguistic system. Put simply, grammaticalization allows frequent meanings to be expressed as easily as possible, usually in the shortest and most unmarked forms. Of course, in order to preserve the expressiveness of the language, this tendency to simplify linguistic forms has to be balanced with the distinctive power of these forms, that is, they still have to be able to convey as many distinct meanings as is necessary. Following the minimal effort rule, we expect that language provides highly grammaticalized forms to express the most frequent discursive meanings (which we will assume are the most basic), so that they can be expressed and understood with the minimum effort necessary. In contrast, since the size of the lexicon of a language has to be kept as small as possible, more sophisticated meanings are not expressed by items stored in the lexicon, but by linguistic forms that are created and understood compositionally, probably at processing time. It follows that the degree of grammaticalization of a discourse marker can be taken as an indicator of how basic a meaning it conveys. Good indicators of grammaticalization are:

– the probability of occurrence of a lexical item,
– its length, and
– the degree of compositionality that can be observed in its meaning.

For example, we can say that the minimal pair it follows that and as a result in (9) are much less grammaticalized than the minimal pair but and so in (10) because the latter are

much more frequent in text (108,339 and 35,871 occurrences of but and so in the BNC vs. 36 and 1,813 occurrences of it follows that and as a result), they are shorter, and because they do not have a transparent compositional meaning, whereas the former do. Since most, if not all, one-word discourse markers share the properties of but and so, degrees of grammaticalization can only be properly distinguished in multiword discourse markers. The transparency of the compositional meaning of a multiword discourse marker can be approximated as the mutual information holding between its words. Then, the degree of grammaticalization G of a multiword discourse marker dm can be considered as directly proportional to the probability of occurrence of dm in text (P(dm)) and to the mutual information of the words that compose dm (MI(dm)), and inversely proportional to its length.

G(dm) = P(dm) × MI(dm) / length(dm)    (4.1)

For example, the discourse marker “todo ello a pesar de que” has a very low value of G because it is very long and the words that constitute it are very frequent and have very low mutual information. In contrast, the discourse marker “sin embargo” (nevertheless) has a much higher value of G because it is shorter and, although “sin” is a very frequent word, “embargo” tends to occur almost always after “sin”, so they have a very high value of mutual information. Moreover, “sin embargo” is much more frequent in the corpus than “todo ello a pesar de que”. Different granularities of discursive meanings can be distinguished by taking into account progressively less grammaticalized discourse markers. Highly grammaticalized discourse markers convey basic discursive meanings, and progressively less grammaticalized discourse markers make further distinctions within the basic meanings. This implies that, just like the methods discussed previously, our proposal is intrinsically open-bounded, and may yield an infinite inventory of discursive meanings. In contrast to other methods, however, the degree of grammaticalization allows us to set an explicit bound to the inventory; a bound that can be quantified empirically. It would be desirable, however, to determine the threshold degree of grammaticalization in an objective way. Cross-linguistic patterns of behaviour seem to provide the right kind of evidence to do that.
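The score in (4.1) can be estimated directly from corpus counts. The sketch below is a minimal illustration (the function is ours, not an existing tool), assuming the corpus is available as a flat list of word tokens and approximating MI(dm) as the pointwise mutual information between the joint occurrence of the marker and the independent occurrence of its words:

import math
from collections import Counter

def grammaticalization(dm, tokens):
    # Rough estimate of G(dm) = P(dm) * MI(dm) / length(dm).
    # dm: the discourse marker as a tuple of word forms, e.g. ("sin", "embargo")
    # tokens: the corpus as a flat list of word tokens
    n = len(tokens)
    k = len(dm)
    unigrams = Counter(tokens)
    # count occurrences of the marker as a contiguous n-gram
    ngram_count = sum(1 for i in range(n - k + 1) if tuple(tokens[i:i + k]) == dm)
    if ngram_count == 0:
        return 0.0
    p_dm = ngram_count / n
    if k == 1:
        # a one-word marker has no internal compositionality to measure
        mi = 1.0
    else:
        # how much more often the words co-occur as this marker than
        # independence would predict
        p_independent = 1.0
        for w in dm:
            p_independent *= unigrams[w] / n
        mi = math.log2(p_dm / p_independent)
    return p_dm * mi / k

# grammaticalization(("sin", "embargo"), corpus_tokens) should come out much
# higher than grammaticalization(("todo", "ello", "a", "pesar", "de", "que"), corpus_tokens)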

4.2.2.3 Semantic maps for the distinction of meanings

We have explained how substitution tests have been applied to identify distinct meanings based on minimal pairs of discourse markers (Knott 1996; Portolés 1998). These tests are also useful to establish ordered relations between discourse markers, reflecting the area of meaning they cover. These ordered relations are graphically represented in a (possibly multi-parent) tree-like structure. In such a structure, a discourse marker covers all the meaning covered by the discourse markers under its yield.

In such a structure, basic meanings can be distinguished from non-basic meanings by taking into account only the distinctions introduced by discourse markers over a given threshold of grammaticalization. However, this representation does not provide a method to determine this threshold empirically. In contrast, semantic maps allow us to determine this threshold by exploiting recurrent cross-linguistic polysemy (or multifunctionality, following Haspelmath 2003). Semantic maps (Croft 2001; Haspelmath 2003) have been used to deal with data which cannot be properly described with existing terminology, to prevent terminological multiplication and to represent complex interactions of linguistic data more adequately. Just like in one-language substitution tests, semantic maps exploit minimal pairs to determine an inventory of distinct meanings. In its simplest form, a semantic map consists of a number of grammatical functions plus a means to link these functions together, as appropriate. More than one language is typically taken into account to establish such an inventory, so that substitution tests are applied cross-linguistically. After this cross-linguistic inventory is established, it is organized in a two-dimensional map, where meanings that can be conveyed by a single polysemous item (morpheme, word, collocation, in any of the languages considered) are adjacent, and all meanings conveyed by a given item (also in any of the languages considered) must occupy a continuous area, wherein there is no meaning that cannot be conveyed by that item. The basic assumption in semantic maps is that cross-linguistically recurrent polysemy indicates relatedness in meaning, and that this relatedness in meaning can be represented topographically. Semantic maps have been applied to various kinds of lexical items. The closest work to the application of semantic maps to discourse markers is that of Kortmann (1997) and Malchukov (2004), who have successfully applied the theory of semantic maps to formalize the cross-linguistic relations between the different meanings conveyed by subordinating and coordinating conjunctions, respectively. An interesting property of semantic maps is that, as a result of cross-linguistic comparison, we can clearly distinguish areas of meaning that are common to all the languages considered. These areas can be taken as basic meanings. Haspelmath (2003) claims that, if enough not genetically related languages are used to build a semantic map, the areas of meaning that emerge can be considered as universal meanings. In our case, we are only taking into consideration three languages, and moreover, languages that belong to the Indo-European family: Catalan, Spanish and English. Therefore, we cannot claim that the areas that we can obtain by comparing the distribution of meanings in these three languages reflect universal patterns of meaning. However, the cross-linguistic distribution of the meanings of discourse markers in these three languages can help us determine the threshold in the degree of grammaticalization to distinguish discourse markers that signal basic discursive meanings from those that make finer-grained distinctions. We consider that discourse markers that signal basic discursive meanings are those that fully cover a distinct area of meaning, with no overlap with any other area. In an area of meaning with no overlap, no discourse marker with some meaning therein has any meaning in any other area.
A discourse marker that fully covers an area of meaning can substitute any of the other discourse markers in that area, although the reverse may not necessarily hold. We cannot avoid making an ad hoc exception here to exclude lexical items that may have a discursive function, but whose meaning is extremely vague, like and. We believe that they convey virtually no discursive meaning of their own, with a meaning closer to punctuation or mere adjacency than to content-rich discourse markers like because. Their vagueness allows them to act as place-holders for any relation, which could lead one to think that they can convey a very wide range of meanings. However, the meaning of these relations is not provided by the lexical item relating them, but by other mechanisms, like regular implicature. The grammaticalization degree of content-rich discourse markers covering a non-overlapping area of meaning can be taken as the grammaticalization threshold to determine which discourse markers convey a basic discursive meaning. This threshold is useful to identify discourse markers conveying basic meanings when areas are not clearly distinct. Since we are using only three languages to obtain semantic maps, most of the areas of meaning that we can obtain are doubtful, and the grammaticalization threshold is very useful to enhance the set of discourse markers that we can take as basic.
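The contiguity requirement of semantic maps can be checked mechanically: the meanings conveyed by one item must form a connected region of the map graph. A minimal sketch, with an invented toy fragment of a map:

from collections import deque

def is_contiguous(meanings, map_edges):
    # Semantic-map constraint: the meanings conveyed by a single lexical item
    # must form a connected region of the map graph.
    if not meanings:
        return True
    meanings = set(meanings)
    adjacency = {m: set() for m in meanings}
    for a, b in map_edges:
        if a in meanings and b in meanings:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(meanings))
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen == meanings

# Invented fragment of a map:
edges = [("cause", "consequence"), ("consequence", "temporal sequence"),
         ("temporal sequence", "temporal location")]
print(is_contiguous({"cause", "consequence"}, edges))        # True: adjacent meanings
print(is_contiguous({"cause", "temporal location"}, edges))  # False: the region is not connected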

4.2.2.4 Organization of meaning within dimensions

Robust NLP approaches rely on default representations of the data. This default representation is provided whenever no other representation can be obtained; any other analysis of the data is provided whenever we find evidence to override the default. Formally, this procedure can be considered an implementation of the defeasible inference approach that has been extensively used to obtain representations of discourse. We have established a default representation for each of the two dimensions that we have distinguished. These default meanings are characterized as the absence of any of the other possible meanings in the dimension. They convey the unmarked implicatures for the continuation of discourse, and they are typically left unmarked in text. The range of possible meanings in each dimension is ordered in a hierarchy of markedness, so that the least marked meaning is the default, and the rest are only identified in the presence of adequate evidence. This hierarchy of markedness has been translated into decision trees, such as those displayed in Figures 4.3 and 4.4.

In this section we have presented some criticisms of common approaches to determine an inventory of discursive meanings to describe discourse relations. We have discussed how data-driven and application-driven approaches can be combined to motivate a useful inventory, one that is both descriptively adequate and adapted to a certain framework. In the following section we apply this methodology to determine a set of discursive meanings to describe discourse relations.

4.2.3 A minimal description of discourse relations

In this section we apply the methodology discussed in the previous two sections to determine an inventory of discursive meanings such that:

– discourse relations are described compositionally, as a conglomerate of basic discursive meanings,

– basic discursive meanings are organized in dimensions that distinguish heterogeneous kinds of discursive meaning and group together homogeneous kinds of meaning, restricting the possible combinations of discursive meanings,

– dimensions of meanings are motivated by our representation purposes (relevance and coherence assessment),

– distinctions in meanings within dimensions are motivated by the empirical evidence provided by discourse markers,

– the whole system of meaning is structurally economic: none of the possible combinations of meaning is significantly underrepresented.

Since our purpose is to identify relevance and coherence relations in text, we have chosen to distinguish two main dimensions of the meaning of discourse relations: one that accounts mainly for relevance and another for coherence. Cohesive relations (Halliday and Hasan 1976) have been taken as an indicator of the relevance of discourse units. If discourse is represented as a graph, where discourse units are nodes, those units that establish more relations with other units are more relevant. These relations have typically been identified by cohesive properties: discourse units that deal with the same topic are related (Salton, Singhal, Mitra and Buckley 1997; Morris and Hirst 1991; Barzilay 1997). In contrast, coherence relations serve to indicate the quality of the relationships between discourse units, be they cohesive or not. However, as we showed in Alonso and Fuentes (2002) and Alonso and Fuentes (2003), these two kinds of relations provide an improved account of coherence and relevance relations if they are combined. We showed that an automatic summarizer that combined these two kinds of information outperformed a summarizer that exploited only one of them. Therefore, the two dimensions of meaning that we propose to distinguish are not labelled as relevance and coherence, but as structural, to account for cohesion relations, and semantic, to account for coherence relations. We do not use the terms cohesion and coherence because the terms structural and semantic correspond better to the kind of information we will actually be exploiting. The structural dimension describes the relations between discourse units as instances of the relations between units in a well-known type of data structure. We can consider that the semantic dimension is represented as labels of the relations determined by the structural dimension.
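The intuition that units taking part in more relations are more relevant can be read directly off such a graph; the sketch below is a minimal illustration under that assumption, with function names of our own:

from collections import defaultdict

def relevance_by_degree(units, relations):
    # Rank discourse units by the number of relations they take part in.
    # units: list of unit identifiers; relations: list of (unit_a, unit_b) pairs.
    degree = defaultdict(int)
    for a, b in relations:
        degree[a] += 1
        degree[b] += 1
    # more connected units are considered more relevant
    return sorted(units, key=lambda u: degree[u], reverse=True)

# relevance_by_degree(["s1", "s2", "s3"], [("s1", "s2"), ("s1", "s3")])
# ranks "s1" first because it is linked to both other units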

In what follows we will describe the kind of meaning that is to be covered by each of the two dimensions of discursive meaning that we have proposed to distinguish. Then, we will proceed to describe the subkinds of meaning that we have distinguished within each of these dimensions, and how they can be determined from a set of prototypical discourse markers in Spanish, Catalan and English (described in Section 3.4). We finish by describing the properties of the default meanings in each dimension.

4.2.3.1 Dimensions in the semantics of discourse relations

4.2.3.1.1 Structural dimension The structural dimension describes the relations between discourse units as instances of the relations between units in a well-known type of data structure. For example, if discourse is modelled as a stack (Grosz and Sidner 1986), the structural dimension describes whether a discourse relation causes the unit(s) under its scope to be pushed onto or popped from the stack. If discourse is modelled as a tree (Webber 1988; Polanyi 1988), discourse units are considered as nodes, and the structural dimension describes the relation between a node and the rest of the nodes in the tree, with special attention to the node where it is attached. Under different forms, this kind of meaning has been central to many descriptions of discourse relations. Most of the aspects of the semantics of discourse relations that resort to the notions of topic or intention can be translated as instructions to insert a discourse unit in a data structure. For example, forward-looking speech acts (Cooper et al. 1999) can be translated to an instruction to push a new unit onto a stack. As can be expected, the variability of meaning within the structural dimension is highly restricted by the kind of structure whereby discourse is modelled. As explained in Section 3.2, we consider that discourse can be modelled as a sequence of local structures that can correspond to topics or intentions. Each topic (or intention) can in turn be modelled as a sequence of subtopics (or subintentions), constituted by smaller topical (or intentional) units, until the level of minimal discourse units is reached. The RST claim that the rhetorical structure of a text can be modelled as a hierarchical tree covering the whole text (Mann and Thompson 1988) only applies to texts that deal with a single topic (or intention). In our implementation with shallow NLP techniques, we start from minimal discourse units and build progressively bigger topical or intentional units, without attempting to build a single structure that covers the whole text. As explained before, the kind of evidence that we are working with allows us to identify discourse structures at a low level of discourse; therefore we will model this dimension as a sequence of local structures, each structure covering the biggest topical or intentional unit that we can reliably identify, and describing the organization of any smaller units it may contain as a hierarchical tree. These smaller discourse trees are comparable to Polanyi et al. (2004)’s basic discourse units (BDUs). In the structural dimension we target the representation of the topographical configuration of discourse. In order to obtain this configuration, we have to determine, for each discourse unit:

– its location in the structure of discourse, by determining the node in the structure (that is, the discourse unit) where it is attached
– the topographical relation with the node where it is attached, that is, the shape of the arc linking the two nodes (discourse units)

The first aspect does not specifically concern the semantics of the relation, but could be considered as its syntactic behaviour. We consider that the structural semantics of the relation between two discourse nodes is the shape of the edge that relates them. In a structure like the one we propose, this edge can have two shapes: either the two units are at the same level, or they are at different levels. We present the heuristics to determine these two aspects of structural meaning in Appendix B; in this section we deal with the linguistic aspects of their semantics. The semantics of an edge linking two discourse units at the same level is that the second contributes a continuation to the development of the discourse on an equal footing with the first one. In this respect, it is comparable to Grosz and Sidner (1986)’s satisfaction-precedence, RST’s multinuclear or SDRT’s coordinating relations. The semantics of an edge linking two discourse units at different levels is that the one at the lower level elaborates on the information provided by the one at the higher level. It is comparable to Grosz and Sidner (1986)’s dominance relation, RST’s nucleus-satellite or SDRT’s subordinating relations. Typical examples of elaborations are examples, reasons, backgrounds, concessions, and so on. Just like in sentential syntax, discursive hypotaxis is more marked than parataxis, and therefore the inventory of hypotactic relations that a speaker has in mind can be much finer-grained than that of paratactic relations. Since structural relations configure the structure whereby we represent discourse, they also configure the right frontier of the discourse tree (Polanyi 1988; Webber 1988). The right frontier is the rightmost edge of a discursive tree, and it contains the nodes that are available for attachment of subsequent discourse units. Elaborative relations increase the number of nodes that are available for attachment of subsequent discourse units, that is, they enhance the right frontier, while continuative relations never add new nodes and may even reduce the number of nodes that are available for attachment, if the attachment point is at a high level of the structure. These effects also apply to a restricted graph representation such as the one we propose. They are displayed graphically in Appendix B, in Figures B.3 and B.4. A good reason why structural meaning can be considered distinct from other kinds of discursive meaning is the fact that it is often used as an orthogonal distinction in many proposals. Grimes (1975) or Martin (1992) use the paratactic/hypotactic distinction to classify discourse relations. Even in some implementations of RST, this distinction has been implicitly used as an orthogonal distinction. For example, Carlson, Marcu and Okurowski (2003) distinguish different kinds of causal relations that differ only in whether they are multinuclear or nucleus-satellite. The distinction of meanings in the structural dimension is clearly influenced by the kind of representation of discourse that we use.

Figure 4.3: Decision tree to determine the structural meaning of the relation between a discourse unit U and the unit A to which it is attached. The tree asks, in order: does U provide information about the topic (or intention) introduced by A? (yes: elaboration); does U have a syntactically subordinating relation with A?; is U the argument of a verb of saying, thinking or opinion? (yes: continuation; no: elaboration).

However, discourse markers provide enough empirical evidence to support this division of meaning. In Sections 4.2.3.2.1 and 4.2.3.2.2 we will discuss evidence that characterizes continuation and elaboration as distinct meanings. As said above, robust NLP approaches rely on a default representation of the data, which is provided whenever there is no evidence that signals a more informative relation. The range of possible meanings in a dimension is ordered in a hierarchy of markedness, so that the least marked meaning is the default. As follows from the above discussion, the default meaning in the structural dimension is continuation, because it is less marked than elaboration. Elaboration is typically signalled by sentential hypotaxis or co-reference, with one main exception: in reported speech, the syntactically subordinated element is discursively more relevant (Verhagen 2001), as in example (12), where the underlined segment holds an asymmetric relation with its matrix clause but is discursively more relevant.

(12) [..] estoy convencido de que dentro de poco superará el examen de plancha, [...]
[...] I am sure that she will shortly pass the ironing exam, [...]
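Read as a procedure, the decision tree in Figure 4.3 can be sketched as follows; this is a minimal illustration of our own, where the three tests are assumed to be supplied by shallow analysis (co-reference and topical cues, the syntactic parse, and a list of verbs of saying, thinking or opinion):

def structural_meaning(elaborates_topic, syntactically_subordinated, argument_of_saying_verb):
    # Structural meaning of the relation between a unit U and its attachment
    # point A, following the decision tree in Figure 4.3. The three boolean
    # flags are assumed to be pre-computed by shallow analysis.
    if elaborates_topic:
        # U provides information about the topic (or intention) introduced by A
        return "elaboration"
    if syntactically_subordinated:
        if argument_of_saying_verb:
            # reported speech: the subordinated clause is discursively more relevant
            return "continuation"
        return "elaboration"
    # unmarked case: the least marked meaning is the default
    return "continuation"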

In Figure 4.3 we describe the decisions that have to be taken to determine the structural semantics of the relation of a discourse unit with respect to the discourse unit to which it is attached. Note that it is easier to describe the relation of elaboration because it is more marked and therefore it is easier to identify what marks it, with the only exception of reported speech. In sum, the structural dimension accounts for the relation of each discourse segment to the topics or intentions that are dealt with in the text, and represents these relations topographically, as a directed acyclic graph.
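The effects of the two structural meanings on the right frontier, described earlier in this section and displayed in Figures B.3 and B.4, can also be made concrete with a small attachment routine; the sketch below is illustrative only, with a list standing in for the right frontier:

def attach(frontier, attachment_index, new_unit, structural_meaning):
    # Update the right frontier after attaching new_unit to the unit at
    # frontier[attachment_index]. The frontier runs from the root (index 0)
    # down to the most recently attached unit.
    if structural_meaning == "elaboration":
        # the new unit hangs below its attachment point: everything above it
        # stays available, and the new unit itself becomes available too
        return frontier[:attachment_index + 1] + [new_unit]
    # continuation: the new unit sits at the same level as its attachment
    # point, so the nodes below that level leave the right frontier
    return frontier[:attachment_index] + [new_unit]

frontier = ["root", "u1", "u2"]
print(attach(frontier, 2, "u3", "elaboration"))   # ['root', 'u1', 'u2', 'u3']
print(attach(frontier, 1, "u3", "continuation"))  # ['root', 'u3']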

4.2.3.1.2 Semantic dimension The semantic dimension of discourse is typically represented as labels on the relations determined by the structural dimension, but this need not be the case. In some cases, a single discursive unit can be related by many more semantic than structural relations, as in example 3.1, where the node d establishes a semantic relation with nodes b and c. Since they tend to co-occur, semantic relations are not usually perceived separately from structural relations, but they are qualitatively different. What is characteristic of subject-matter relations is that their contribution to the representation of discourse can be expressed propositionally, as “A is the cause of B”, “A and B happened at the same time”, “A is usually not expected from B” and so on. In contrast, one cannot express continuation or elaboration propositionally. Moreover, this kind of meaning can override the default inference processes that arise when two discourse units co-occur. In example (13-a), there is no semantic meaning in the relation between the two discourse units, so default inference applies, giving as a result something like “we went home and then we had a party”. Any further interpretation (for example, causal) is subject to the beliefs and world knowledge of the person who interprets the utterance. However, if some semantic meaning is explicitly added to the relation, possible subjective meanings can be objectivized, as in (13-b), where the possible causal meaning is made objective. It is even possible to override default inference, as in (13-c), where the inference that the event in the first clause happened before the event in the second clause is overridden.

(13) a. We went home. We had a party. b. We went home because we had a party. c. We went home after we had a party.

Moreover, semantically rich relations seem to have an effect on the syntactic structure of sentences. Kehler (2002) shows that the coordinate structure constraint can be overridden by more general rules that can be explained by the presence of certain discourse relations, namely cause or resemblance. As we have said before, we will determine the divisions in meaning empirically, but it is noteworthy that there is a core of relations that can be found in most theories of discourse coherence. The family of causal relations can be found in virtually all inventories of rich relations (cause, consequence, purpose, inference, reason, enablement, etc.), and different kinds of equality relations are almost always to be found (parallelism, resemblance, etc.). Meanings involving negative polarity are also very common (concession, contrast, correction). The various approaches differ mainly in the distinctions they make within these core meanings, and in the number and kind of other relations they may add to these. Based on empirical evidence, we distinguish the following basic meanings in the semantic dimension:

where

DU is the discourse unit that is marked for a discourse relation R
A is the discourse unit where DU is attached by the relation R

revision signals that some content conveyed by A (propositionally, by regular inference, by implicature or by the default development of narratives) is denied by DU (e.g. contrast, concession, correction).

cause signals that the occurrence of a fact conveyed in DU or the utterance of DU causes A to happen or to be uttered, or the reverse: that the occurrence or utterance of A causes DU to happen or to be uttered (e.g. cause, consequence, purpose).

equality signals that it is relevant to establish some kind of equality between all or some of the propositional content of DU and A, or between them as units of language. The equality can be of the kind A is a subkind of B, A is an example of B, A is the same as B (e.g. exemplification, rephrasing).

context signals that DU provides information about the circumstances where A happens or is uttered (e.g. background, time, location, manner, condition).

Even if a discourse relation cannot be characterized by any of the above meanings, we assume that some kind of semantic relation can be established between A and DU, by the mere fact that they are contiguous in the structure of discourse. This relation carries default inference, which may be sensitive to variations in domain or genre. Following Asher and Lascarides (2003), we call this kind of meaning narration, and we consider it the default meaning in the semantic dimension. However, it is qualitatively different from the rest of the meanings in that its meaning cannot be expressed propositionally and cannot possibly affect the default inference processes between two discourse units, precisely because it constitutes the channel for this inference. As in the structural dimension, we have formalized the markedness relations between the different meanings in the subject-matter dimension in the decision tree displayed in Figure 4.4. In this tree, we describe the decisions that have to be taken to determine the semantics of the relation of a discourse unit with respect to the discourse unit to which it is attached. Note that the default semantic relation, narration, can only be described by the absence of any of the features that characterize the rest of the relations.
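The markedness hierarchy formalized in Figure 4.4 amounts to trying the marked meanings in order and falling back to narration; a minimal sketch, assuming the evidence gathered from discourse markers and other shallow cues is available as a set of labels:

def semantic_meaning(evidence):
    # Semantic meaning of the relation between U and its attachment point A,
    # following the decision tree in Figure 4.4. `evidence` is a set of labels
    # produced by shallow cues, mainly discourse markers.
    for meaning in ("revision", "cause", "equality", "context"):
        # meanings are tried from most to least marked
        if meaning in evidence:
            return meaning
    # default: no marked meaning is signalled
    return "narration"

# e.g. a unit introduced by "sin embargo" contributes {"revision"}, so
# semantic_meaning({"revision"}) == "revision"; with no marker at all,
# semantic_meaning(set()) == "narration"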

4.2.3.2 Meanings in the structural dimension

After describing the dimensions of discursive meaning, we will proceed to describe each of the meanings that we have distinguished within each of them. We will motivate these divisions in meaning by evidence provided by discourse markers in Spanish, Catalan and English. When relevant, we will make reference to discursive meanings that have been proposed in the literature.

Figure 4.4: Decision tree to determine the semantic meaning of the relation between a discourse unit U and the unit A to which it is attached. The tree asks, in order: does U deny some of the content that is conveyed by A (propositionally, by inference or by implicature)? (yes: revision); is U the cause of A, or A the cause of U? (yes: cause); do U and A belong to a same class of entities or events, and is this class salient in the context? (yes: equality); does U answer how A?, when A? or where A? (yes: context; no: narration).

4.2.3.2.1 Continuation The semantics of an edge linking two discourse units at the same level is that the second contributes a continuation to the development of the discourse on an equal footing with the first one, as in Grosz and Sidner (1986)’s satisfaction-precedence, RST’s multinuclear or SDRT’s coordinating relations. Typical examples of continuation are lists, and temporal or causal sequences of facts or events that are explained one after the other, as in the following example.

(14) In 1990 a police officer accused of distributing copies of a patriotic song to high-school students was sentenced to 13 years’ imprisonment by a military court. Four men who allegedly planned demonstrations in December 1989 to commemorate the 1988 flag-raising, were sentenced to terms of between six and 12 years.

Continuation is the default meaning in the structural dimension, because what is by default expected of any discourse unit is that it continues the thread of the exposition. However, there are some discourse markers that mark precisely that something constitutes a continuation of the preceding discourse. In the examples below, the discourse markers do not convey any semantic meaning, but just express that the second discourse unit is a continuation of the first.

(15) a. At 13,000 pound plus options (like the CD), this is, short of the cabrio, the Escort/Orion flagship - thank heavens it has as much going for it as it does. The downside is a familiar one: notchy and sticky gearchange, mushy brake pedal, rough and vocal engine (this one had done 4000 miles yet would still not rev to the red line), wind noise at speed and, the seats. When the new seats arrive, the Ghia and the S will be the first to get them. So, it’s okay by the standards of the rest of the range, but what about when faced with the real world? b. We amateurs find various outlets for our products. At one end of the scale, some of us make things solely for our own use, or to give away to our friends and relations. Then there are those who sell their wares for charity at zero gain to themselves. As we progress up the scale of profit-taking, we find others who sell at small craft fairs and those who find an outlet through local shops, until we come to the favoured few who sell at the prestigious crafts fairs which Hugh prefers.

The discourse markers that can convey a purely continuative meaning are discourse markers that typically signal sequence, be it causal (so), temporal (then) or logical (on the other hand)2. However, parallel discourse markers in English, Catalan and Spanish show some discrepancies in their polysemy. The parallels of so and then in English in our discourse marker lexicon are per tant and aleshores in Catalan and por tanto and entonces in Spanish. However, the discourse markers per tant and por tanto are never underspecified to convey a

2Discourse markers with a presentational or logical meaning tend to behave like elements in a formal system, and are thus less prone to the ambiguity that is characteristic of natural language objects; we will therefore leave them aside when studying polysemy.

purely continuative meaning. It is the discourse markers així in Catalan and así in Spanish that can have this meaning. These discourse markers can convey consecutive and manner meanings, and purely continuative meanings, as in the following examples.

(16) El president nord-americà, Bill Clinton, es va mostrar complagut amb la contundent resolució expressada per l’ONU, perquè inclou alguns països que anteriorment havien mostrat ”una lleugera tolerància” amb Saddam Hussein. Així, Rússia, que es va mantenir al marge de les condemnes internacionals en la crisi que l’Iraq va protagonitzar a principis d’any, s’ha mostrat ara en contra del desafiament. (17) El Pla Especial de Protecció de Collserola que gestiona el Patronat (actualmente Consorci del Parc de Collserola) en el artículo 28 prevé la limitación progresiva de la caza mediante la declaración de figuras protectoras del territorio. Esta limitación se fundamenta tanto en la protección de la fauna como en la seguridad de los usuarios del parque. Por ello se consiguió en el año 1994 reducir las áreas de caza del 75% del territorio al 49%, incrementando las zonas de seguridad. Así Barcelona y Esplugues de Llobregat son dos municipios que fueron declarados, a petición propia, zona de seguridad en su integridad.

What seems to be crucial for discourse markers to convey a purely continuative meaning is their grammaticalization: així and así are shorter and much more frequent than the purely consecutive discourse markers, and this makes them more liable to have their meaning bleached away, so that they can be used to convey continuation only.

4.2.3.2.2 Elaboration The semantics of an edge linking two discourse units at different levels is that the one at the lower level elaborates on the information provided by the one at the higher level. It is comparable to Grosz and Sidner (1986)’s dominance relation, RST’s nucleus-satellite or SDRT’s subordinating relations. Typical examples of elaborations are examples, reasons, backgrounds, concessions, and so on. Just like in sentential syntax, discursive hypotaxis is more marked than parataxis, and therefore the inventory of hypotactic relations that a speaker has in mind can be much finer-grained than that of paratactic relations.

(18) More than 130 political prisoners from Irian Jaya are currently serving lengthy prison terms for advocating the province’s independence from Indonesia. Most have been convicted since 1988 under Indonesia’s sweeping Anti-Subversion Law, accused of attempting to establish an independent state of ”West Papua”.

In contrast with continuation, it is hard to find discourse markers that convey elaboration only, without any added semantic meaning. Most discourse markers that convey purely elaborative meanings are polysemous between elaboration and continuation, like moreover in the example below.

(19) a. Can there be anything more telling about the deviousness of these people than his account of how they actually put on television and interviewed a man who

was said to have died while in a prison cell, and that, moreover, they did it with the sole motive of demonstrating that he was alive and in good health. b. To scoffs of disbelief from some delegates, he asserted that, where managed safely and sensibly, nuclear power was one of the few energy sources which did not pollute the atmosphere. Moreover, to close all nuclear power stations would consign 100,000 workers to the dole queues.

Knott et al. (2001) argue that elaboration is not to be considered like the rest of RST-like coherence relations, because its semantics and linguistic realization are significantly different. They claim that it is possible to find at least one prototypical discourse marker by which each RST discourse relation can be realized in texts, but no such evidence can be found to systematically correlate with discourse relations conveying elaboration. Instead, elaboration seems to be realized via different kinds of discourse mechanisms, like reference, thematic trends in the lexicon, and also sentential subordination. This evidence on the linguistic realization of this relation seems to indicate that it does not have the same discursive effects as relations of cause, concession and the like, and, consequently, that it belongs to a different dimension of discursive meaning. We feel that this discussion only makes sense within a non-compositional approach to discourse relations. In our compositional approach, there are many discourse markers that signal some kind of elaborative relation, that is, that signal a discourse relation one of whose components of meaning is elaboration, as in the following examples.

(20) Time will tell whether there is widespread demand for the Discman and the other formats. In some respects they are clearly superior to normal books, for example they have cross-referencing facilities ordinary volumes lack. (21) 65 protesters were reported to have been burned to death when security forces set fire to a shopping centre in which they were seeking refuge.

There is a strong correlation between sentential subordination and discursive elaboration. Sentential subordination usually functions as discursive subordination as well, to the point that mechanisms of sentential subordination can extend their original uses to indicate discursive subordination, as is the case in Yup’ik, a language of the Eskimo-Aleut family where discursive subordination is marked with the same devices as sentential subordination (Lasswell 1996). In the languages that we are working with (Catalan, Spanish and English), most discourse markers that convey an elaborative meaning are subordinating conjunctions. It is then productive to apply the defeasible inference that any hypotactic syntactic relation corresponds to an elaborative discourse relation. However, there are some notable exceptions to this general rule: reported speech (and all verbs of thinking or opinion), result-oriented causes (purpose, consequence) and rephrasing. This led us to distinguish a further kind of meaning in addition to structural and semantic, namely syntactic (Laura Alonso i Alemany et al. 2004). This meaning accounted for properties of relations that were straightforwardly derived from sentential syntax. This additional kind of meaning allowed us to provide an adequate description of the above

cases: constituents introduced by reported speech constructions were characterized as continuation+subordinating, and rephrasing was characterized as elaboration+coordinating. Different kinds of causes could also be distinguished: reason was characterized as elaboration+subordinating, purpose as continuation+subordinating and consequence as continuation+coordinating. As displayed in Tables 4.1 and 4.2, the structural organization of these meanings was highly antieconomic. Moreover, we found that the combination of those meanings seems descriptively adequate, but does not produce distinctions that are useful for relevance or coherence assessment. A discourse relation is equally continuative or elaborative irrespective of its syntactic realization. The effects of sentential syntax on coherence can be directly obtained from the sentential analysis, without any discursive meaning involved. A subordinated construction (a clause or a phrase) is usually related to its matrix clause, so that separating them makes the subordinated clause incoherent or even ungrammatical, and sometimes the matrix clause as well. But these effects do not affect the configuration of discursive structure, and their interaction with discursive meanings is very restricted, if any.

4.2.3.3 Meanings in the semantic dimension

4.2.3.3.1 Revision The contribution of the meaning of revision to the representation of discourse can be propositionally expressed as:

DU denies some content conveyed by A, either propositionally, by inference or by implicature

This meaning has been widely studied in various fields, because there seem to be many cognitive and linguistic processes involved in the realization of this kind of meaning. From the point of view of cognitive processes, it is a costly procedure; from the point of view of argumentation, it is a complex move that implies a competition between arguments, usually where one of them wins; from the rhetorical point of view, it is a very effective strategy of communication; and from the linguistic point of view, it is a highly marked structure in language, usually accompanied by side phenomena like marked modality, veritative expressions and irrealis. Many different kinds of meanings have been distinguished within what we globally call revision: correction (22), concession (also called denial of expectation) (23), counterargument (24), semantic-based contrast (25), or topic-based contrast (26). Only in the case of correction is the negated content explicitly stated. In the rest of the cases, the content that is negated is implied from the first argument of the revision relation by different mechanisms: in the case of concession, by default implicature; in the case of counterargument and semantic contrast, by structural opposition of the arguments, in their intentional or denotational content, respectively; and finally, in the case of topic-based contrast, by the opposition between the default narrative action of continuing in the same line as the previous utterances and the marked action of introducing a change in the narrative.

(22) The Zen foot is not yellow, but white. The skin not waxy, but translucent; the texture not oily, but clean and dry. (23) A state of emergency was declared but another 40 people were killed the following day. (24) Not much flavour, but a fibrous-to-spongy chewiness which sets it the average savoury snack. (25) Torture of political detainees in Mauritania has been routine since 1986, but it has never before been used on such a scale. (26) To outsiders, the two events may seem little more than an international organization at work, Amnesty members these events are charged with significance. The Cold War was at its height when Peter Benenson, the British lawyer, founded Amnesty, and three decades later it is hard to believe that the Moscow AI Group finally has permission to become part of Soviet life. Similarly, the idea that a human rights concert should be held in the very stadium in Santiago where Allende’s officers rounded up thousands of Chileans in 1973, prior committing gross violations, stretches the powers of credulity. But incredulous, or not, the events happened (millions of television viewers worldwide watched the Chilean concert) and as such they typify the massive changes that Amnesty has undergone in its 30-year history.

As can be seen in the above examples, the discourse marker but can signal any of the meanings that have been distinguished in previous approaches. Its equivalents in Catalan (però) and Spanish (pero) can have exactly the same functions. This discourse marker is very frequent in text, and there is no shallow clue that characterizes any of these meanings. Languages like Russian express different shades of revision meaning by different discourse markers (one signalling semantic opposition or pure contrast, and another signalling concession) that seem to be in complementary distribution3. In any case, we are only taking into account evidence from English, Catalan and Spanish, and in these languages there is clear motivation to distinguish a single kind of revision meaning. This meaning is the most marked in the semantic dimension; it is usually heavily marked in text, by various means: evidentiality, negative polarity, degree markers and co-occurrence of various discourse markers with revision meaning, as can be seen in the following example:

(27) “Guess The Weight Of The Donkey” and “Pin The Tail On The Jam Sponge” are not, in fact, recognised PR mechanisms, contrary to some reports, although it is true that they inspired the electoral systems in Israel and Italy respectively.

3Although this complementary distribution has to be taken with some reservation: Russian argumentative schemes may tend to avoid strong negations like the one conveyed by no, and rephrase sentences to use contrastive a instead.

Negative polarity is precisely one of the features that have usually been associated with the kind of discourse relations that we have grouped under revision (Sanders et al. 1992; Knott 1996; Lagerwerf 1998). However, there are revision relations that involve no overt negative polarity, as in example (28). Even in these cases, the meaning we have proposed for revision still holds, because the second clause denies some of the content conveyed by the first clause. In this example, the content that is denied is the regular implicature that, following the maxims of relevance and quantity, one always tries to be as informative as possible. This expectation is not fulfilled in the example, because the second clause is more informative4 than the first one with respect to a relevant aspect of the first clause, namely, that there is someone that is fun.

(28) Maria is fun, but with Santi I can’t stop laughing.

Moreover, there are some discourse markers that have inherent negative polarity, like unless, but do not convey a revision relation. Therefore, overt negative polarity does not seem to be necessary for revision relations.

4.2.3.3.2 Cause The contribution of the meaning of cause to the representation of discourse can be propositionally expressed as:

the occurrence of a fact conveyed in DU or the utterance of DU causes A to happen or to be uttered, or the reverse: that the occurrence or utterance of A causes DU to happen or to be uttered

In contrast with revision, causal relations are not highly marked in text: usually there is only one discourse marker to mark them, but it is also very usual that no discourse marker is found and that the relation is recognized by inference. A good proof of the fact that causal relations are mainly established by the content of the discourse units they relate are the works of Soricut and Marcu (2003) and Hutchinson (2004), who apply machine learning techniques to characterize rhetorical relations by the features of the discourse units they relate, obtaining remarkable results for causal relations. Causal relations are distinguished in any approach to describing discourse relations that has more than two relations. Different kinds of cause have been distinguished, based on different features of their meaning. The most widely spread distinctions are:

– volitional vs. non-volitional (Mann and Thompson 1988)
– result-driven vs. cause-driven (also anchor-based vs. counterpart-based), depending on where the focus (or nucleus) of the relation is located (Mann and Thompson 1988; Knott 1996)
– pragmatic vs. semantic source of coherence, distinguishing whether discourse units are related because of their illocutionary force or by their propositional content, respectively (Sweetser 1990; Sanders, Spooren and Noordman 1992)

4If we consider grades of adjectives like quantifiers.

We find that the discourse markers that are most representative of causal meaning are so, because and (in order) to, because they are highly grammaticalized. The kind of causal meaning conveyed by these discourse markers is very different, but relevant distinctions between them can be made by resorting to structural meaning: we can distinguish between elaborative cause (reason, because) and continuative cause (purpose, in order to; consequence, so). This distinction is comparable to the result-driven vs. cause-driven distinction, but in our case we are not concerned about the location of the cause or the result of the relation, but rather about the location of the focus, as it is structurally signalled by discourse markers. Even if different causal meanings can be distinguished and have been distinguished in the literature, discourse markers provide evidence that cause is to be treated as a compact core meaning. In Catalan, perquè and its phrasal counterpart per are ambiguous between a causal reading (because, because of) and a purpose reading (so that, to). This provides enough evidence not to distinguish reason from purpose. Furthermore, discourse markers that typically signal consequence can also signal purpose, as in the following example.

(29) If part of the credit for a breakthrough could be claimed by Mr Hussein, and so induce him to pull his army out of Kuwait, why not solve two problems in a single bloodless stroke?

In sum, we believe that discourse markers provide enough evidence to treat causal meaning as a whole. As said before, this does not mean that further meanings cannot be distinguished within the causal domain, but we believe that this level of abstraction is useful to obtain a reliable representation of causal relations via shallow NLP.

4.2.3.3.3 Equality The contribution of the meaning of equality to the representation of discourse can be propositionally expressed as:

DU and A are equal as units of language, units of content or it is relevant to establish an equality between some part of their content.

The kinds of meaning that we consider under parallelism are all modalities of equality, of the form A is mod(B), where the modalizing factor can be is a kind of, is similar to, is the same as, etc. This covers meanings like exemplification, comparison, logical structures (lists, two-sided arguments, etc), summary or rephrasing. We exemplify these usages below.

(30) In some respects they are clearly superior to normal books, for example they have database cross-referencing facilities ordinary volumes lack. (31) It’s not much fun, you’re chained to the wall - it’s like going back to a medieval library. (32) Now the women want equal pay. What is this? Will they never be satisfied? On the other hand they are telling us we are not playing for the money but the love of the sport.

(33) Nevertheless, the review represents substantial progress. It holds out the promise of swifter verdicts. It appends that promise to a clear and comprehensive code of conduct (one that builds helpfully on the recently published Fleet Street set of principles). And it adds to the sanctions the Council may take - positioning of judgements in offending papers; a privacy hot line; and, more controversially, hauling in proprietors to discipline editors. The Council, in sum, is seeking most of the tools it needs to relaunch itself as an active, respected self-regulator. (34) In the past, some purists have said that all surface decoration applied to designed objects - all ornament, in other words - must be based on motifs which look flat.

The semantic map of equality is a rather interesting one, because it has the form of an overlapping continuum: the discourse marker as (note that it is highly grammaticalized) covers the meanings of exemplification and comparison; also covers comparison meaning and can be used to organize logical structures; summary and rephrasing can be expressed by the same discourse markers, and discourse markers like in essence, which show the opposite meaning to for example or that is, come right next to them in the continuum of meaning. It is clear that this continuum can be segmented, because the rest of the semantic meanings are clearly delimited, and also because we expect that the general mechanism of analogy is reflected at the discursive level as it is reflected at other levels of language.

4.2.3.3.4 Context The contribution of the meaning of context to the representation of discourse can be propositionally expressed as:

DU provides information about the circumstances where A happens or is uttered

Context can be regarded as the textual realization of the figure–ground principle of cognitive grammar. Langacker (1987) claims that it is a general pattern of human perception that some figure is made relevant by contrast to a background, and that this pattern is reflected in language. Indeed, there are plenty of ways in language to provide information about the context for a given entity or event. Maybe because it is such a common use of language, many different kinds of context can be distinguished: time (point, duration, etc.), place (location, coordinates), manner, general background, condition (also known as enablement), etc. It is not costly to distinguish temporal from spatial uses of discourse markers by shallow NLP techniques. One can quite reliably detect whether the content of a discourse unit dominated by an ambiguous discourse marker is temporal or spatial, simply by collecting a bag of location words (e.g.: country, house) and another bag of time words (e.g.: day, year, period) and applying pattern-matching techniques. However, this kind of information is qualitatively different from the information that we are exploiting in this thesis, so we will leave this line of research unexplored.
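The pattern-matching idea just described (which we leave unexplored here) would amount to little more than matching two small word lists against the content of the unit; the word lists below are illustrative, not a resource used in this thesis:

# Illustrative word lists; a real lexicon would be larger and language-specific.
TIME_WORDS = {"day", "year", "month", "period", "decade", "morning"}
PLACE_WORDS = {"country", "house", "city", "region", "street", "stadium"}

def temporal_or_spatial(segment_tokens):
    # Guess whether a unit dominated by an ambiguous marker is temporal or
    # spatial by counting matches against the two bags of words.
    tokens = {t.lower() for t in segment_tokens}
    time_hits = len(tokens & TIME_WORDS)
    place_hits = len(tokens & PLACE_WORDS)
    if time_hits > place_hits:
        return "temporal"
    if place_hits > time_hits:
        return "spatial"
    return "undecided"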

For coherence and relevance assessment, we believe that all these kinds of meaning behave in a comparable way. Moreover, discourse markers seem to provide evidence that a general “context” meaning exists. Most temporal prepositions (which can also act as subordinating conjunctions) are polysemous between temporal and spatial readings, as seen in the example below5.

(35) a. The releases occurred shortly before two AI representatives arrived in Swaziland for talks with the government. b. Har Govind, a former chief commissioner of income tax, the existing limit of 40% will be increased to 51% under industrial legislation put before the Indian Parliament last month.

Some recurrent patterns of polysemy seem to suggest that there is a common meaning covering both cause and context. Discourse markers that signal temporal sequence (then) can very often also signal causal sequence. Moreover, discourse markers of manner are the preferred markers for consequence (so, in this way). From the comparison of Catalan, Spanish and English, we have found no strong evidence to support a distinction between these two areas. However, there are descriptive reasons to prefer to distinguish these two meanings. Indeed, the discursive effects of context with respect to relevance and coherence are very different from those of cause, since cause tends to establish stronger coherence effects than context. Moreover, there are various discourse markers exclusively signalling context or cause whose grammaticalization index is equal to that of the discourse markers that very clearly cover a whole area of discursive meaning, like but. A clear definition and delimitation of the area of meaning we have called “context” needs to be further pursued, with evidence from other languages and probably also diachronic data. In the current work, descriptive adequacy has been the only strong reason to shape this area of meaning.

4.3 Discussion

In this chapter we have proposed a treatment of the meaning of discourse relations that is specially suited for text summarization via shallow NLP techniques. Our target was to determine which meaning could be reliably obtained via shallow cues and how it could best be represented. We have described some linguistic phenomena that provide information on the structure of utterances beyond the clausal level, and which can be reliably analyzed by shallow NLP techniques. We have focused on those phenomena that are especially adequate to our NLP capabilities, namely, punctuation, shallow syntactic analysis and discourse markers.

5We are not considering here the information introduced by very vague prepositions, like “at” or “by”, just as we did not consider that the polysemy of and qualifies to identify basic meanings, because the lexical item is too vague, and we can consider that it conveys no meaning of its own.

name of the relation                          | semantic meaning | structural meaning
contrast                                      | revision         | continuation
concession                                    | revision         | elaboration
result-driven cause (purpose / consequence)   | cause            | continuation
reason                                        | cause            | elaboration
parallelism                                   | equality         | continuation
exemplification                               | equality         | elaboration
topic                                         | context          | continuation
background                                    | context          | elaboration
narration                                     | narration        | continuation
explanation                                   | narration        | elaboration

Table 4.3: Discourse relations that can be obtained from the combination of the meanings that we are able to distinguish reliably via shallow NLP techniques.
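Table 4.3 is, in effect, a lookup from a pair of basic meanings to a traditional relation name; a minimal sketch of that mapping:

RELATION_NAME = {
    ("revision", "continuation"): "contrast",
    ("revision", "elaboration"): "concession",
    ("cause", "continuation"): "result-driven cause (purpose / consequence)",
    ("cause", "elaboration"): "reason",
    ("equality", "continuation"): "parallelism",
    ("equality", "elaboration"): "exemplification",
    ("context", "continuation"): "topic",
    ("context", "elaboration"): "background",
    ("narration", "continuation"): "narration",
    ("narration", "elaboration"): "explanation",
}

def relation_name(semantic, structural):
    # Map the two meaning components of a relation to the names in Table 4.3.
    return RELATION_NAME[(semantic, structural)]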

In the first place, we have argued that the meaning of discourse relations is best described compositionally. Then, we have determined two different kinds of discursive meaning that we find useful to determine relevance and coherence relations in text, namely structural and semantic. We have organized these two kinds of meaning as dimensions where distinct meanings have been distinguished relying on the evidence provided by discourse markers. These distinct meanings have been organized in a hierarchy of markedness, formalized in decision trees. In sum, we have provided a set of seven distinct meanings (two structural and five semantic), organized in two dimensions of meaning, which allow us to distinguish the ten basic discourse relations shown in Table 4.3. As will be discussed in Section 5.3, these relations seem to correspond to some patterns underlying human judgements on discourse. These meanings are organized in a hierarchy of markedness supported by the agreement of human judges in identifying relations between discourse segments. This organization significantly reduces the cost of taking decisions for a discourse parser: in case there is contradictory evidence, the most marked meaning wins. What needs to be investigated is the relation between meanings and other kinds of evidence, like syntactic structure, punctuation or lexical choice, because these certainly determine the strength of a given piece of evidence for marking some meaning. Future work includes enhancing the scope of the inventory presented here, and testing its applicability for uses other than text summarization. A very interesting line of research consists in integrating evidence provided by discourse markers in other languages, in relation with the proposed inventory of discursive meanings. This kind of evidence can contribute to settling an inventory of discursive meanings of more general scope and also to establishing finer-grained divisions within areas of meaning that cannot be properly segmented with evidence from Catalan, Spanish and English alone.

CHAPTER 5

Empirical Support

In this chapter we provide empirical support for some of the theoretical concepts proposed in the previous two chapters. We base this support on two kinds of experimental data: evidence from human judgements and evidence from automatic procedures based on shallow textual clues. We provide evidence to support the concepts of discourse segment, discourse marker and the discourse relations that we are working with.

In Section 5.2, we present an experiment where we asked some human judges to summarize newspaper articles by removing words that they considered irrelevant. We assessed the agreement between judges with respect to the relevance of words, and found that judges do not agree on which words to remove from a text more than could be expected by chance, which is not surprising taking into account previous work in the evaluation of summarization (see Section 2.2), and seems to reflect the inherent subjectivity of the task as it is defined. However, a discretization of text into discourse segments and the classification of discourse segments into different classes allows us to model the behaviour of judges so that interesting patterns of agreement can be found, thus supporting our hypothesis that text is segmented and that segments can be classified into different types with different discursive behaviours.

We have shown that the probability that a word is removed can be predicted with higher reliability if it is conditioned on the form and semantics of the segment where the word occurs. For example, words occurring in a segment of context marked by punctuation and discourse markers have a higher probability of being removed than words in a relative clause. But what is more interesting is that the level of agreement between judges is significantly higher for some kinds of discourse segments distinguished by their semantics. Agreement was much higher than the average when it concerned the relevance of words belonging to segments with strong discursive meanings, like cause, equality or revision. This supports our proposal of discourse semantics, which seems adequate to capture speakers' intuitions about the relevance of words, both for the unit of segment and for the categorization of segments. It also seems to support the hierarchy of markedness in discursive meanings, because there is more agreement for cases with more marked meanings. The behaviour of judges, modelled according to the proposed organization of discourse, has provided very valuable data to refine summarization heuristics based on discourse representations. We have applied this information to identify and characterize relations between discourse segments as an aid to produce automatic summaries of e-mail messages in Carpanta (Alonso et al. 2003b).


Once we know that the concept of discourse segment has an empirical basis, we assess the empirical basis of relations between segments. In Section 5.3 we present an experiment where three judges (two naive judges and a linguist) identify relations between discourse segments that have been previously identified, and describe the semantics of these relations. The level of agreement between these judges does not allow us to assert the stability and reproducibility of the task, but it is well beyond chance agreement and supports our claim that these features can capture the intuitions of speakers with respect to discourse organization.

Finally, we show that our theoretical proposal can be satisfactorily treated by a shallow NLP approach. In Section 5.4 we show that two algorithms to identify discourse segments obtain good levels of performance exploiting different kinds of shallow cues. This experiment also shows that discourse markers significantly improve the performance of the algorithms. Then, in Section 5.5 we describe a tool that relies on shallow linguistic evidence and large amounts of corpora to automatically enhance a starting lexicon of discourse markers.

First of all, in the next section we discuss how the significance of empirical data has been assessed: by studies of variance, hypothesis testing, measures of agreement between judges and measures of performance of automatic procedures.

5.1 Assessing the significance of empirical data

If empirical data are to provide support for theoretical claims, it is important to assess the significance of the data with respect to the hypotheses to be tested. Imagine we want to test whether two kinds of segments behave differently with respect to relevance, for example, segments within causal relations and segments within equality relations. To assess this difference in behaviour, we have a corpus where each segment has been associated with a relevance score. Then, before finding differences or similarities in the scores associated with causal or equality segments, it is necessary to take into account the reliability of the association between segments and a relevance score. One way to do that is to compare the actual association with a random association and check whether they differ significantly, by hypothesis testing. Another possibility is to have more than one association for the same set of segments, produced independently, for example, by different people or different algorithms, and check whether their ratio of agreement is significant, typically, beyond chance agreement. We are going to use both of these approaches (hypothesis testing and ratio of agreement) to assess the reliability of the data whereby we support our theoretical claims, and also the stability and reproducibility of the results presented here.

In our experiments, the data consist of a classification of linguistic units into predefined classes that we find relevant for summarization. For example, in Section 5.2 words are classified as removable or non-removable according to their relevance in a text. In Section 5.3 discourse segments are classified by their semantic features, assigned from a closed set of possible features. In Section 5.4 words are classified as belonging to the same segment as the word preceding them in the sequence of text or to a different segment. In Section 5.5 sequences of words are classified as forming part of a discourse marker.

We are not going to provide here an assessment of the significance of the results produced by automatic procedures, because the algorithms that produce the data are transparent and the hypothesis that they have been produced randomly can be rejected with a high degree of certainty. But assessing the significance of human judgements is more interesting, because there is more than one judge interpreting the same phenomena, and because the procedure whereby judgements have been produced is hidden. So, it needs to be tested whether there is a significant relation between the interpretations of different judges. In what follows we discuss the methods we have exploited for significance testing as applied to human judgements.

5.1.1 Study of variance

[Figure 5.1: schematic boxplots a) and b), with the mean, quartiles and outliers labelled.]

Figure 5.1: An example of how the variance of a given population with respect to a numeric parameter can be represented graphically.

In order to gain a first impression of the variability in judgements, that is, to know whether judgements by different subjects were comparable, we studied the distribution of values for relevant aspects of the judgements. We carried out a study of variance of aspects like: the number of words removed from texts by each judge, the average number of words removed in each text, the number of segments classified in each category, the length of segments, etc. This study of variance consisted in obtaining the mean and the quartiles and in identifying outliers, and we represented it graphically by a boxplot, which is interpreted as shown in the example in Figure 5.1. In this example, a) has been produced by a normal distribution and shows less variance than b), whose distribution is clearly not normal and shows much more variance than a), since the tails of the boxplot are longer and it even has some outliers.

Boxplots also allow us to get a grasp of the distribution of the data, which is a crucial aspect when choosing hypothesis tests. Most hypothesis tests assume a certain distribution of the data in order to formulate hypotheses against which to test the distribution of the actual data, so it is necessary to know whether the actual distribution of the data matches the assumptions made by the test. Some measures of agreement, like the percentage of agreement between judges but also some computations of the Kappa coefficient, are also affected by prevalence and bias in the data, so it is important to inspect the data beforehand and see whether the distribution of the values will produce any of these effects.

5.1.2 Hypothesis testing

In the first place, we compared the distribution of human judgements on the same set of items against a random distribution, to test whether the agreement of subjects on the words to be removed could be attributed to some underlying organization of the text (alternative hypothesis) or did not significantly differ from what can be obtained by chance (null hypothesis). We applied the t-test of difference in means to determine whether the difference between human and random judgements was significant or not. The standard t-test (5.1) looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a normal distribution with mean µ. The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance assuming that the sample is drawn from a normal distribution with mean µ.

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}        (5.1)

where

N is the sample size
\bar{x} is the mean of the sample
s^2 = \frac{\sum (x - \bar{x})^2}{N} is the variance of the sample

In our case, we cannot assume normality of the samples, or at least not of the random sample, which is typically uniform. Therefore, we have chosen a variant of the t-test that does not require the samples to have a normal distribution: the matched samples t-test, calculated with the variant of the t-test for correlated data (5.2) provided in the R package for statistical computing (The R Project for Statistical Computing 2004), which relies on the assumption that the data are correlated rather than assuming that the two samples are independent normal samples.

t = (\bar{X} - \bar{Y}) \sqrt{\frac{n(n-1)}{\sum_{i=1}^{n} (\hat{X}_i - \hat{Y}_i)^2}}        (5.2)

where

X, Y are the samples to be compared
\bar{X}, \bar{Y} are the means of the samples
\hat{X}_i = (X_i - \bar{X})
\hat{Y}_i = (Y_i - \bar{Y})
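As an illustration only, the following sketch shows how such a matched samples t-test could be computed. The original experiments were run in R, so this Python fragment, its variable names and its toy data are our own assumptions rather than the thesis implementation.

# Illustrative sketch (assumed, not the thesis code): paired t-test comparing
# per-word human removal probabilities against a random baseline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Proportion of judges that removed each word (toy data for 300 words).
human = rng.beta(2, 2, size=300)

# Random counterpart over the same words, typically uniform.
random_baseline = rng.uniform(0, 1, size=300)

# Matched samples t-test for correlated data, as in equation (5.2).
t_stat, p_value = stats.ttest_rel(human, random_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# The thesis compares |t| against the critical values listed in Table 5.1.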

Since the direction of the comparison of the real and random distributions is irrelevant, the critical value of t is determined by a two-tailed distribution; since the distribution of t is symmetric, the critical values for the two tails differ only in sign. For infinite degrees of freedom and p = 0.05, the critical value of t is tcrit = ±1.645¹, meaning that values of t greater than 1.645 or smaller than −1.645 indicate that the distributions differ significantly, so the null hypothesis that they are the same distribution can be rejected, with a 5% probability of error. A detailed list of critical values of t for various values of p can be seen in Table 5.1.

¹Values taken from http://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm.

degrees of freedom   0.10   0.05   0.025   0.01   0.005   0.001
100                  1.29   1.66   1.98    2.36   2.63    3.17
∞                    1.28   1.64   1.96    2.33   2.58    3.09

Table 5.1: Critical values of t for varying degrees of freedom; columns correspond to values of p.

5.1.3 Ratio of agreement

We studied correlation by applying Intraclass Correlation (ICC), which assesses rating reliability by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects, with the following formula:

ICC = \frac{s^2(b)}{s^2(b) + s^2(w)}        (5.3)

where s^2(w) is the pooled variance within subjects, and s^2(b) is the variance of the trait between subjects. Values of the correlation coefficient range from −1 to 1, with 1 indicating perfect agreement, −1 total disagreement, and 0 indicating that judges do not agree any more often than they disagree. It is easily shown that s^2(b) + s^2(w) equals the total variance of ratings, that is, the variance of all ratings regardless of whether they are for the same subject or not. Hence the interpretation of the ICC as the proportion of the total variance that is accounted for by between-subject variation. To use the ICC with ordered-category ratings, one must assign the rating categories numeric values. In order to do that, we transformed the interpretations of each judge into numbers, so that each word was assigned 1 or 0 according to whether it had been removed or not.

We calculated raw agreement because it is simple, intuitive and meaningful. Proportional agreement was calculated as the average pairwise agreement for all possible pairs of judges for a given text. We obtained three variants of it: the proportion of overall agreement, the proportion of positive agreement and the proportion of negative agreement. First, the pairwise proportion of overall agreement was calculated, that is, the proportion of cases for which each pair of judges agree for a given text, calculated as follows:

Proportional\ Agreement = \frac{\text{\# cases in which judges agree}}{\text{\# cases}}        (5.4)

This measure has the limitation that it can be high even if judges make their judgements by chance, most of all if one of the possible judgements is predominant. In order to factor out agreement by chance, we calculated proportions of specific agreement and the Kappa statistic.

Proportions of specific agreement are the proportions of cases for which each pair of judges assign the same label; in our case, the proportion of cases for which judges agree on removability and the proportion for which they agree on non-removability. These proportions are interpretable as estimated conditional probabilities: for example, the proportion of specific agreement on removability estimates the conditional probability that, given that one of the judges, randomly selected, decides that a word has to be removed, the other judge will take the same decision. For dichotomous judgements, as in our case, the proportions of specific agreement are calculated as follows:

Proportion\ of\ Specific\ Agreement(x) = \frac{2 \cdot Ag(x)}{2 \cdot Ag(x) + D}        (5.5)

where

x is one of the possible judgements
Ag(x) is the number of cases in which judges agree on x
D is the number of cases in which judges disagree

The joint consideration of these two proportions of specific agreement addresses the objection that agreement may be high by chance alone, if one of the two cases considered (that words are removed or that they are not) is heavily predominant. A high value for both proportions of specific agreement would imply that the observed level of agreement is higher than would occur by chance. Thus, by calculating both proportions, and requiring that both be high to consider agreement satisfactory, one meets the original criticism raised against raw agreement indices.

Finally, we calculated majority agreement, a standard index for calculating agreement when there is no gold standard. In these cases, the opinion of the majority of judges can be taken as the gold standard. In our case, either of the two judgements was considered the majority judgement when it was chosen by at least half plus one of the judges that had judged the case. In contrast with the previous measures, majority agreement is not pairwise; it calculates the agreement of each individual judge with the majority opinion. It is calculated exactly as the proportion of overall agreement, but in this case one of the "judges" is always the gold standard constituted by the majority. Note that the majority opinion, with which we calculate the agreement of each judge, has been calculated taking into account the opinion of that judge, which increases the expected proportion of agreement.

Majority\ Agreement = \frac{\text{\# cases in which a judge agrees with the majority}}{\text{\# cases}}        (5.6)

The Kappa coefficient (Cohen 1960) has been the standard measure of stability and reproducibility of human judgements in corpus annotation since Carletta (1996). The main advantage of this measure over others, like the percentage of agreement between judges, is that it factors out the possibility that judges agree by chance. Kappa values range from κ = −1 to κ = 1, with κ = 0 when there is no agreement other than what would be expected by chance, κ = 1 when agreement is perfect, and κ = −1 when there is systematic disagreement. Following the interpretation proposed by Krippendorf (1980) for corpus annotation, Kappa values above 0.8 indicate good stability and reproducibility of the results, while Kappa < 0.68 indicates unreliable annotation.

K = \frac{P(A) - P(E)}{1 - P(E)}        (5.7)

where P(A) is the proportion of times judges agree and P(E) is the proportion of times judges are expected to agree by chance.

In recent times, this measure has been criticized because the assumptions on which it relies are not clear enough. DiEugenio and Glass (2004) have clarified this issue, distinguishing two different computations of kappa that have been used interchangeably in the literature. They have highlighted the differences between the two, and hidden assumptions have been made explicit.

Less sophisticated measures of agreement do not factor out agreement by chance, but they are more transparent with respect to the aspects of the data that are taken into account, which is interesting if one wants to carry out a detailed inspection of the data. This is the reason why we also exploited correlation of judgements and raw agreement besides the kappa coefficient.
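For illustration only, the sketch below implements the agreement measures in equations (5.4)-(5.7) for dichotomous (removed = 1 / kept = 0) judgements. It is a minimal example under our own assumptions: the function names and the toy data are invented, and the thesis computations were carried out with standard statistical software rather than this code.

import numpy as np

def overall_agreement(a, b):
    """Proportion of overall agreement between two judges, eq. (5.4)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def specific_agreement(a, b, label):
    """Proportion of specific agreement on `label`, eq. (5.5)."""
    a, b = np.asarray(a), np.asarray(b)
    ag = np.sum((a == label) & (b == label))  # cases where both judges chose label
    d = np.sum(a != b)                        # cases of disagreement
    return 2 * ag / (2 * ag + d)

def majority_agreement(judgements):
    """Mean agreement of each judge with the majority opinion, eq. (5.6);
    a judgement is majoritary when chosen by at least half + 1 of the judges."""
    judgements = np.asarray(judgements)       # shape: (n_judges, n_words)
    n_judges = judgements.shape[0]
    majority = (judgements.sum(axis=0) >= n_judges // 2 + 1).astype(int)
    return np.mean(judgements == majority)

def cohen_kappa(a, b):
    """Cohen's kappa for two judges and binary labels, eq. (5.7)."""
    a, b = np.asarray(a), np.asarray(b)
    p_a = np.mean(a == b)
    # Chance agreement from the marginal label distributions of both judges.
    p_e = sum(np.mean(a == lab) * np.mean(b == lab) for lab in (0, 1))
    return (p_a - p_e) / (1 - p_e)

# Toy example: 3 judges, 8 words (1 = removed, 0 = kept).
J = [[1, 1, 0, 0, 1, 0, 1, 0],
     [1, 0, 0, 0, 1, 1, 1, 0],
     [1, 1, 0, 1, 1, 0, 0, 0]]

print(overall_agreement(J[0], J[1]))
print(specific_agreement(J[0], J[1], label=1))  # agreement on removability
print(specific_agreement(J[0], J[1], label=0))  # agreement on non-removability
print(cohen_kappa(J[0], J[1]))
print(majority_agreement(J))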

5.2 Human judgements on discourse segments

The purpose of our experiments with human judges on discourse segmentation is twofold. First, we want to check whether our theoretical delimitation of discourse segments corresponds to the intuitions of naive judges in relation to the task of text summarization. Then, we want to gather examples that show how discourse segments are realized in text and what shallow textual clues can be found to identify them in a principled way, either manually or automatically.

Since one of the main purposes of this experiment was to check whether naive judges had a stable, reproducible concept of discourse segment, they were given only very vague, intuitive instructions about the segmentation task.

In a pilot test we had found that the concept of minimal discourse unit, or at least of discourse segment as defined by us, does not come naturally to naive judges, who feel very insecure about their own judgements and therefore produce results of very low reliability. Taking this into consideration, we designed a more natural task, exploiting interpretative abilities that are used in everyday life and that are related to our final task of text summarization, which is more understandable to an end user than discourse segmentation. Thus, the overt purpose of the task was to remove all non-relevant information from a text. Implicit in this task is a pre-process of segmentation: to remove fragments, one has to identify fragments first. Therefore, segmentation was performed without it being the focus of attention of judges, which spared them insecurity in their judgements. Moreover, the kind of segmentation was targeted to text summarization, which is well suited to our purposes.

In short, the instructions provided to judges can be summarized as:

mark all intrasentential fragments of text that are not necessary to understand the main content conveyed by the article and that can be removed while preserving the correctness of the resulting text

We asked that the fragments of text be intrasentential because we consider sentences to be the upper bound for minimal discourse units. We were aware that this restriction might make the task unnatural, but judges did not mention it as a problem in the questionnaires that were given to them after the segmentation task. Moreover, whole sentences could be removed by removing all the words in each sentence.

Our starting hypotheses are the following:

Hypothesis 1 the probability that a word is removed is not homogeneous, but some words have a higher probability of being removed than others.

null hypothesis the probability that a word is removed is homogeneous across text.

Hypothesis 2 the probability that a word is removed is conditioned by the fact that it belongs to a certain latent class.

Hypothesis 2a words have different probabilities of being removed conditioned by the fact that they belong to a marked or an unmarked segment.

Hypothesis 2b words have different probabilities of being removed conditioned by the fact that they are marked by different textual cues.

Hypothesis 2c words have different probabilities of being removed conditioned by the fact that they belong to segments with different discourse semantics.

null hypothesis the probability that a word is removed is homogeneous across classes.

Each of these hypotheses is tested against its corresponding null hypothesis in Sections 5.2.3 and 5.2.4. If the null hypotheses are rejected, it can be assumed that the annotations by human judges support our theoretical proposal about the organization of text at discourse level.

In order to test the two main hypotheses, we have taken a latent class model approach. The premise of a latent class model is that cases in a population belong to two or more classes, and that these classes are latent because the class membership of a given case is not directly observed. A case's probability of being assigned a given label is assumed to depend on the case's latent class. In our case, this means that the probability of a word being removed depends on the class it belongs to, for example, whether it belongs to the class of "words in a relative clause" or to the class of "words in a Named Entity". The aim of the analysis is to estimate, for each latent class (relative, punctuation, cause, equality, etc.), the probability that a member of that latent class will be assigned a given label by a judge. If the probability of a word being removed by a judge is conditioned by its latent class, that is, if it differs when the word belongs to different classes, then we can consider that the null hypothesis is rejected.

In order to test these hypotheses against their corresponding null hypotheses, we carry out hypothesis tests, but we also obtain a number of measures of agreement between judges: the correlation between the judgements of different judges, inclusive and exclusive majority agreement, overall agreement and specific (positive and negative) agreement, and the standard kappa coefficient. We believe that if judges agree significantly above what would be expected by chance, the null hypothesis can be rejected.
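As a minimal sketch of this latent class reasoning (assuming an invented data layout with one row per word and toy values; the thesis does not prescribe this implementation), the per-class probability that a word is removed can be estimated as follows:

# Hedged sketch: estimate, for each latent class of words, the probability
# that a judge removes a word of that class. Classes, decisions and the data
# layout are invented for illustration only.
from collections import defaultdict

# (word_class, [judge decisions: 1 = removed, 0 = kept]) -- toy data.
words = [
    ("relative",   [1, 1, 0, 1]),
    ("relative",   [1, 0, 1, 1]),
    ("unmarked",   [0, 0, 0, 1]),
    ("cause",      [0, 0, 1, 0]),
    ("apposition", [1, 1, 1, 1]),
    ("unmarked",   [0, 1, 0, 0]),
]

removed = defaultdict(int)
total = defaultdict(int)
for word_class, decisions in words:
    removed[word_class] += sum(decisions)
    total[word_class] += len(decisions)

# P(removed | class): proportion of judge decisions that remove a word of
# that class. A homogeneous value across classes would support the null
# hypothesis; clearly different values support hypotheses 2a-2c.
for word_class in sorted(total):
    p = removed[word_class] / total[word_class]
    print(f"P(removed | {word_class}) = {p:.2f}")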

          n. of words   n. of sentences   n. of segment boundaries
apa 1     316           17                52
apa 2     1194          47                165
apa 3     235           8                 28
apa 4     649           21                72
apa 5     671           31                75
text 1    354           12                55
text 2    182           5                 25
text 3    797           30                115
text 4    151           3                 12
text 5    741           17                130
text 6    855           37                102
text 7    661           23                109
text 8    255           8                 45
text 9    593           20                83
total     7640          269               1068

Table 5.2: Size of the newspaper articles that were manually segmented by naive judges, and their distribution in terms of linguistic units.

5.2.1 Corpus and judges

35 judges carried out the subsentential reduction of a corpus of 14 newspaper articles dealing with various topics: finance, international and local news, film criticism, etc. The corpus contains a total of 7640 words in 269 sentences; the sizes and distribution of units across the corpus can be seen in Table 5.2. Each article was reduced by at least 4 judges.

Judges were not trained for the task; in most cases they were not even linguists, although most of them had some kind of university education. They carried out their judgements on paper or electronically, depending on their familiarity with electronic support. They marked those spans of text (or isolated words) that they considered removable. Judgements were transformed to the standard format shown in Figure 5.2. In this form it can readily be seen that the agreement of judges on removability increases near boundaries, especially when they are marked by punctuation.

In order to carry out a latent class analysis, as required to test hypotheses 2a, 2b and 2c above, different subsets of the texts were manually annotated so that each word was assigned to one of three different inventories of latent classes:

– discourse segments were identified following the definition provided in Section 3.3.2, making a difference between marked and unmarked segments. As explained in Section 3.3.2.2, the argumental core of a clause can be considered the unmarked discourse segment, while marked discourse segments are dominated by discourse markers, syntactical structures, etc.

[Figure 5.2 reproduces a Spanish newspaper text, one word per row, with one column of crosses per judge and annotations of potential boundary markers (punctuation, discourse markers, adjuncts, relative clauses, appositions, participles, reporting constructions).]

Figure 5.2: Example text segmented by human judges. Columns are judges, rows are words in the text. Crosses indicate that the judge in that column considered the word in that row as removable. Potential segment boundaries are marked, irrespective of whether judges have considered them as boundaries or not.

– for 5 of the articles (apa1 to apa5), discourse relations were manually annotated, as described in Section 5.3; segments were classified by the semantics of this relation.

– all textual evidence that was liable to mark an intrasentential segment boundary was marked. The items marked as evidence of a segment boundary outnumber the actual segments in the text, because a single segment can be marked by more than one linguistic feature, as in the following example, where the segment “, promoviendo una idea que sabe molesta para muchos colegas” is marked both by punctuation and an absolutive participle.

(1) Ahora está en plena cruzada, promoviendo una idea que sabe molesta para muchos colegas:

Ahora está en plena cruzada, promoviendo una idea que sabe molesta para muchos colegas:

punctuation        , promoviendo una idea que sabe molesta para muchos colegas:
participle         , promoviendo una idea que sabe molesta para muchos colegas:
context            , promoviendo una idea que sabe molesta para muchos colegas:
relative           que sabe molesta para muchos colegas:
discourse marker   para muchos colegas:
context            para muchos colegas:

As detailed in Table 5.2, in the part of the corpus that was tagged with this information we identified a total of 1068 items as textual evidence of segment boundaries, which implies a mean of 8.31 per sentence. The distribution of segments, of different kinds of segments (distinguished by form and by semantics), the total number of boundaries and the size of the segments under their scope can be seen in Tables 5.3 and 5.4.

(2) El problema, en su opinión, es que una teoría del todo no podrá explicar jamás algunos de los problemas más importantes de la ciencia de hoy, que tienen que ver con la aparición de propiedades nuevas en sistemas construidos con un número elevado de partículas.

Note that markedness does not necessarily imply a higher probability that the word is removed. A marked discourse segment can be accessory to the main content conveyed in a text, as in the previous example, but it can also provide crucial information, as in the following example. In this example, the underlined text is a marked segment that introduces an element that is crucial to understand the implications of the sentence.

(3) 'Gran parte de la física moderna se basa en creencias reduccionistas, más que en hechos experimentales, y esto será muy perjudicial para la ciencia a largo plazo'

Even despite the lack of correspondence between markedness and the probability of being removed, we believe that it is useful to study the behaviour of judges as conditioned by the markedness of segments. In contrast, we believe that marking segments in text by their removability may introduce bias in the judgement. Indeed, identifying "removable" segments is more subjective than identifying marked ones, and so it is very probable that the text that we would take as a reference for latent classes would itself be unstable. As a consequence, the comparison with judges may be error-prone. The distribution of marked segments in the corpus can be seen in Table 5.3.

          n. of marked   n. of words in     average marked
          segments       marked segments    segment length
apa 1     33             206 (65%)          6.2
apa 2     108            806 (67%)          7.4
apa 3     20             164 (70%)          8.2
apa 4     47             426 (65%)          9.0
apa 5     52             380 (56%)          7.3
text 1    36             201 (56%)          5.6
text 2    16             126 (69%)          7.9
text 3    73             400 (50%)          5.5
text 4    18             123 (81%)          6.8
text 5    75             509 (68%)          6.8
text 6    61             399 (46%)          6.5
text 7    48             260 (39%)          5.4
text 8    25             142 (55%)          5.7
text 9    51             306 (51%)          6.0

Table 5.3: Distribution of marked and unmarked segments in the texts manually segmented by naive judges.

The kinds of evidence that we have considered to distinguish segments by their form were:

punctuation words under the scope of a segment delimited by punctuation boundaries.

(4) Según los medios de comunicación, el jefe del Estado Mayor israelí, el general Saul Mofaz, propuso constituir una comisión de investigación sobre este incidente durante una comparecencia a puerta cerrada ante la comisión de Defensa y Asuntos Exteriores del Parlamento, dando a entender claramente también que se había cometido un error.

parentheses words under the scope of a segment delimited by parenthetical punctuation.

(5) Además, como no hay una única regla que describa estos sistemas – como desearía el ideal reduccionista –, sólo es posible entender su funcionamiento 'mediante la observación, y logrando que teoría y experimento se den la mano', afirma Laughlin.

reporting speech constructions words belonging to a construction that serves to introduce something that has been said, typically by another person.

(6) "Los militares no sabían que los miembros de la Fuerza 17 habían sido reemplazados por policías que no se dedicaban a actividades terroristas", ha añadido.

participle phrase words belonging to a phrase headed by a participle (past or present, the latter also known as a gerund) that does not form a compound verb together with an auxiliary verb.

(7) Según los medios de comunicación, el jefe del Estado Mayor israelí, el general Saul Mofaz, propuso constituir una comisión de investigación sobre este incidente durante una comparecencia a puerta cerrada ante la comisión de Defensa y Asuntos Exteriores del Parlamento, dando a entender claramente también que se había cometido un error.

relative clause words belonging to clauses subordinated by a relative pronoun.

(8) "Los militares no sabían que los miembros de la Fuerza 17 habían sido reemplazados por policías que no se dedicaban a actividades terroristas"

apposition words in a segment that rewords the information conveyed by the immediately previous segment, typically, an entity.

(9) Las palabras de Arafat, que en aquel momento se encontraba en Egipto, fueron contestadas de manera automática desde Jerusalén por el ministro de Asuntos Exteriores israelí, Simón Peres.

highly grammaticalized discourse markers words in a segment dominated by a highly grammaticalized discourse marker, which are usually ambiguous between sentential and discursive function.

(10) él explicó lo que estaba pasando y los tres se llevaron el Premio Nobel de Física en 1998.

discourse marker words in a segment dominated by a discourse marker

(11) La muerte de cinco policías palestinos abatidos por el Ejército israelí cerca de Ramala (Cisjordania) durante la noche del domingo al lunes pasado fue un "error" debido a "malas informaciones", según ha señalado un alto responsable israelí que ha pedido el anonimato.

Each of these segment types is distributed in the corpus (texts 1 to 9) as can be seen in Table 5.4.

type of segment            n. of segments   n. of words   average segment length
segments by form
punctuation                363              3249          8.9
parentheses                35               210           6.0
reporting                  46               213           4.6
participle                 49               358           7.3
relative                   122              1531          12.5
apposition                 79               801           10.1
discourse marker           287              2925          10.1
grammatical disc. mark.    96               1094          11.3
segments by semantics
symmetric                  118              1941          16.45
asymmetric                 364              2266          6.22
continuation               209              2178          10.42
elaboration                273              2009          7.36
context                    122              695           5.70
parallelism (equality)     106              1561          14.73
cause                      37               356           9.62
revision                   16               213           13.31

Table 5.4: Distribution of segment types in the texts manually segmented by naive judges.

5.2.2 Study of variance

We studied the variance in the task of removing irrelevant words from text, a good indicator of the stability of the annotation, which is a necessary requirement for results to be reproducible. Figure 5.3 displays the variance of words with respect to their probability of being removed, calculated as the proportion of judges that actually removed a given word. It can be seen that the behaviour of judges in removing words is not homogeneous, because the boxplot has long tails, and the standard deviation is rather high: sdev = .326. This high variance seems to indicate that words may belong to different classes, because judges have significantly different behaviours with respect to different words. Our target is to model the underlying reasons for differences across words, distinguishing different classes of words which present a lower internal variance of the behaviour of judges with respect to their removability.

Figure 5.3: Distribution of the different values of the probability that a word is removed.

In the first place, we need to test whether the behaviour of judges with respect to words is comparable enough to be exploited as the basis to infer classes of words and test their descriptive adequacy. To do that, we will first explore the variability of judges with respect to the ratio of reduction of texts by removing words. In Figure 5.4 we can see the distribution of the different values of the ratio of words that each judge removed in each text. We can see that most texts present a distribution of values close to normal, thus indicating that most judges behave in a comparable way. Some texts, like apa1, apa2, apa3, text3 or text6, have outliers, but this does not seem to strongly affect their standard deviation, thus indicating that the presence of outliers may be disregarded as a source of noise in obtaining classes of words. However, a big variance in the distribution of values, as in text9, seems to affect the use of the data for inducing models, as will be shown when we apply hypothesis testing in the next section.

Figure 5.4: Graphical representation of the distribution of the values for the ratio of words removed by each judge from a given text (from left to right: apa1, apa2, apa3, apa4, apa5, text1, text2, text3, text4, text5, text6, text7, text8, text9).
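A rough sketch of how these quantities can be computed is given below; the judgement matrix is invented toy data, and the thesis itself derives these figures from the annotated corpus with standard statistical software rather than this code.

# Hedged sketch: per-word removal probabilities and per-judge reduction
# ratios, the quantities summarized in Figures 5.3 and 5.4.
import numpy as np

rng = np.random.default_rng(1)

# One text: rows are judges, columns are words (1 = removed, 0 = kept).
judgements = rng.integers(0, 2, size=(6, 180))

# Probability that each word is removed = proportion of judges removing it.
p_removed = judgements.mean(axis=0)
print(f"mean = {p_removed.mean():.2f}, sdev = {p_removed.std():.3f}")

# Ratio of reduction per judge for this text (Figure 5.4 shows its
# distribution across judges and texts).
reduction_ratio = judgements.mean(axis=1)
print("per-judge reduction ratios:", np.round(reduction_ratio, 2))

# Quartiles and outlier bounds, as summarized by a boxplot (Figure 5.1).
q1, q3 = np.percentile(reduction_ratio, [25, 75])
iqr = q3 - q1
print(f"quartiles: {q1:.2f}-{q3:.2f}, outliers outside "
      f"[{q1 - 1.5 * iqr:.2f}, {q3 + 1.5 * iqr:.2f}]")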

5.2.3 Probability that a word is removed

Hypothesis 1 the probability that a word is removed is not homogeneous, but some words have a higher probability of being removed than others.

null hypothesis the probability that a word is removed is homogeneous across text.

The probability that a word is removed is the proportion of judges that actually removed that word in a text with respect to all the judges that reduced that text. We found that the overall probability that any word in this corpus is removed is .46, with a high variance among the probabilities for different words (a standard deviation of .32), which suggests that words may behave differently with respect to relevance according to the class they belong to.

In order to test this hypothesis, we need to find systematicities in the way judges remove words from text. If we cannot find any organization in the judgements, then we cannot use the data to support a theory of discourse organization. In order to find these systematicities, we first test whether the judgements are different from random. Then, we assess to what extent judges agree with each other. Finally, we combine these two approaches to get a better insight into the data.

text    t      p-value   degrees of freedom   n. judges   κ     agreement
                         (n. of words)
apa1    -1.9   .05       316                  6           .34   .67
apa2    -1.9   .04       1192                 5           .17   .57
apa3    -1.8   .06       233                  7           .23   .61
apa4    -2.0   .04       647                  5           .38   .68
apa5    -9.1   < .005    669                  3           .31   .69
text1   -1.8   .06       336                  4           .41   .70
text2   -1.7   .09       188                  8           .26   .63
text3   -1.3   .20       809                  6           .20   .59
text4   -2.0   .04       161                  5           .34   .67
text5   -1.9   .05       747                  6           .15   .57
text6   -6.3   < .005    862                  5           .26   .60
text7   -8.0   < .005    693                  5           .14   .59
text8   -7.1   < .005    273                  11          .28   .69
text9   -0.1   .95       598                  8           .29   .63

Table 5.5: Results of applying hypothesis testing to the actual probability that a word is removed against a probability produced at random.

5.2.3.1 Hypothesis testing

We want to test whether the actual distribution of the probabilities of words being removed could have been produced at random or whether it is conditioned by some underlying organization of the text. In order to do that, for each word in a text, we compared the actual probability that it was removed with a probability produced randomly.

As can be seen in Table 5.5, the results obtained from applying the matched samples t-test provide enough evidence to reject the null hypothesis, although not strongly, because significance values are usually not significantly below p = .05, and in some cases they are well above (text3, text9). Nevertheless, the value of t is almost always beyond the critical value, which is tcrit = ±1.645 for p = .05, as described in Table 5.1. It is noteworthy that when results are more reliable, that is, for lower values of p, the absolute value of t increases significantly (apa5, text6, text7, text8). It must also be observed that significance drops dramatically when the judgements for a given text differ, as is the case of text9, discussed in the previous section. These results indicate that the behaviour of judges differs from random, although it is still rather chaotic. More significant conclusions could be obtained from a bigger corpus.

majority opinion   overall agreement   agreement on removability   agreement on non-removability   Kappa coefficient
0.80               0.68                0.57                        0.72                            0.26

Table 5.6: Different measures of agreement between judges for removability of words.

judge   majority opinion   overall agreement   agreement on removability   agreement on non-removability   Kappa coefficient
o1      .82                .65                 .53                         .70                             .27
e2      .89                .79                 .69                         .84                             .54
l2      .83                .68                 .72                         .61                             .34
n1      .86                .72                 .63                         .73                             .44
e1      .76                .53                 .51                         .55                             .08

Table 5.7: Measures of agreement in judging the removability of words of the same texts between the same judge and herself two years later.

5.2.3.2 Agreement between judges

Besides providing an insight into the behaviour of judges, measures of agreement between judges are good indicators of the reproducibility of the annotation. In the first place, we applied the standard measures of agreement between judges described in Section 5.1. As can be seen in Table 5.6, the agreement between judges is rather low. The value of Kappa, even if it is above chance agreement (0), is far below the threshold that indicates stable and reproducible results (.68, Carletta 1996). The proportion of agreement (overall agreement) and the proportions of specific agreement (positive and negative agreement, that is, agreement on removability and on non-removability, respectively²) are also close to what could be expected by chance.

²Proportion of positive agreement: agreement on removability, that is, the proportion of words for which judges agree that they have to be removed; proportion of negative agreement: agreement on non-removability, that is, the proportion of words for which judges agree that they do not have to be removed.

We also carried out an experiment in which, one year later, some judges were asked to remove words from the same texts that they had already reduced. The measures for the agreement of a judge with herself a year later can be seen in Table 5.7. It can be seen that, even if some judges show a higher agreement than the average agreement between different judges, they are still far from reaching the levels that indicate stability.

The agreement of judges with respect to the words to be eliminated is higher than could be expected by chance, but it is still far from values that would indicate good reproducibility of results. However, if we analyze the patterns of behaviour of judges in detail, it can be seen that groups of judges can be distinguished according to their behaviour. As can be expected, the agreement between judges is higher for judges with the same behaviour.

In Table 5.9 we display the agreement between judges by giving the correlation coefficient between the judgements of different judges for the same text. The cases where significance values were above p = 0.005 have been marked in boldface. It can be seen that lower values of significance correspond to judges with lower correlation (that is, with lower agreement) with other judges. Interestingly, cases of low correlation are not isolated; it tends to happen that judges with a low correlation with some other judge present low correlations with most of the judges in a given text. In many cases, the highest correlations of low-correlation judges occur with the other judges that also have low correlations in the same text, as can be seen by the boxed figures in texts 2 and 8.

In order to test whether it makes sense to separate judges into groups, as if judges within a group had the same pattern of behaviour, we separated the judges in text 8 into two groups: high-correlation (judges e2, g1, i1, l2, n1, o1, s1, v1) and low-correlation (judges j1, m1, m2). We applied the t-test and the measures of agreement to each group separately; the results can be seen in Table 5.8. As could be expected, the measures of agreement were higher for high-correlation judges, but the value for the t-test was higher for low-correlation judges.

text                 t       p-value   degrees of freedom   n. judges   κ     agreement
                                       (n. of words)
text8 - total        -7.1    < .005    273                  11          .28   .69
text8 - high corr.   -4.4    > .005    273                  8           .39   .72
text8 - low corr.    -11.5   > .005    273                  3           .14   .69

Table 5.8: Results of applying hypothesis testing to the subgroups of high- and low-correlation judges, following the correlations shown in Table 5.9.
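For illustration, a pairwise correlation matrix like the one in Table 5.9 can be computed as sketched below; the judge names and decisions are toy values, not the thesis data.

# Hedged sketch: pairwise correlation between judges' removal decisions for
# one text, analogous to the figures displayed in Table 5.9.
import numpy as np

judges = {
    "e2": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    "g1": [1, 0, 0, 0, 1, 1, 1, 0, 1, 0],
    "l2": [1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
}

names = list(judges)
decisions = np.array([judges[n] for n in names])

# Pearson correlation between every pair of judges (np.corrcoef computes the
# full correlation matrix over the rows of the input).
corr = np.corrcoef(decisions)

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} - {names[j]}: r = {corr[i, j]:.2f}")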

It seems clear that separating judges into groups with homogeneous behaviours may be useful to reduce noise in the data and obtain a more accurate model. However, in the following experiments we have not separated judges, because we believe that there is little evidence to carry out such a separation with enough reliability, even in texts like text 8, with many judges and big differences between them.

text 3
       a3     d2     j2     m4     v2
a1     .10    .10    .04    .17    .12
a3            .27    .21    .23    .64
d2                   -.13   .39    .20
j2                          .26    .24
m4                                 .24

text 5
       e1     g1     l1     m5     v2
d2     .26    .10    .18    .25    .28
e1            .03    .10    .25    .31
g1                   -.03   .04    .25
l1                          .10    .12
m5                                 .14

apa 1
       b2     c1     d3     d4     g3     l3
a5     .20    .35    .44    .32    .18    .46
b2            .28    -.11   .51    .32    .19
c1                   .31    .23    .16    .32
d3                          .20    .08    .34
d4                                 .22    .40
g3                                        .36

apa 3
       d3     d4     g3     i1     n1
b2     .47    .52    .20    .34    .52
d3            .40    .13    .43    .55
d4                   .40    .20    .27
g3                          .03    .12
i1                                 .81

text 2
       g2     i1     l2     m4     m6     n1     o1
e1     .53    -.07   .34    .78    -.06   .23    .27
g2            .06    .08    .54    -.15   .43    .52
i1                   .34    .17    .48    .26    .38
l2                          .52    .14    .35    .45
m4                                 -.04   .30    .41
m6                                        .12    .14
n1                                               .50

text 9
       e2     l2     m1     m3     o1     s1     v1
b1     .42    .61    .27    .35    .30    .49    .53
e2            .53    .35    .31    .55    .28    .57
l2                   .24    .31    .49    .34    .42
m1                          .42    .04    .12    .29
m3                                 .25    .18    .48
o1                                        .23    .46
s1                                               .31

text 8
       g1     i1     j1     l2     m1     m2     n1     o1     s1     v1
e2     .26    .31    .16    .73    -.01   .35    .44    .61    .72    .69
g1            .35    .02    .15    .27    .44    .20    -.09   .49    .23
i1                   -.03   .31    .16    .51    .53    .19    .39    .37
j1                          .03    .40    -.05   .11    .31    .04    -.01
l2                                 .01    .35    .39    .49    .61    .62
m1                                        .30    .08    .11    -.01   -.01
m2                                               .31    .19    .44    .45
n1                                                      .45    .55    .49
o1                                                             .40    .40
s1                                                                    .67

Table 5.9: Correlation of judgements for each pair of judges that have judged the same text (considering only texts judged by more than 5 people); each judge is identified by a unique combination of a letter and a number.

5.2.4 Latent class analysis

Hypothesis 2 the probability that a word is removed is conditioned by the fact that it belongs to a certain latent class.

Hypothesis 2a words have different probabilities of being removed conditioned by the fact that they belong to a marked or an unmarked segment.

Hypothesis 2b words have different probabilities of being removed conditioned by the fact that they are marked by different textual cues.

Hypothesis 2c words have different probabilities of being removed conditioned by the fact that they belong to segments with different discourse semantics.

null hypothesis the probability that a word is removed is homogeneous across classes.

We assume that each word can be classified into a latent class that explains its role with respect to keeping the relevance and coherence of the text in summarization. In our case, we explore how well the classes motivated theoretically in Chapter 4 can group words with a homogeneous role, while distinguishing heterogeneous words. We understand that homogeneity in the role of words is reflected in a homogeneous behaviour of judges with respect to these words, so that both the probability that a word is removed and inter-judge agreement are different across classes. We have explored the following kinds of theoretically motivated classes:

1. marked vs. unmarked segments,
2. segments distinguished by their surface form,
3. segments distinguished by their discourse semantics, and
4. segments distinguished by their surface form and their discourse semantics.

5.2.4.1 Marked vs. unmarked segments

In this section we check whether the probability that a word is removed is affected by the fact that it occurs in a marked vs. an unmarked segment, in other words, whether the human judgements about the relevance of words can be modelled by considering the degrees of markedness of discourse segments as latent classes. As explained in Section 3.3.2.2, typical unmarked discourse segments are the argumental core of clauses, while marked discourse segments are dominated by marked syntactical structures (e.g., hypotactic constructions), lexical pieces (e.g., discourse markers) or prosodic positions (e.g., dislocated to the left).

            average   a1    a2    a3    a4    a5    t10   t2    t3    t4    t5    t6    t7    t8    t9
total       .44       .56   .52   .45   .46   .30   .55   .44   .51   .40   .46   .41   .35   .30   .49
marked      .54       .61   .57   .51   .57   .37   .74   .52   .65   .49   .52   .55   .40   .46   .67
unmarked    .26       .36   .34   .22   .18   .17   .30   .25   .36   .16   .32   .29   .32   .12   .30

Table 5.10: Ratio of reduction of words, specifying the ratios for words belonging to marked and unmarked segments.

Figure 5.5: Distribution of the probabilities that a word is removed, across all words (leftmost box, sdev = .326) and words belonging to unmarked (center, sdev = .296) and marked (rightmost, sdev = .298) segments.

To begin with, the probability that a word is removed is different in marked and unmarked segments. As can be seen in Table 5.10, words within a marked segment have a probability of .54 of being removed, whereas words in unmarked segments have a probability of .26. This indicates that a property of the latent class marked is that words belonging to this class are perceived as less relevant than words belonging to the class of unmarked. It is also interesting to note that the classes of marked and unmarked are slightly more homogeneous with respect to the perception of relevance than all classes considered indistinguishably, as can be seen in the boxplots in Figure 5.5. The length of the boxes is proportional to the variability in the probability that a word is removed: across all words, the standard deviation of this probability is sdev = .326, while it is sdev = .249 within the class of unmarked words and sdev = .303 within marked. In all cases, the variability is high, but the fact that it is smaller within classes indicates that dividing words into latent classes is useful to group homogeneous words.

It follows that latent classes are also useful to distinguish heterogeneous words. The classes of marked and unmarked segments can be considered heterogeneous because the words they contain are perceived differently by judges: words in marked segments have a higher probability of being removed. Indeed, we obtained t = −39.24 (with p < .0005) for a t-test comparing the distributions of the probabilities of being removed for words in marked and unmarked segments, thus rejecting the null hypothesis that the two distributions are equal. The value of t is smaller when comparing either marked or unmarked to all the words without distinction (t = −26.10 for marked, t = −17.77 for unmarked).

           majority opinion   overall agreement   specific removability   specific non-removability   Kappa
total      .80                .68                 .57                     .72                         .26
marked     .75                .61                 .62                     .53                         .22
unmarked   .79                .67                 .33                     .75                         .27

Table 5.11: Inter-judge agreement measures for removability, distinguishing marked and unmarked discourse segments.

In addition, we found that measures of agreement between judges are also sensitive to the classification of words into marked and unmarked. As seen in Table 5.11, judges tend to agree about the relevance of words more in unmarked segments than in marked ones, although with only small differences. However, if we take a closer look at the values for specific agreement, we see that there is a difference of 42 points between positive (.33) and negative (.75) specific agreement for unmarked segments, whereas the difference for marked segments is only 9 points. This seems to confirm the idea that words in marked and unmarked segments are perceived differently. Indeed, the majority of judges agree on which are the relevant (that is, non-removable) words in unmarked segments, but there is much disagreement on which words are removable; within marked segments, there is also disagreement on which words are relevant.

From these results we can conclude that the classes of words belonging to marked and unmarked segments are indeed distinct. However, the high variability in the perception of relevance by judges within these classes, especially within the class of words belonging to a marked segment, indicates that these are not the optimal latent classes for the task of summarization. In the following sections we will subdivide these classes in order to find an organization of the data that provides a more adequate description of the perception of relevance by judges.

5.2.4.2 Discourse segments typified by surface form

In this section we check whether the probability that a word is removed depends on the surface form of the segment where it occurs, in other words, whether judgements about the relevance of words depend on the form of their containing discourse segment. This implies a division of the class of words belonging to marked segments into subclasses, while the class of unmarked words remains undivided. Let us recall that we have manually identified the following kinds of discourse segment:

average a1 a2 a3 a4 a5 t10 t2 t3 t4 t5 t6 t7 t8 t9
total .44 .56 .52 .45 .46 .30 .55 .44 .51 .40 .46 .41 .33 .30 .49
punctuation .58 .71 .62 .68 .65 .45 .70 .63 .67 .51 .49 .61 .43 .38 .60
parentheses .78 .97 .92 1 1 .52 .75 .83 .62 .83 .08 .60 .63
reporting .62 .59 .63 .66 .74 .46 .77 .40 .58 .77 .55 .86 .47
participle .60 .66 .66 .74 .87 .53 .54 .55 .47 .74 .50 .30 .69
relative .58 1 .60 .64 .73 .41 .81 .65 .69 .20 .51 .58 .46 .34 .62
apposition .74 .97 .76 .81 .91 .93 .91 .72 .82 .64 .39 .79 .58 .35 .85
disc. mark. .51 .56 .58 .52 .53 .37 .75 .58 .62 .41 .44 .49 .37 .42 .56
gram. disc. .49 .60 .62 .61 .68 .28 .55 .51 .55 .56 .46 .34 .27 .42

Table 5.12: Ratio of reduction of discourse segments distinguished by their form.

– punctuation
– parentheses
– reporting speech
– participles
– relatives
– appositions
– highly grammaticalized discourse markers
– discourse markers

As can be seen in Table 5.12, the probability that a word is removed is significantly different across classes. In some classes (parentheses, apposition) the probability that a word is removed is high, in some others (punctuation, reporting, participle, relative) it is slightly higher than 50%, and in the classes of words dominated by a discourse marker or a highly grammaticalized discourse marker it is very close to the mean over all word classes taken indistinctly. Interestingly, the probability that a word occurring in a formally marked discourse segment is removed is always higher than the average.

In Figure 5.6 we can see the variability in the probability that a word is removed within these classes of words. In all cases, the variability in classes of words marked by a surface clue is smaller than over all words taken indistinguishably (leftmost box), so words grouped in these classes are more homogeneous. However, the high intra-class variability shows that heterogeneous words are not properly distinguished.

As can be seen in Table 5.13, measures of agreement between judges, especially the kappa coefficient, reflect the heterogeneity of the words contained in different classes. Classes with homogeneous words present higher levels of agreement (parentheses, reporting, relative, apposition), while classes grouping heterogeneous words present lower agreement, even below the average agreement for all words indistinguishably (discourse marker, punctuation, participle). It has to be noted that the classes with the highest agreement, parentheses and reporting, present a very limited coverage of the texts. These results indicate that dividing words into classes determined by their form provides a model of how judges perceive relevance that is better than the null hypothesis, but still far from optimal.

Figure 5.6: Distribution of the probabilities that a word is removed within segments distinguished by their form, from left to right: all words indistinguishably (sdev = .326), segments marked by punctuation (sdev = .307), parentheses (sdev = .190), reporting (sdev = .304), participle (sdev = .265), relative (sdev = .292), apposition (sdev = .301), a highly grammaticalized discourse marker (sdev = .294) or a discourse marker (sdev = .316).

                     majority  overall    specific      specific          Kappa
                     opinion   agreement  removability  non-removability
total                  .80       .68        .57           .72             .26
punctuation            .77       .63        .65           .50             .23
parentheses            .81       .69        .68           .10             .77
reporting              .73       .59        .56           .34             .51
participle             .75       .61        .61           .40             .25
relative               .78       .64        .61           .44             .32
apposition             .83       .72        .73           .25             .38
discourse marker       .76       .62        .60           .55             .24
gram. disc. mark.      .76       .62        .56           .56             .28

Table 5.13: Inter-judge agreement measures for removability, distinguishing segments by their form.
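The agreement figures in Tables 5.13 and 5.15 can be approximated along the following lines. The sketch below uses Fleiss' multi-rater kappa over binary remove/keep decisions, which is a common choice for this setting but not necessarily the exact coefficient computed in this chapter; the input encoding is again hypothetical:

    def agreement(ratings):
        """Observed pairwise agreement and Fleiss' kappa for binary
        remove/keep decisions.

        `ratings` is a list of (n_removed, n_kept) pairs, one per word,
        with every word judged by the same number of raters."""
        n_items = len(ratings)
        n_raters = sum(ratings[0])
        # Chance agreement from the overall category proportions.
        p_removed = sum(r for r, _ in ratings) / (n_items * n_raters)
        expected = p_removed ** 2 + (1.0 - p_removed) ** 2
        # Mean proportion of agreeing rater pairs per word.
        observed = sum(
            (r * (r - 1) + k * (k - 1)) / (n_raters * (n_raters - 1))
            for r, k in ratings
        ) / n_items
        kappa = (observed - expected) / (1.0 - expected)
        return observed, kappa

    # Example: three words, each judged by five raters.
    print(agreement([(5, 0), (3, 2), (1, 4)]))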

              average  a1    a2   a3   a4   a5
total           .40    .56   .44  .33  .38  .30
symmetric       .41    .53   .47  .43  .36  .30
asymmetric      .43    .57   .46  .44  .41  .27
elaboration     .48    .64   .53  .46  .47  .33
continuation    .41    .53   .45  .41  .39  .29
context         .47    .72   .56  .32  .49  .30
equality        .41    .55   .47  .42  .36  .29
cause           .27    .46   .03  .35  .27
revision        .56   1.00   .50  .50  .46  .36

Table 5.14: Ratio of reduction of discourse segments of different semantics.

5.2.4.3 Discourse segments typified by their semantics

In this section we check whether the probability that a word is removed depends on the discourse semantics of the segment where it occurs, in other words, whether judgements about the relevance of words depend on whether their containing segment conveys a causal relation, revision, equality, etc.

In order to determine the semantics of discourse segments, we have used the corpus presented in Section 5.3, where three independent judges identified relations between discourse segments and labelled their semantics. To create a corpus annotated with semantic features, segments were labelled with those semantic features on which at least two judges agreed. A more detailed description of the list of semantic features can be found in Table 5.18. Note that the set of features used in this annotation has two features (symmetric and asymmetric) that have not been included in the final inventory explained in Chapter 4, for reasons that will be explained in the following section.

As can be seen in Table 5.14, the probability that a word is removed does not differ significantly across classes. Words occurring in segments of cause have a 13% lower probability of being removed than the average, and words occurring in segments of revision a 16% higher one. Context and elaboration present probabilities slightly higher than the average. This indicates that the semantics of segments does not seem to be a good indicator of how the relevance of words is perceived.

In Figure 5.7 we can see that the variability in the probability that a word is removed is high within classes; in some cases it is even higher than when considering all words indistinguishably (asymmetric, elaboration, continuation, equality, cause). Therefore, these classes seem to contain heterogeneous kinds of words. However, as seen in Table 5.15, the kappa coefficient of inter-judge agreement is rather high for the marked cases of the semantic dimension of discursive meaning (cause, revision and equality, see Section 4.2.3.1.2), while the least marked case, context, presents lower agreement than considering all words indistinguishably. Also, the most marked case in the structural dimension, continuation, has higher agreement than the least marked, elaboration.

Figure 5.7: Distribution of the probabilities that a word is removed within segments distinguished by their discursive semantics, left to right: all words (sdev = .348), symmetric (sdev = .343), asymmetric (sdev = .350), elaboration (sdev = .353), continuation (sdev = .351), context (sdev = .340), equality (sdev = .358), cause (sdev = .355), revision (sdev = .296).

              majority  overall    specific      specific          Kappa
              opinion   agreement  removability  non-removability
average         .82       .70        .56           .76             .28
symmetric       .78       .65        .57           .67             .28
asymmetric      .79       .65        .58           .68             .28
elaboration     .78       .64        .62           .63             .27
continuation    .80       .67        .59           .70             .32
context         .77       .63        .58           .58             .20
equality        .81       .69        .63           .71             .37
cause           .87       .76        .43           .81             .53
revision        .73       .57        .47           .34             .48

Table 5.15: Inter-judge agreement measures for removability of segments of different semantics.

5.2.4.4 Discourse segments typified by their semantics and their form

In the two previous sections we have tried to distinguish subclasses within the class of words belonging to marked discourse segments. We have found that the form of discourse segments is useful to distinguish words with heterogeneous behaviour, but only in highly marked cases with very little coverage of real corpora. We have found that the semantic features of discursive meaning group words on which judges show high degrees of agreement with respect to relevance, but the resulting classes group very heterogeneous words. In this section we try to combine semantic and formal features to find classes of words

– on which judges show good degrees of agreement,
– grouping words that are homogeneous with respect to how relevant they are judged by humans, and
– with a reasonable coverage of the corpus.

We have carried out a preliminary study of the combination of features, studying the behaviour with respect to ratio of reduction and inter-judge agreement for some combinations of semantic and formal features. In this preliminary study we have considered 17 classes, combining punctuation and discourse markers with:

– the features of the semantic dimension of discursive meaning (context, equality, cause, revision),
– hypotactic constructions (participle, relative, apposition), and
– discourse markers and grammaticalized discourse markers.

In Table 5.16 we can see the mean probability that a word is removed for each of these classes. Between parentheses we give the probability that a word is removed in the corresponding class without the co-occurrence of punctuation and/or discourse markers. The same information is displayed graphically in Figures 5.8 and 5.9; the second one focuses on the differences in combinations of semantic features with surface form features.

We can see that, in virtually all cases, the probability that words are removed is higher in classes characterized by more than one feature than in the corresponding classes with a single feature. For formal features the increase is rather small, but it is quite noticeable for semantic features, especially in their co-occurrence with punctuation. Interestingly, only one exception to this increase can be found: words within the class of segments characterized by revision present a higher probability of being removed when they do not co-occur with any other feature. We suspect that this is because revision is the most marked feature in the dimension of semantic discursive meaning, so additional marking cannot be accumulated, but has an interpretative effect. However, this claim is rather speculative and further research is needed to confirm it.

As seen in Table 5.17 and graphically in Figures 5.10 and 5.11, the kappa coefficient shows that the accumulation of evidence increases the agreement between judges, although no general explanation can be found for this increase.

Figure 5.8: Differences in compression rates for segments characterized by a single feature vs. segments characterized by various features.

Figure 5.9: Differences in compression rates for segments characterized by semantic features alone or in combination with surface form features.

                                   average  a1   a2   a3   a4   a5
total                                .40    .56  .44  .33  .38  .30
punctuation
  + participle                    .62 (.60) .77  .58  .50  .66
  + relative                      .58 (.58) .50  .63  .56  .66
  + apposition                    .79 (.74) .97  .71  .67  .86  .78
  + disc. mark.                   .51 (.51) .59  .54  .60  .44  .38
  + gram. disc. mark.             .68 (.49) .72  .55  .79
  + context                       .71 (.47) .84  .63  .73  .66
  + equality                      .54 (.41) .89  .45  .45  .38
  + cause                         .57 (.27) .50  1    .23
  + revision                      .39 (.56) .23  .50  .33  .52
discourse marker
  + context                       .46 (.47) .76  .42  .41  .28
  + equality                      .46 (.41) .53  .47  .47  .40
  + cause                         .37 (.27) .46  .03  .73  .27
  + revision                      .47 (.56) .37  .50  .49  .52
punctuation and discourse marker
  + context                       .80 (.47) .84  1    .57
  + equality                      .43 (.41) .40  .46  .51  .36
  + cause                         .56 (.27) .47  1    .23
  + revision                      .39 (.56) .23  .50  .33  .52

Table 5.16: Ratio of reduction of words occurring in different kinds of discourse segments distinguished by their semantics and their form. Between parentheses, the ratio of reduction of each feature alone, without the co-occurrence of punctuation or discourse markers.

For example, co-occurrence with punctuation significantly increases agreement between judges for words within the classes of participle and highly grammaticalized discourse marker, but produces virtually no difference for other classes, probably because they usually co-occur with punctuation (apposition), because they are already highly marked inherently (discourse markers) or because their behaviour is due to other factors, like determination (relative).

As detailed in Figure 5.11, semantic features are affected differently by co-occurrence with other evidence: the most marked case, revision, presents higher measures of agreement when co-occurring with punctuation, while it shows slightly less agreement when co-occurring with discourse markers. The least marked case, context, presents the opposite pattern: higher values of kappa when co-occurring with discourse markers or with discourse markers and punctuation, while agreement drops if only co-occurrence with punctuation is considered. The same holds for equality, a less marked case as well, but with smaller differences in agreement. Cause shows higher measures of agreement when co-occurring with any kind of evidence.

Figure 5.10: Differences in the kappa coefficient for inter-judge agreement for segments characterized by a single feature vs. segments characterized by various features.

Figure 5.11: Differences in the kappa coefficient for inter-judge agreement for segments characterized by semantic features alone or in combination with surface form features.

                                   majority  overall    specific      specific          Kappa
                                   opinion   agreement  removability  non-removability
total                                .82       .70        .56           .76             .28
punctuation
  + participle                       .66       .45        .44           .15             .62 (.25)
  + relative                         .70       .48        .50           .35             .33 (.32)
  + apposition                       .82       .68        .76           .10             .36 (.38)
  + gram. disc. mark.                .83       .69        .77           .36             .39 (.28)
  + disc. mark.                      .76       .62        .62           .54             .26 (.24)
  + context                          .79       .62        .72           .29             .09 (.20)
  + equality                         .83       .71        .68           .61             .32 (.37)
  + cause                            .89       .80        .71           .49             .57 (.53)
  + revision                         .70       .53        .27           .50             .89 (.48)
discourse marker
  + context                          .78       .64        .55           .58             .37 (.20)
  + equality                         .81       .69        .67           .67             .39 (.37)
  + cause                            .89       .82        .54           .84             .65 (.53)
  + revision                         .69       .52        .43           .48             .47 (.48)
punctuation and discourse marker
  + context                          .87       .75        .82           .11             .48 (.20)
  + equality                         .76       .62        .52           .60             .44 (.53)
  + cause                            .89       .81        .71           .50             .59 (.37)
  + revision                         .70       .53        .27           .50             .89 (.48)

Table 5.17: Inter-judge agreement measures for removability, distinguishing segments by their semantics and their form. Between parentheses, the kappa coefficient of each feature alone, without the co-occurrence of punctuation or discourse markers.

From this preliminary study combining different features to characterize classes of words we can conclude that combining different features may contribute to improving our model of the behaviour of human judges in removing words, because we have obtained classes where inter-judge agreement is rather high. However, this study can only be taken as a starting point. Two main enhancements seem crucial to obtain a model that is reliable enough to be used in real-world applications: creating a bigger corpus, so that the representativity of word classes is greater and we can evaluate whether they actually contain heterogeneous or homogeneous words, and describing the behaviour of all possible combinations of features.

From this study of a corpus of manually annotated texts, it can be concluded that modelling the discursive structure of texts allows us to better model the probability that a word is removed, and that the judgements of humans on the relevance of words seem to be a good source of information to infer and test such a model. Indeed, judges agree somewhat more than what could be expected by chance, so the null hypothesis that they agree by mere chance can be rejected. More interestingly, if we observe the behaviour of judges in removing words, we can find groups of words that tend to be removed, words that tend not to be removed and words that different judges judge differently. These groups of words seem to correspond to some of the classes we proposed in the theory of discourse organization presented in the previous chapter. This seems to indicate that the task of eliminating intrasentential chunks from a text according to their relevance can benefit from our theory of the organization of discourse.

Future work includes studying in depth the combination of the different kinds of information we have considered (semantic and formal) to obtain a classification of words that maximizes both the estimation of the probability that a word is removed and intra-class agreement, or that at least allows us to identify classes with high inter-judge agreement. In any case, working with a bigger corpus and a larger number of judgements will definitely improve the statistical significance of the results, and it will allow us to draw stronger and more accurate conclusions.

5.3 Human judgements on discourse relations

In this section we present the manual identification and description of discourse relations in a corpus of newspaper articles. The purpose of this annotation was twofold: first, we wanted to test the empirical adequacy of the proposed discourse relations, by seeing whether naive judges could agree on them significantly; second, we wanted to consolidate an annotation procedure in order to create a bigger annotated resource in the future. In what follows we describe the corpus and judges and the annotation schema, relating it to previous work, and then we discuss the consistency of the resulting annotation.

5.3.1 Corpus and judges

Three independent judges identified relations between discourse segments after being instructed about the annotation schema described in the following section. Of the three judges, two were naive judges with no background in linguistics, and one was a linguist. The purpose of having naive judges annotate the corpus was to check how intuitive the proposed features were. After each text was annotated, the annotation was discussed among the judges and helpful criteria were included in the annotation schema, but the annotated text was not modified.

The corpus consisted of six newspaper articles, with some variation in subgenres: television criticism, everyday stories, and opinion. We chose a varied corpus in order to check the applicability of the features in different kinds of text. The corpus is annotated in XML (see Figure 5.12), and can be found at http://lingua.fil.ub.es/~lalonso/discor/.

Texts had been manually pre-segmented into minimal discourse units, discourse segments and discourse markers. Judges had to identify the relations they held with other segments and the meanings of these relations. The corpus consists of 6 articles (one of which was used for training) totalling 3541 words (from 207 to 1042 words) in 154 sentences. There are 468 intrasentential segments, of which 84 are required by the argumental structure of the verb (mainly subjects and objects of verbs of thinking or saying) and the rest are different kinds of adjuncts, of which 226 are dominated by a discourse marker. 261 discourse markers have been found, corresponding to 101 different forms; the most frequent are y (and, 50 occurrences), para (for, to, 20 occurrences), como (like, as, 15 occurrences) and pero (but, 9 occurrences). 56 segments were relative clauses, 43 prepositional phrases, 21 list members, 16 absolute participle constructions (past and present), and so forth.

5.3.2 Annotation schema

In contrast with the experiment presented in the previous section, the identification of relations between discourse segments requires some training, because the task is not a natural task of interpretation of language. So, the judges had to be instructed with respect to the kinds of relations to be identified in text and the inventory of meanings with which to tag relations. Even if the base concepts are intuitive, their formalization has to be made explicit in order to make annotations comparable. In order to carry out this training of the judges, we developed an annotation schema.

Various annotation schemata have been developed in order to improve the consistency of annotation and to reduce the annotation cost (Discourse Resource Initiative 1997; Cooper et al. 1999; Carlson and Marcu 2001). Existing schemata are mostly based on Rhetorical Structure Theory (RST) (Mann and Thompson 1988), at different levels of granularity and for different languages (Carlson et al. 2003; Potsdam Corpus 2004), or are focussed on the annotation of dialogue (Carletta et al. 1996; Core and Allen 1997).

In contrast with other approaches, we did not carry out an intensive training of judges; decisions were taken basically relying on their intuitions, in order to test how intuitive the proposed features are. However, as an aid to guide decision-taking, an annotation manual was created (Alonso et al. 2004a), and a preparatory text was annotated collaboratively. The manual was enhanced and refined during annotation, with issues that were specially controversial, like the annotation of discontinuous discourse markers (so... that...) or an effective procedure to systematize the assignment of discursive features to syntactical constructions like relative clauses or absolute participles.

The procedure for annotation was as follows: once a judge had read the whole text, discourse segments were characterized one by one in order of occurrence in the text, by the following features: node(s) of attachment, features of meaning, and glosses for the selected features.

[1.2 [1.2.1 [1.2.1.1 Pese a ] sus apellidos, ] Pedrosa Urquiza no debía de ser un "buen vasco" ] [...] [1.7 Quizás por eso le han matado ] [...]

[1.2[1.2.1[1.2.1.1 Despite ] his surnames, ] Pedrosa Urquiza mustn’t be a “good basque” ] [...] [1.7 Maybe that’s why he has been killed ] [...]


Figure 5.12: Example of XML annotation of a sentence (1.2), a sub-sentential discourse segment (1.2.1) and a discourse marker (1.2.1.1).

5.3.2.1 Node of attachment

The value for this attribute depends on the kind of discourse unit: segments are assigned the identification number(s) of the node(s) where they are attached in the tree-like structure of discourse. In contrast, discourse markers are assigned the identification numbers of the nodes they relate. These nodes can be minimal discourse segments or discourse units constituted by continuous spans of discourse segments. However, segments with a relating function (as in example (12)) are assigned the same kind of value as a discourse marker, that is, the identification numbers of the segments they relate. Conversely, discourse markers that do not relate discourse segments, like those relating the text to the author (for example luckily or of course), are attached to the discourse segment where they are found.

(12) No satisfecho con todo ello, el ciudadano vasco Urquiza desoía a su partido y cada día se tomaba sus potes y vermuts en el "batzoki" del PNV de Durango.
Not happy with all this, the Basque citizen Urquiza didn't listen to his party and every day he had his "potes" and vermouths in the "batzoki" of the PNV in Durango.

Each segment is related to at least one other segment. When there is no clear attachment node or when the segment is the starting point or main claim of a text, it is considered the top of a local structure, and is attached to itself.

Each segment carries the information for its own attachment to other segments, but none for the attachment of other segments to it. The most marked segment in a relation is the one that carries the information: usually, segments under the scope of a discourse marker or characterized by redundant lexical or referential expressions. If there is no difference in markedness between segments, those coming later in discourse are the ones to carry the information. Note that markedness has no relation to RST's nuclearity, since marked segments can be the nucleus or the satellite of a relation.

5.3.2.2 Features of meaning

We established eight basic features to describe the meaning of discourse relations, summarized in Table 5.18. One of the aims of this corpus annotation was to test the preliminary inventory of discursive meanings in order to settle the final inventory presented in Chapter 4. Two of the features of the preliminary inventory (symmetric and asymmetric) were found irrelevant after this preliminary experiment and were not included in the final inventory. Features were organized in three dimensions of meaning inferred from co-occurrences of prototypical discourse markers (Alonso et al. 2003d):

sentence-structural (symmetric, asymmetric): the percolation of sentential syntax to discourse, absent from our final proposal for reasons of economy and adequacy of representation discussed in Section 4.2.2.

discourse-structural (continuation, elaboration)

semantic (context, causality, equality, revision)

feature        discursive effect
context        provides the setting for a discourse entity
               People started demonstrating as soon as the war began.
equality       establishes an equivalence between two elements (A is (mod) B)
               Some other governments supported the war, as in Spain.
causality      elicits a causal relation between two elements
               They lost the elections because they manipulated information.
revision       negates some previous information, explicit or implied
               No weapons of mass destruction were found, but Iraq was invaded.
continuation   introduces a new topic or intention
               The "Prestige" wandered about for a week, and it finally sunk.
elaboration    continues a presented topic or intention
               The "Prestige" wandered about for a week, all along the coast.
symmetric      attachment to a node at the same level in the discourse tree
               They lied to voters and so they lost the elections.
asymmetric     attachment to a node in a different level in the discourse tree
               Because they lied to voters, they lost the elections.

Table 5.18: Basic components of meaning of discourse relations, in three dimensions.

Within each of these dimensions, the range of possible meanings is ordered in a scale of markedness, from a default to the most marked case. The procedure for determining which meaning holds in each of these dimensions is systematized via the decision trees presented in Section 4.2.3. When judges were unsure about a feature for a relation, they left it underspecified. This resulted in an average of 2.6 features per relation, which increases slightly (to 2.7) for segments containing a discourse marker and decreases to 2.2 for relations at the beginning of a paragraph.

5.3.2.3 Glosses

If a semantic or discourse-structural feature was applicable to a relation, a gloss was provided for it. Glosses were intended to guarantee consistency and keep track of decision procedures, but they also served to encode finer-grained distinctions within each of the proposed features, which will be exploited to study, in a bigger corpus, the realization of discursive meanings that are widely accepted in the literature. The following glosses were provided (examples can be seen in Figure 5.12):

revision: elicit the information that is denied; in order of markedness, the first three are based on Lagerwerf (1998), the least marked on Umbach (2004):

denial of expectation: elicit the expectation
concessive: elicit the tertium comparationis
opposition: the related segments can be rephrased with a correction but (sino)
focus-based: if none of the others apply

cause: one of the four discourse markers with which the relation can be paraphrased: in order to (purpose), because (cause), that's why (reason), and therefore (consequence)

equality: the common class of things to which the two segments belong, which has to be salient in the context, either because it is lexicalized or by regular abstraction procedures, like hyperonymy

Glosses for continuation and elaboration summarize the topic or intention that is introduced or elaborated, respectively. This procedure increased the consistency of the annotation for these two features.

5.3.3 Preliminary results of annotation

First of all, we studied how each possible pair of features co-occurred, to detect possible redundancies in the features used to describe the semantics of discourse relations. We provide the correlation coefficient for each pair of features, ranging from 1 to -1, where 1 indicates that the two features of a pair always co-occur in the same nodes, and -1 indicates that they never co-occur. As can be seen in Table 5.19, features belonging to the sentence- and discourse-structural dimensions present a very high correlation: symmetric tends to co-occur with continuation and elaboration with asymmetric, and almost never the reverse. This, together with the fact that systematic mappings can be found between sentence- and discourse-structural features, seems to indicate that sentence-structural properties are irrelevant for the description of the semantics of discourse relations.

The consistency of the annotation was evaluated by kappa agreement between the values of the components of meaning assigned to each node. Agreement for the point of attachment was very low, and therefore it is not meaningful to provide it here. Additionally, we calculated the agreement for those nodes corresponding to segments dominated by a discourse marker and to segments at the beginning of a paragraph.

As can be seen in Table 5.20, the average agreement, κ = .54, is quite low, and does not guarantee a good reproducibility of the results. However, it has to be taken into account that this agreement was obtained in the preliminary phase of annotation, during the process when annotation criteria were being established, and also that the judges were not professional annotators. Even with trained professional annotators, Carlson, Marcu and Okurowski (2003) report κ = .6 in the initial stages of annotation, reaching κ = .75 at the end of the project.

Interestingly, kappa agreement for segments dominated by a discourse marker is significantly lower than the average, except in the case of the revision feature, which presents the highest agreement per feature. However, the number of features used to characterize discourse markers was higher than the average, which means that judges recognized them as highly informative of discourse organization but perceived the components of their meaning differently.

              causality  equality  revision  elab  cont  symm  asymm
context         -.15       -.47      -.04     .47  -.43  -.22   .27
causality         -        -.21      -.06    -.14   .15  -.15   .17
equality          -          -       -.05    -.39   .44   .52  -.47
revision          -          -         -     -.09   .06   .07  -.09
elaboration       -          -         -       -   -.92  -.52   .56
continuation      -          -         -       -     -    .57  -.52
symmetric         -          -         -       -     -     -   -.94

              causality  equality  revision  elab  cont  symm  asymm
context          .13       -.15      -.07     .17  -.19  -.16   .18
causality         -        -.19       .07    -.14   .16   .08  -.01
equality          -          -       -.16     .10  -.02   .25  -.19
revision          -          -         -     -.26   .32   .32  -.29
elaboration       -          -         -       -   -.85  -.49   .54
continuation      -          -         -       -     -    .54  -.46
symmetric         -          -         -       -     -     -   -.89

Table 5.19: Matrices of the correlation coefficient between the distribution of features of meaning in a descriptive (above) and an argumentative (below) text (average kappa agreement between annotators: κ = .64 and κ = .54, respectively).
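As an illustration of how the coefficients in Table 5.19 can be obtained, the sketch below computes the Pearson correlation (the phi coefficient) between the 0/1 vectors that mark, for each node, whether a feature was assigned to it. The thesis does not spell out the exact formula used, so this is only one natural reading of a "correlation coefficient ranging from 1 to -1", and the feature encoding is hypothetical:

    from math import sqrt

    def phi(x, y):
        """Pearson correlation (phi coefficient) between two binary vectors
        marking, for each node, whether a given feature was assigned to it."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
        sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
        sy = sqrt(sum((b - my) ** 2 for b in y) / n)
        return cov / (sx * sy) if sx and sy else 0.0

    def correlation_matrix(features):
        """`features` maps a feature name to its 0/1 vector over nodes
        (a hypothetical encoding of the annotated corpus)."""
        names = sorted(features)
        return {(a, b): phi(features[a], features[b])
                for i, a in enumerate(names) for b in names[i + 1:]}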

As can be seen from the low agreement for segments occurring at the beginning of a paragraph, these segments were very difficult for judges to characterize, which seems to indicate that the coherence mechanisms that apply at paragraph level are qualitatively different from those at the sentential and inter-sentential level. This seems a strong argument for treating the low-level organization of discourse as an autonomous level of language.

The fact that revision presents the highest agreement of all features seems to provide empirical support for the hierarchy of markedness proposed in Section 4.2.3.3: since revision is the most marked case, it is perceived more clearly by all judges, which results in higher agreement with respect to this meaning. This would also explain the low value of agreement for context, the lowest of all cases: since it is the least marked case, it is only weakly perceived by judges, which makes them disagree. However, it could also be argued that the low value of agreement for context is due to the fact that context is not a well-defined area of meaning, as discussed in Section 4.2.3.3.4.

Although judges had been allowed to relate segments to more than one discourse unit by more than one relation, this never happened in the corpus. This supports the claim that discourse can be represented as a hierarchical tree. In our framework, the structure of discourse is described as a superposition of the hierarchical trees that represent each dimension, where segments have only one relation in each dimension, although they might be attached to different segments in different dimensions (see Figure 3.1).

                              revision  cause  equality  context  prog.  elab.  sym.  asym.  average
total                           0.70     0.55    0.57      0.43    0.51   0.48  0.55   0.54    0.54
discourse markers               0.72     0.53    0.53      0.39    0.44   0.43  0.39   0.42    0.48
segments starting paragraph     0.03    -0.0     0.16      0.01    0.29   0.18  0.03   0.11    0.10

Table 5.20: Average kappa agreement between judges for the six annotated texts.

In sum, in this section we have presented the application of the proposed feature-based approach to the description of the organization of discourse to corpus annotation. An annotation framework has been established to systematize the procedures for decision taking, implemented as decision trees. These procedures are a direct mapping of the internal organization of the dimensions of discursive meaning proposed in Chapter 4.

The consistency of this preliminary annotation is comparable to that of the preliminary stages of other annotation initiatives. The fact that two of the three annotators involved had no background in linguistics provides support for the validity of the proposed features as basic components of the meaning of discourse relations. A detailed analysis of the data allowed us to detect redundancies in the initial set of features, which was then reduced to only those meanings that were found relevant for distinguishing discursive meanings. Moreover, the markedness hierarchy proposed for the semantic dimension of meaning is validated by the fact that more marked features present higher values of agreement.

This is only the preliminary stage of the development of a corpus annotated with discursive information. With a bigger corpus, we will be able to provide more significant descriptions of the phenomena under inspection, like a more thorough study of inter-judge agreement, an exploration of the patterns of attachment, and a study of the glosses, in order to get a grasp of the patterns of agreement for subkinds of revision, cause, topics, etc.

5.4 Automatic identification of discourse segments

There have been many attempts to identify segment boundaries automatically (Hirschberg and Grosz; Hirschberg and Litman 1993; Passonneau and Litman 1997a; Di Eugenio et al. 1997). In most cases, algorithms are induced from a manually annotated corpus, where segments and their characterizing features are encoded by human judges. A number of features, both linguistic and nonlinguistic, have been used to characterize discourse segments: prosody, cue phrases, referential links, intentional and informational status of segments, types of relations, level of embedding, etc.

In our case, we had only very little data from human judgements, clearly insufficient to induce an algorithm by machine learning techniques. Therefore, we took the corpus segmented by human judges only as a source of evidence to build an algorithm for the automatic identification of discourse segments. The basic structure of the algorithm is presented in Section 5.4.1.

The analysis of the human segmentation experiment presented in the previous section seems to indicate that there is a strong relation between shallow evidence on the organization of discourse and human judgements (or part of them). This is very convenient for our shallow approach to discourse analysis, and provides support for the development of a segmentation algorithm that is fully based on shallow evidence, like the one we present here. We present two versions of the algorithm: one that exploits evidence with no linguistic pre-processing and another that relies on a shallow linguistic analysis (chunking). We show that these two versions of the algorithm are perfectly compatible with each other, since they are organized incrementally. This incremental architecture also allows future enhancements of the algorithm to be added naturally, for example, if full parsing or semantic parsing are to be incorporated.

We present a rough evaluation of this algorithm in Section 5.4.3, purely indicative of the performance of the algorithm and the reliability of different kinds of information. We have evaluated the algorithm by standard precision and recall measures by comparison with two gold standards based on the set of texts used for the manual annotation presented in the previous section:

– the corpus where we manually identified marked and unmarked segments according to the theoretical concept of discourse segment presented in Section 3.3.2.

– a corpus where marked segments correspond to those words removed by the majority of judges, described in Section 5.2.

5.4.1 Algorithm to identify discourse segments

Following the general definition of segments proposed in Section 3.3.2, the algorithm has two distinct aims: first, identifying discourse segment boundaries and assessing their reliability as such; second, checking that the strings contained within boundaries satisfy the conditions on content necessary to be considered discourse segments. A formal outline of the algorithm can be seen in Figure 5.13; a description of the main steps follows.

Two basic actions can be distinguished in the algorithm: the default case, continuing a segment (Push Word to Segment), and the marked case, inserting a segment boundary (Make Segment Boundary and Insert Segment), which is only realized when there is evidence indicating the presence of a boundary above a certain reliability threshold and the requirements for completeness of the segment are met, if any.


for each Word in Text
    if BoundaryCandidate(Word)
        Explore Context
        if Reliability(Context) > Reliability Threshold and Complete(Segment)
            if Parenthetical(Word)
                Insert Segment
            else
                Make Segment Boundary
        else
            Push Word to Segment
    else
        Push Word to Segment

Figure 5.13: Algorithm for automatic identification of discourse segments.

Make Segment Boundary at position n (w_n) in the input string: consider w_n (punctuation, a word, or a cluster of any of them) a segment boundary, so that the previous segment is defined as S = ..., w_{n-3}, w_{n-2}, w_{n-1} and the next segment is defined as S' = w_{n+1}, w_{n+2}, w_{n+3}, .... Whether w_n belongs to the segment to its left, to its right or to neither of them depends on the kind of element that is marking the segment boundary.

Insert Segment: insert a segment S_p from the position of the first parenthetical evidence (parenthesis, apposition mark) to the position of the second parenthetical evidence, without interrupting the containing segment S_c, so that if the first parenthetical mark is found at w_n and the second at w_o, then S_p = w_n, ..., w_o and S_c = ..., w_{n-1}, w_{o+1}, ....
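A minimal data-structure sketch of these two operations, treating the running segment as a plain list of words; the function names and the `attach` parameter are illustrative, not part of the actual implementation:

    def make_segment_boundary(words, n, attach="none"):
        """Split the running word list at position n (the boundary element).

        `attach` says whether words[n] goes with the segment to its left,
        to its right, or to neither of them; this choice depends on the
        kind of element marking the boundary."""
        if attach == "left":
            return words[:n + 1], words[n + 1:]
        if attach == "right":
            return words[:n], words[n:]
        return words[:n], words[n + 1:]

    def insert_segment(words, n, o):
        """Embed words[n..o] (first to second parenthetical mark) as a
        segment of its own, without interrupting the containing segment."""
        embedded = words[n:o + 1]
        containing = words[:n] + words[o + 1:]
        return containing, embedded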

Segments are defined as lists of words organized in a sequence. Only in highly marked cases can a segment be embedded in another segment, so that the containing segment is not finished but interrupted by the embedded segment. This is the case for segments marked by parenthetical punctuation, clearly marked appositions (following the pattern noun phrase comma noun phrase comma) and segments with phrasal scope, as in example (13). Relative clauses are not considered embedded segments because it often happens that they are not embedded, as in example (14).

(13) It is also AI's practice to give its material to governments before publication for their views and additional information and the organization will publish these in its reports.

(14) Laughlin es un físico teórico que relaciona áreas tan dispares como la del plegamiento de las proteínas y la de la superconductividad de altas temperaturas.
Laughlin is a theoretical physicist who relates such disparate areas as protein folding and superconductivity at high temperatures.

The algorithm consists of the following basic methods:

identify boundary candidates: mainly punctuation and discourse markers (from an electronic lexicon, like that in Appendix A) are possible boundary candidates, but some syntactic structures (relative clauses, some kinds of phrases) can also be considered as such, if a syntactic analysis of the text is available.

assess the reliability of boundary candidates to be actual segment boundaries, that is, disambiguate boundary candidates with respect to their sentential or discursive function (such functions are exemplified in Section 4.1.1, examples (2-a) and (2-b), respectively). For each boundary candidate, we can determine how reliably it is performing a discursive function based on:

– the intrinsic reliability of each boundary candidate, as associated with it by the algorithm (in the case of punctuation) or in a lexicon (in the case of discourse markers). In the lexicon presented in Appendix A, discourse markers are not intrinsically associated with different degrees of reliability, although this could arguably be an intrinsic property, for example, correlated with a grammaticalization index. Indeed, highly grammaticalized discourse markers (see Table A.7) are not included in the lexicon because they are highly ambiguous, and thus unreliable as signals of discursive function, but they are exploited by the segmentation algorithm as boundaries of weak reliability.

– its context of occurrence in the text:
  - co-occurrence with other boundary candidates, so that the more boundary candidates co-occur, the more reliably they are performing a discursive function.
  - properties of the context of occurrence, like position in the paragraph or sentence, or part of speech of the surrounding words or constituents, if that is available.

assess whether the segment is complete: a segment is usually considered complete if there is evidence signalling that it can convey a proposition; typically, if it is headed by an independent verbal form, inflected (15) or non-inflected (16). We can also exploit other cases that clearly convey a distinct unit of meaning about the state of affairs, like appositions (17) or dislocated constituents (18). As already discussed in Section 3.3.2, characterizing discourse segments by their content requires a certain analysis of the text. Therefore, in the version of the algorithm that is exclusively based on pattern-matching, no requirements on content can be exploited.

(15) In a breakthrough case on 20 March this year, four policemen in Guatemala City were convicted of the murder of a 13-year-old street child and sentenced to between 10 and 15 years' imprisonment.

(16) Others freed included William Masiku, detained since 1980, Brown Mmpinganjira, detained since 1986, Margaret Marango Banda, detained since 1988, and Blaise Machira, also detained since 1988.

(17) Goodluck Mhango, a veterinary surgeon arrested in September 1987, has been rejected for release by a committee established to review the cases of political detainees.

(18) At their trial, the judge is believed to have added 25 to each sentence specifically because the police had carried out the attack while operating in their official capacity.

The following thresholds of reliability are established to determine when a boundary candidate is signalling an effective boundary:

reliability 1: the lowest necessary to identify a discourse segment boundary, achieved by the presence of a single piece of weak evidence of a boundary (a weak punctuation sign, a highly ambiguous discourse marker); the evidence provided by boundaries with reliability 1 is only considered when recall is heavily prioritized over precision.

reliability 2: an accumulation of two pieces of weak evidence of a boundary, or a single strong one.

reliability 3: an accumulation of three pieces of weak evidence of a boundary, or an accumulation of evidence at least one piece of which is strong; this configuration always marks a segment boundary. When precision is prioritized over recall, only segment boundaries of reliability 3 are considered effective boundaries.
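The reliability levels above translate directly into a small scoring function; the counts of weak and strong evidence are exactly those described in the text, while the function and variable names are of course only illustrative:

    def reliability(weak_evidence, strong_evidence):
        """Reliability of a boundary candidate from the number of weak and
        strong pieces of co-occurring evidence, following the levels above."""
        total = weak_evidence + strong_evidence
        if strong_evidence >= 1 and total >= 2:
            return 3          # strong evidence plus any other evidence
        if strong_evidence == 1:
            return 2          # a single piece of strong evidence
        if weak_evidence >= 3:
            return 3          # three (or more) pieces of weak evidence
        if weak_evidence == 2:
            return 2
        if weak_evidence == 1:
            return 1
        return 0

    def is_boundary(weak, strong, threshold):
        # Threshold 3 favours precision; lower thresholds favour recall.
        return reliability(weak, strong) >= threshold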

Table 5.21 displays results for recall and precision in the identification of segments, considering segment boundaries at reliability levels 1, 2 and 3. In general, it can be seen that the higher the reliability threshold, the higher the precision. But a higher reliability threshold reduces recall much more than it increases precision, as is reflected in the decreasing F-measure. Different reliability thresholds can be used for different purposes of analysis.

5.4.2 Variants of the segmentation algorithm

Two variants of the basic segmentation algorithm have been developed, based on different kinds of input: raw text or text analyzed with shallow parsing. These two variants are related, so that the second is naturally incremented from the raw-text version. This incremental architecture also allows future enhancements of the algorithm to be added naturally, for example, if full parsing or semantic parsing are to be incorporated.

The basic difference between the pattern-matching algorithm and the general one is that the former does not perform any analysis of the content of discourse segments, and also that syntactic structures are not considered as boundary candidates. An overview of the algorithm can be seen in Figure 5.14. The following textual cues are exploited:

weak evidence of a discourse segment boundary: weak punctuation (","), vague discourse markers (listed in Table A.7), highly grammaticalized words like subordinating and coordinating conjunctions or relative pronouns.

strong evidence of a discourse segment boundary: discourse markers, strong punctuation (".", "?", "!", ":", ";"), parenthetical punctuation.

The algorithm based on shallow parsing, seen in Figure 5.15, follows the basic structure of the general algorithm, but it can exploit richer information about the context of occurrence of boundary candidates. The analysis provides information about the part of speech of words and their organization in phrase-like units (chunks). This information makes it possible to improve the process of assigning reliability and to determine the scope of segment boundaries with higher accuracy. The following cases can be identified by exploiting the shallow analysis of text:

– constituents (adverbial phrases and prepositional phrases) dislocated to the left constitute segments, as in example (18).

– appositions can be identified as the pattern noun phrase comma noun phrase comma, as in example (17).

– if a phrasal discourse marker (typically, a preposition) is not followed by a subordinating conjunction (like that), its scope is reduced to the prepositional phrase it syntactically dominates, and a segment is inserted instead of making a segment boundary. In the pattern-matching algorithm we do not know which words in a sentence constitute phrases, so we cannot regulate the scope of discourse markers according to their syntactical type, and they always have scope until the following segment boundary.

– unreliable boundary candidates, like commas and coordinating conjunctions, can reliably indicate boundaries in a context like verb ... boundary candidate ... verb, as in example (15) (a sketch of this check and of the apposition pattern over chunks is given right after this list).
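The following sketch illustrates two of the chunk-based checks listed above (the verb ... candidate ... verb context and the apposition pattern), assuming a hypothetical chunk representation of (tag, words) pairs with tags like NP, VP and ","; the real chunker output used in the thesis may differ:

    def has_verb(chunks, start, end):
        """True if any chunk in chunks[start:end] is tagged as a verb group."""
        return any(tag == "VP" for tag, _ in chunks[start:end])

    def reliable_weak_boundary(chunks, i):
        """A weak candidate (comma, coordinating conjunction) at chunk i is
        promoted to a reliable boundary if there is a verb on both sides,
        i.e. the context `verb ... candidate ... verb` described above."""
        return has_verb(chunks, 0, i) and has_verb(chunks, i + 1, len(chunks))

    def apposition_at(chunks, i):
        """Check the pattern `noun phrase , noun phrase ,` starting at chunk i;
        the second noun phrase is treated as an embedded segment."""
        def tag(j):
            return chunks[j][0] if 0 <= j < len(chunks) else None
        return (tag(i) == "NP" and tag(i + 1) == "," and
                tag(i + 2) == "NP" and tag(i + 3) == ",")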

The following example shows the different analyses provided for the same sentence by each variant of the segmentation algorithm, the one based on pattern-matching (19-a) and the one based on shallow parsing (19-b), with a reliability threshold of r = 2. Boundary candidates are in boldface, candidates with r > 1 are boxed, and evidence provided by the context is italicized.

(19) AI delegations have been expelled from countries after discovering evidence of human rights abuse and several countries, China, have refused Amnesty entry.
a. AI delegations have been expelled from countries after discovering evidence of human rights abuse and several countries, China, have refused Amnesty entry.
b. AI delegations have been expelled from countries after discovering evidence of human rights abuse and several countries, China, have refused Amnesty entry.

for each Word in Text
    if Punctuation(Word)
        if Strong(Word)
            Reliability = 3
        else if Parenthetical(Word)
            Insert Segment
        else if Weak(Word)
            Reliability = 1
            if DiscourseMarker(Word+1)
                Reliability = 3
            else if VagueDiscourseMarker(Word+1)
                Reliability = 2
    else if DiscourseMarker(Word)
        Reliability = 2
        if DiscourseMarker(Word+1) or VagueDiscourseMarker(Word+1)
            Reliability = 3
    else if VagueDiscourseMarker(Word)
        Reliability = 1
        if DiscourseMarker(Word+1)
            Reliability = 3
        else if VagueDiscourseMarker(Word+1)
            Reliability = 2
    if Reliability(Context) > Reliability Threshold
        Make Segment Boundary
    else
        Push Word to Segment

Figure 5.14: Algorithm for automatic identification of discourse segments by pattern-matching.

for each Word in Text
    if Punctuation(Word)
        if Strong(Word)
            Reliability = 3
        else if Parenthetical(Word)
            Insert Segment
        else if Weak(Word)
            Reliability = 1
            if DiscourseMarker(Word+1)
                Reliability = 3
            else if VagueDiscourseMarker(Word+1)
                Reliability = 2
            if HasVerb(Segment) and HasVerb(Segment+1)
                Reliability = 3
            else if (PrepPhrase(Chunk-1) or AdvPhrase(Chunk-1)) and Strong(Punctuation(Chunk-2))
                Reliability = 3
            else if NounPhrase(Chunk-1) and NounPhrase(Chunk+1) and Punctuation(Chunk+2)
                Insert Segment
    else if DiscourseMarker(Word)
        Reliability = 2
        if DiscourseMarker(Word+1) or VagueDiscourseMarker(Word+1)
                or AdvPhrase(Chunk-1) or PrepPhrase(Chunk+1)
            Reliability = 3
        if Phrasal(Word) and PrepPhrase(Chunk)
                and Reliability(Context) > Reliability Threshold
            Insert Segment
    else if VagueDiscourseMarker(Word)
        Reliability = 1
        if DiscourseMarker(Word+1)
            Reliability = 3
        else if VagueDiscourseMarker(Word+1) or AdvPhrase(Chunk-1) or PrepPhrase(Chunk+1)
            Reliability = 2
        if HasVerb(Segment) and HasVerb(Segment+1)
            Reliability = 3
    if Reliability(Context) > Reliability Threshold
        Make Segment Boundary
    else
        Push Word to Segment

Figure 5.15: Algorithm for automatic identification of discourse segments in text analyzed by shallow parsing.

The natural enhancement to the algorithms presented here consists in exploiting information from full parsing. Unfortunately, full parsing is beyond the current state of the art of NLP for Spanish and Catalan. In Fuentes et al. (2003) we applied this intuition to English, exploiting the full parse provided by MINIPAR (Minipar), and showed that exploiting argumental structure is useful to compress sentences for headline-style summaries. We also showed that the relevance of the words in a constituent is not only indicative of the relevance of its containing segment, but also of whether the constituent can be considered a segment independent of its matrix clause, when no other information is available. In example (20), we do not know whether in light of is a discourse marker, because it is not stored in our lexicon, and we do not know either whether the present participle surrounding is in absolute position, because there is no comma preceding it. However, the fact that these constituents contain words that are relevant in the document and different from those in the matrix clause (in boxes) allows us to identify them as autonomous segments with high reliability.

(20) TORONTO (AP) Members of the delegation for Quebec City's 2002 Winter Olympics bid feel betrayed in light of the scandal surrounding the successful bid by Salt Lake City.

5.4.3 Evaluation of the automatic identification of discourse segments

In this section we present a rough evaluation of the variants of the algorithm for automatic identification of discourse segments. No strong conclusions can be drawn from this evaluation, for two main reasons: first, the evaluation corpus is rather small, so the results have low statistical significance; second, as follows from the conclusions in Section 5.2, the concept of discourse segment is not empirically stable. Therefore, a gold standard based on this concept will not be stable either, and neither will an evaluation based on a comparison with such a gold standard.

Even so, this evaluation aims to be indicative of the performance of the general algorithm and its variants. Most of all, it aims to gain insight into the reliability of the different kinds of information that contribute to identifying discourse segments. We present an evaluation based on a comparison with two gold standards, measured by precision and recall, shown in Table 5.21.

5.4.3.1 Comparison with a gold standard

We have evaluated the algorithm by standard precision and recall measures by comparison with two gold standards based on the set of texts used for the manual annotation presented in Section 5.2:

– the corpus where we identified marked and unmarked segments according to the theoretical concept of discourse segment (described in Section 5.2.1) (theoretical gold standard), and

– a corpus where segments were identified by taking the opinion of the majority of judges to determine which spans of text are removable (removability gold standard).

Comparisons between automatic and manual segmentation have been carried out on a per-word basis, instead of a per-segment basis. We have chosen this rate of comparison because it allows us to compare segments produced by theoretical, removability and automatic criteria, which do not always cover the same spans of text. Two different versions of each algorithm have been evaluated: one that is based on the lexicon of prototypical discourse markers presented in Section 3.4.3, and another that is based on an ad-hoc lexicon of discourse markers that contains all the discourse markers present in the texts to be segmented. The aim of these two versions of the algorithms is to assess the contribution of discourse markers to the identification of discourse segments.

The results of the evaluation can be seen in Table 5.21. Results are given in terms of precision, recall and their harmonic mean (F-measure) by comparison with human segmentation. We can see that there is only a slight difference between the comparison of automatic procedures with either of the two gold standards, theoretical or removability. This is good news, because it indicates that automatic procedures seem to identify segments that are perceived by all kinds of judges. It can be seen that the algorithm based on pattern matching systematically presents higher recall than the algorithm based on shallow parsing, while the latter obtains higher precision, although with a lower harmonic mean. An extended discourse marker lexicon improves recall, but restrictions in reliability, that is, increasing the reliability threshold, produce a drop in recall without a compensating gain in precision, which severely affects the harmonic mean. Precision is not improved by an extended discourse marker lexicon. The best results seem to be obtained at reliability thresholds of 1 or 2, and with an extended discourse marker lexicon.

It has often been shown (Litman 1996) that the ambiguity of textual cues can be resolved if various sources of evidence are combined. We have successfully combined discourse markers, shallow syntactic structures and punctuation, which yields an improvement in the task of automatic discourse segmentation. We have also shown that the most informative of these three sources of evidence is discourse markers, and that their combination with punctuation significantly increases the reliability of the segmentation. Partial syntactical structures (chunks) provide information that is only reliable if combined with punctuation, and that is more error-prone than discourse markers.

Hirschberg and Litman (1993) show that the role of punctuation is equivalent to that of prosody in determining the function of discourse markers. A comparison of the textual and prosodic models proposed by Hirschberg and Litman shows that the textual model, which exploits only punctuation, performs significantly better than the one relying on prosody (20% vs. 25% error rate, respectively). Therefore, there are both applied and theoretical reasons to consider punctuation a first-class source of information about discourse structure.
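The per-word evaluation can be sketched as follows, assuming (purely for illustration) that both the automatic segmenter output and a gold standard are reduced to sets of word positions that fall inside marked segments:

    def per_word_scores(predicted, gold):
        """Precision, recall and F-measure computed on a per-word basis.

        `predicted` and `gold` are sets of word positions inside marked
        segments according to the automatic segmenter and to the gold
        standard, respectively (an illustrative encoding)."""
        true_pos = len(predicted & gold)
        precision = true_pos / len(predicted) if predicted else 0.0
        recall = true_pos / len(gold) if gold else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        return precision, recall, f_measure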

algorithm based on pattern-matching
                              theoretical gold standard       removability gold standard
confidence threshold        precision  recall  F-measure    precision  recall  F-measure
default discourse markers
  0                            .69       .71      .70          .49       .67      .57
  1                            .73       .42      .53          .54       .42      .47
  2                            .70       .31      .43          .54       .32      .40
  3                            .78       .28      .41          .58       .28      .37
ad-hoc discourse markers
  0                            .67       .78      .72          .47       .73      .57
  1                            .68       .53      .60          .49       .51      .50
  2                            .72       .34      .46          .56       .35      .43
  3                            .78       .29      .42          .54       .29      .38

algorithm based on shallow parsing
                              theoretical gold standard       removability gold standard
confidence threshold        precision  recall  F-measure    precision  recall  F-measure
default discourse markers
  0                            .78       .52      .61          .57       .51      .52
  1                            .78       .52      .61          .57       .51      .52
  2                            .78       .57      .47          .57       .46      .50
  3                            .87       .20      .32          .68       .22      .32
ad-hoc discourse markers
  0                            .75       .60      .66          .54       .58      .55
  1                            .75       .60      .66          .54       .58      .55
  2                            .78       .52      .62          .58       .52      .54
  3                            .87       .22      .35          .66       .23      .34

Table 5.21: Evaluation of two algorithms for automatic discursive segmentation of text, one based on pattern-matching and another based on shallow parsing, by comparison with a gold standard based on the theoretical concept of discourse segment and with a gold standard based on the judgements about removability of words made by naive judges.

From all this we can conclude that automatic discourse segmentation at the level of minimal discourse units reaches a satisfactory performance level with shallow NLP techniques only. It benefits strongly from a good treatment of discourse markers, especially from their disambiguation with respect to sentential or discursive function and with respect to their relation with punctuation marks. This suggests that work like that of Marcu (2000) describing discourse markers is basic to discourse parsing. An extensive study of the contribution of each kind of evidence to the improvement of the automatic segmentation algorithm is left for future work.

Deeper linguistic analyses are of arguable use for improving discourse segmentation. Partial syntactical structures do not seem to be useful for this kind of discourse segmentation. We have also experimented with using full syntactic structures to identify discourse segments exploiting argumental structure (Fuentes, Massot, Rodríguez and Alonso 2003), but the contribution of this information still remains to be analyzed in depth. Our intuitions indicate that other kinds of analyses, like information structure, would be of more use.

5.5 Automatic identification of discourse markers

As has been explained in the previous sections, discourse markers constitute a major source of evidence to improve the performance of basic algorithms that automatically obtain a representation of discourse, that is, that identify discourse segments and the relations between them. Therefore, applications will work better if they have access to a larger number of discourse markers.

Work concerning discourse markers has been mainly theoretical, and applications to NLP have been mainly oriented to restricted natural language generation applications, mostly for English and, to some extent, also for German. The usual approach to building discourse marker resources is fully manual. For example, discourse marker lexicons are built by gathering and describing discourse markers from corpora or from the literature on the subject, a very costly and time-consuming process. Moreover, due to variability among humans, discourse marker lexicons tend to suffer from inconsistency in their extension and intension. To inherent human variability, one must add the general lack of consensus about the appropriate characterisation of discourse markers for NLP. All this prevents the reusability of these costly resources.

As a result of the fact that discourse marker resources are built manually, they present uneven coverage of the actual discourse markers in corpora. More concretely, when working on previously unseen text, it is quite probable that it contains discourse markers that are not in a manually built discourse marker lexicon. This is a general shortcoming of all knowledge that has to be obtained from corpora, but it becomes more critical with discourse markers, since they are very sparse in comparison to other kinds of corpus-derived knowledge, such as terminology. It follows that, due to the limitations of humans, a lexicon built by mere manual corpus observation will cover a very small number of all possible discourse markers.

In Section 3.4 we have defined discourse markers, identifying their characterizing features and creating a lexicon of prototypical discourse markers. In this section we apply a knowledge-poor lexical acquisition approach to find previously unseen discourse markers in huge amounts of text, with an acceptable precision rate. To do that, we exploit some features of discourse that are not strictly defining, in the sense that they may not necessarily be applicable to all discourse markers and that they may be applicable to linguistic items that we do not consider discourse markers, but that we have found highly indicative of markerhood³, thus contributing to complete the definition of discourse marker.

This section is organized as follows. First, we describe the characterizing features of discourse markers that we have exploited. Then, in Section 5.5.2, we present X-Tractor, a tool that applies a knowledge-poor approach to identify discourse markers as characterized by their occurrences in huge amounts of text. Section 5.5.3 discusses the results obtained by X-Tractor in an experiment to enhance a discourse marker lexicon for Spanish.

5.5.1 Features characterizing prototypical discourse markers

The features that we have exposed in Section 3.3 allow a human judge to determine whether a given string is a discourse marker, but they are not helpful for discriminating what might be a discourse marker among a set of strings with shallow NLP techniques, because they rely on a deep understanding of text. Discourse markers have a tendency to co-occur with certain phenomena at surface level, which can then be taken as indicators of the presence of a discourse marker. The phenomena that we have found indicative of markerhood are listed below (a small computational sketch of the two frequency-based indicators is given after the list):

– occurrence at the beginning of a paragraph or sentence, or surrounded by punctuation; these are structural positions that are preferably occupied by words relating discourse segments.
– containing a previously known discourse marker or part of one; this signals that (at least part of) the semantics of the string is discursive.
– containing one of a set of pre-defined patterns of words (of the kind preposition + pronoun, adverb + subordinating conjunction, coordinating conjunction + adverb), which indicate that the string relates elements beyond propositional scope.
– presence of anaphoric expressions, which signal a relation beyond propositional scope.
– high mutual information of the words that constitute the discourse marker, which is indicative of its degree of grammaticalization.
– presence of infrequent words, which signal a discontinuity in the text and may be correlated with a segment boundary.

3 By analogy with termhood (Kageura and Umino 1996), used to indicate the likelihood that a term candidate is an actual term, we call markerhood the likelihood that a discourse marker candidate is an actual discourse marker.
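To make the two frequency-based indicators concrete, here is a minimal sketch in Python (X-Tractor itself is implemented in Perl; the function names, the counters passed in and the rarity threshold are all illustrative assumptions, not part of the system's actual interface):

    import math

    def mutual_information(candidate, unigram_counts, candidate_counts, total_tokens):
        # A simple n-way generalization of pointwise mutual information: log of the
        # probability of the whole candidate over the product of the probabilities
        # of its words. Higher values suggest a stronger degree of grammaticalization.
        words = candidate.lower().split()
        joint = candidate_counts.get(candidate.lower(), 0) / total_tokens
        if joint == 0 or any(unigram_counts.get(w, 0) == 0 for w in words):
            return 0.0
        independent = 1.0
        for w in words:
            independent *= unigram_counts[w] / total_tokens
        return math.log(joint / independent)

    def has_infrequent_word(candidate, unigram_counts, threshold=5):
        # True if the candidate contains a word seen fewer than `threshold` times,
        # taken here as a possible correlate of a segment boundary.
        return any(unigram_counts.get(w, 0) < threshold
                   for w in candidate.lower().split())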

We consider that, although these features do not exclusively define discourse markers, and discourse markers may occur without any of them, when huge amounts of corpora are taken into consideration they are strongly indicative of markerhood. We support this claim empirically with an experiment to enhance a starting lexicon of discourse markers.

5.5.2 A knowledge-poor approach to acquire discourse markers from corpus: X-Tractor

One of the main aims of this system is to be useful for a variety of languages, including those for which NLP resources are not available. Therefore, we have tried to remain independent of any hand-crafted resources, such as annotated texts or NLP tools. Following the line of Engehard and Pantera (1994), syntactic information is handled by way of patterns of function words, which are finite and therefore listable. This keeps the cost of the system quite low, both in terms of processing and of human resources.

Focusing on adaptability, the architecture of X-Tractor is highly modular. As can be seen in Figure 5.16, it is based on a language-independent kernel implemented in Perl and a number of modules that provide linguistic knowledge. The input to the system is a starting discourse marker lexicon and a corpus with no linguistic annotation. Discourse marker candidates are extracted from the corpus by applying linguistic knowledge to it. Two kinds of knowledge can be distinguished: general knowledge of the language and knowledge obtained from a starting discourse marker lexicon.

The discourse marker extraction kernel works in two phases: first, a list of all might-be discourse markers in the corpus is obtained, with some characterising features associated to each of them; a second step consists in ranking the discourse marker candidates by their likelihood to be actual markers, or markerhood. This ranked list is validated by a human expert, and actual discourse markers are introduced into the discourse marker lexicon. The enhanced lexicon can then be re-used as input for the system. In what follows we describe the different parts of X-Tractor in detail.
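The extraction-and-ranking cycle just described can be summarized in a few lines. The sketch below is in Python rather than Perl, and every name in it (xtract, extract_candidates, markerhood, validate) is illustrative rather than the actual interface of X-Tractor:

    def xtract(corpus, lexicon, extract_candidates, markerhood, validate):
        # corpus:  plain, unannotated text; lexicon: set of known discourse markers.
        # Phase 1: collect every might-be discourse marker with its characterising features.
        candidates = extract_candidates(corpus, lexicon)
        # Phase 2: rank the candidates by their estimated markerhood.
        ranked = sorted(candidates, key=lambda c: markerhood(candidates[c]), reverse=True)
        # A human expert validates the ranked list; accepted items enrich the lexicon,
        # which can then be fed back as input for a new run of the system.
        accepted = validate(ranked)
        return lexicon | accepted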

5.5.2.1 Linguistic Knowledge

Two kinds of linguistic knowledge are distinguished: general and lexicon-specific. General knowledge is stored in two modules. One of them accounts for the distribution of discourse markers in naturally occurring text, in the form of rules. It is rather language-independent, since it exploits general discursive properties, such as occurrence in discursively salient contexts like the beginning of a paragraph or sentence. The second module is a list of stopwords or function words of the language in use.

Lexicon-specific knowledge is obtained from the starting discourse marker lexicon. It also consists of two modules: one containing classes of words that constitute discourse markers and another with the rules for legally combining these classes of words.

[Figure 5.16: Architecture of X-Tractor. The unannotated corpus and the discourse marker (DM) lexicon feed the X-Traction kernel, whose two modules (DM extraction and DM ponderation) draw on language-dependent knowledge (stopwords, generic DM rules, DM-defining words, syntactic DM rules) and on properties of the corpus and of the DM set; the ranked output undergoes human validation before enriching the lexicon.]

for each word in string
    if word is a preposition, then
        if word-1 is an adverb, then
            if word-2 is a coordinating conjunction, then
                if word+1 is a discourse-content word, then
                    if word+2 is a preposition, then
                        assign the discourse marker candidate structural weight 5
                    elsif word+2 is a subordinating conjunction, then
                        assign the discourse marker candidate structural weight 5
                    else
                        assign the discourse marker candidate structural weight 4
                elsif word+1 is a pronoun, then
                    assign the discourse marker candidate structural weight 4
                else
                    assign the discourse marker candidate structural weight 3

Figure 5.17: Example of rules for the combination of discourse marker-constituting words

In the application of this system to Spanish, we started with a Spanish discourse marker lexicon consisting of 577 discourse markers^4, described in Alonso (2001). This experiment was carried out before the lexicon described in Appendix A existed, but all discourse markers contained in our current prototypical lexicon were already present in this first lexicon. We transformed this lexicon into the kind of knowledge required by X-Tractor, and obtained 6 classes of words (adverbs, prepositions, coordinating conjunctions, subordinating conjunctions, pronouns and content words), totalling 603 lexical items, and 102 rules for combining them. In the implementation, the words are listed and treated by pattern-matching, and the rules are expressed in the form of if-then-else conditions on this pattern-matching (see Figure 5.17).
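As an illustration of how such if-then-else rules translate into pattern-matching code, the sketch below renders the rule of Figure 5.17 in Python. The word classes are toy subsets invented for the example (the real modules contain 603 lexical items in 6 classes), and the function name is ours:

    # Toy word classes; in X-Tractor they come from the transformed lexicon.
    PREPOSITIONS = {"a", "de", "por", "en"}
    ADVERBS = {"más", "menos", "así"}
    COORDINATING = {"y", "o", "pero"}
    SUBORDINATING = {"que", "si"}
    PRONOUNS = {"ello", "eso", "esto"}
    DISCOURSE_CONTENT = {"pesar", "causa", "consecuencia"}

    def structural_weight(words, i):
        # Weight assigned to a candidate centred on the preposition words[i],
        # following the nesting of Figure 5.17; 0 means the rule does not apply.
        def is_a(j, word_class):
            return 0 <= j < len(words) and words[j] in word_class

        if not (is_a(i, PREPOSITIONS) and is_a(i - 1, ADVERBS) and is_a(i - 2, COORDINATING)):
            return 0
        if is_a(i + 1, DISCOURSE_CONTENT):
            if is_a(i + 2, PREPOSITIONS) or is_a(i + 2, SUBORDINATING):
                return 5
            return 4
        if is_a(i + 1, PRONOUNS):
            return 4
        return 3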

5.5.2.2 Extraction of discourse marker candidates

Discourse marker candidates are extracted by applying the above-mentioned linguistic knowledge to plain text. Since discourse markers suffer from data sparseness, it is necessary to work with a huge corpus to obtain a relatively good characterisation of them. In the application to Spanish, strings were extracted if they satisfied at least one of the following conditions (a schematic implementation is sketched after the list):

– significant tendency to occur in positions typically occupied by words relating discourse segments: beginning of a paragraph, beginning of a sentence, or marked by punctuation.

– it contains lexical items that are parts of discourse markers in the lexicon.

– it matches one of a set of pre-defined patterns of words that signal that the string has a relating function beyond propositional scope.

4 We worked with 784 expanded forms corresponding to 577 basic cue phrases.
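A schematic rendering of this extraction step is given below. It is a simplification under stated assumptions: the first condition, a significant tendency to occur in salient positions, is approximated here by a single-occurrence check, and all names, as well as the regular expressions for the pre-defined patterns, are illustrative:

    import re

    def extract_candidates(text, lexicon, patterns, max_len=4):
        # text:     plain, unannotated text
        # lexicon:  set of known discourse markers (lower-cased strings)
        # patterns: compiled regular expressions encoding the pre-defined
        #           function-word patterns (e.g. preposition + pronoun)
        candidates = set()
        for fragment in re.split(r"[.;:!?]\s+", text):
            words = fragment.lower().split()
            for start in range(len(words)):
                for n in range(1, max_len + 1):
                    chunk = " ".join(words[start:start + n])
                    in_salient_position = start == 0      # fragment-initial position
                    contains_known_dm = any(dm in chunk for dm in lexicon)
                    matches_pattern = any(p.search(chunk) for p in patterns)
                    if in_salient_position or contains_known_dm or matches_pattern:
                        candidates.add(chunk)
        return candidates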

5.5.2.3 Assessment of markerhood

Once all the possible might-be discourse markers have been obtained from the corpus, they are weighted as to their markerhood, and a ranked list is built. Different kinds of information are taken into account to assess markerhood:

– Frequency of occurrence of the discourse marker candidate in the corpus, normalised by its length in words and excluding stopwords. Normalisation is achieved by the function normalised frequency = length · log(frequency).
– Frequency of occurrence in discursively salient contexts. Discursively salient contexts are preferred locations of occurrence for discourse markers.
– Mutual Information of the words forming the discourse marker candidate. Word strings with higher mutual information are assumed to be more plausible lexical units.
– Internal Structure of the discourse marker, that is to say, whether it follows one of the rules of combination of discourse marker words. For this application, X-Tractor was aimed at obtaining discourse markers other than those already in the starting lexicon; therefore, longer well-structured discourse marker candidates were prioritised, that is to say, the longer the rule that a discourse marker candidate satisfies, the higher the value of this parameter.
– Discourse Content of the discourse marker candidate is increased by the number of words it contains that have clearly discursive semantics. These words are listed in one of the modules of external knowledge.
– Lexical Weight accounts for the presence of infrequent words in the discourse marker candidate. Infrequent words make a discourse marker with high markerhood more likely to act as a segment boundary marker.
– Linking Function of the discourse marker candidate accounts for its power to link spans of text, mostly by reference.
– Length of the discourse marker candidate is relevant for obtaining new discourse markers if we take into consideration the fact that discourse markers tend to aggregate.

These parameters are combined by weighted voting for markerhood assessment, so that the importance of each of them for the final markerhood assessment can be adapted to different targets. By assigning a different weight to each of these parameters, the system can be used to extract discourse markers useful for heterogeneous tasks, for example automated summarisation, anaphora resolution or information extraction. In the application to Spanish, we were looking for discourse markers that signal discourse structure useful for automated text summarisation, that is to say, mostly indicators of relevance and coherence relations.
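A minimal sketch of the weighted voting follows. The weight values shown are purely illustrative (they loosely echo the tuning discussed in Section 5.5.3.2, where structural and contextual information received the highest weight and frequency and mutual information none); the parameter scores are assumed to have been computed beforehand for each candidate:

    import math

    ILLUSTRATIVE_WEIGHTS = {
        "structure": 3.0,
        "salient_context": 3.0,
        "discourse_content": 2.0,
        "length": 2.0,
        "lexical_weight": 1.0,
        "linking_function": 1.0,
        "normalised_frequency": 0.0,
        "mutual_information": 0.0,
    }

    def normalised_frequency(frequency, length_in_words):
        # Frequency normalised by candidate length, as defined above:
        # normalised frequency = length * log(frequency).
        return length_in_words * math.log(frequency) if frequency > 0 else 0.0

    def markerhood(parameter_scores, weights=ILLUSTRATIVE_WEIGHTS):
        # Weighted vote over the parameter scores of a single candidate;
        # parameters missing from `parameter_scores` simply contribute nothing.
        return sum(w * parameter_scores.get(name, 0.0) for name, w in weights.items())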

5.5.3 Results and discussion

We ran X-Tractor on a sample totalling 350,000 words of Spanish newspaper text, and obtained a ranked list of discourse markers with an indication of their potential as segment boundary markers. Only 372 out of the 577 discourse markers in the discourse marker lexicon could be found in this sample, which indicates that a bigger corpus would provide a better picture of the discourse markers of the language, as will be developed below.

5.5.3.1 Evaluation of Results

Evaluation of lexical acquisition systems is a problem still to be solved. Typically, the metrics used are standard IR metrics, namely precision and recall of the terms retrieved by an extraction tool, evaluated against a document or collection of documents where terms have been identified by human experts (Vivaldi 2001). Precision accounts for the number of term candidates extracted by the system which have been identified as terms in the corpus, while recall states how many terms in the corpus have been correctly extracted.

This kind of evaluation presents two main problems. First, the bottleneck of hand-tagged data: a large-scale evaluation implies a costly effort and a long time for manually tagging the evaluation corpus. Secondly, since terms are not well defined, there is significant variability between judges, which makes it difficult to evaluate against a sound gold standard.

For the evaluation of discourse marker extraction, these two problems become almost unsolvable. In the first place, discourse marker density in corpus is far lower than term density, which implies that judges would have to read a huge amount of corpus to identify a number of discourse markers significant for evaluation; in practical terms, this is almost unaffordable. Moreover, X-Tractor's performance is optimised for dealing with huge amounts of corpus. On the other hand, the lack of a reference concept for discourse marker makes inter-judge variability for discourse marker identification even higher than for term identification.

Given these difficulties, we have carried out an alternative evaluation of the presented application of the system. To give a hint of the recall of the obtained discourse marker candidate list, we have determined how many of the discourse markers in the discourse marker lexicon were extracted by X-Tractor, and how many of the extracted discourse marker candidates were discourse markers in the lexicon^5. To evaluate the goodness of markerhood assessment, we have computed the ratio of discourse markers in the lexicon that could be found among the 100 and 1000 highest ranked discourse marker candidates given by X-Tractor. To evaluate the enhancement of the initial set of discourse markers that was achieved, the 100 highest ranked discourse marker candidates were manually revised, and we obtained the ratio of actual discourse markers, or strings containing discourse markers, that were not in the discourse marker lexicon. Noise has been calculated as the ratio of non-discourse markers among the 100 highest ranked discourse marker candidates.
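The quantities used in this alternative evaluation can be computed as in the sketch below (the function names are ours; `ranked` is the candidate list ordered by markerhood, and a candidate counts as a hit when it is, or contains, a marker of the lexicon):

    def coverage(extracted, lexicon_in_corpus):
        # Share of the lexicon markers occurring in the corpus that were extracted at all.
        return len(set(extracted) & set(lexicon_in_corpus)) / len(lexicon_in_corpus)

    def lexicon_ratio_at_k(ranked, lexicon, k):
        # Share of the k highest-ranked candidates that are, or contain, a lexicon marker.
        top = ranked[:k]
        hits = sum(1 for candidate in top if any(dm in candidate for dm in lexicon))
        return hits / len(top)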

5 We previously checked how many of the discourse markers in the lexicon could actually be found in corpus, and found that only 386 of them occurred in the 350,000-word sample; this is the upper bound for in-lexicon discourse marker extraction.

Figure 5.18: Ratio of discourse marker candidates that contain a discourse marker in the lexicon among the 100 and 1000 highest ranked by each individual parameter.


5.5.3.2 Parameter Tuning

To roughly determine which parameters were most useful for finding the kind of discourse markers targeted in the presented application, we evaluated the goodness of each single parameter by obtaining the ratio of discourse markers in the lexicon that could be found within the 100 and 1000 discourse marker candidates ranked highest by that parameter. In Figure 5.18 it can be seen that the parameters with the best behaviour in isolation are content, structure, lexical weight and occurrence in pausal context, although none of them performs above a dummy baseline fed with the same corpus sample. This baseline extracted 1- to 4-word strings after punctuation signs and ranked them according to their frequency, so that the most frequent were ranked highest. Frequencies of strings were normalised by length, so that normalised frequency = length · log(frequency). Moreover, the frequency of strings containing stopwords was reduced.
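A sketch of this baseline is given below; the stopword list and the penalty applied to strings containing stopwords are illustrative, since the text only states that their frequency was reduced:

    import math
    import re
    from collections import Counter

    TOY_STOPWORDS = {"de", "la", "el", "que", "y", "en"}

    def baseline_ranking(text, max_len=4):
        # Count the 1- to 4-word strings that immediately follow a punctuation sign
        # (approximated here by splitting at punctuation and taking fragment-initial strings).
        counts = Counter()
        for fragment in re.split(r"[.;:!?]\s*", text):
            words = fragment.lower().split()
            for n in range(1, max_len + 1):
                if len(words) >= n:
                    counts[" ".join(words[:n])] += 1

        def score(string):
            length = len(string.split())
            value = length * math.log(counts[string]) if counts[string] > 0 else 0.0
            if any(w in TOY_STOPWORDS for w in string.split()):
                value *= 0.5          # illustrative reduction for strings with stopwords
            return value

        # Most frequent strings (after normalisation) ranked first.
        return sorted(counts, key=score, reverse=True)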

                                                        baseline    X-Tractor
Coverage of the discourse marker lexicon                   88%        87.5%
Ratio of discourse markers in the lexicon
    within the 100 highest ranked                          31%        41%
    within the 1000 highest ranked                         21%        21.6%
Noise within the 100 highest ranked                        57%        32%
Enhancement ratio within the 100 highest ranked             9%        15%

Table 5.22: Results obtained by X-Tractor and the baseline

However, the same dummy baseline performed better when fed with the whole of the newspaper corpus, consisting of 3.5 million words.

This, and the bad performance of the parameters that are more dependent on corpus size, like frequency and mutual information, clearly indicates that the performance of X-Tractor, at least for this particular task, will tend to improve when dealing with huge amounts of corpus. This is probably due to the data sparseness that affects discourse markers.

This evaluation provided a rough intuition of the goodness of each of the parameters, but it failed to capture interactions between them. To assess that, we evaluated combinations of parameters by comparing them with the lexicon. We finally came to the conclusion that, for this task, the most useful parameter combination consisted in assigning a very high weight to structural and discourse-contextual information and a relatively important weight to content and length, while no weight at all was assigned to frequency or mutual information. This combination of parameters also provides an empirical approach to the delimitation of the concept of discourse marker, by eliciting the most influential among a set of discourse marker-characterising features.

However, the evaluation of parameters failed to capture the number of discourse markers not present in the lexicon that were retrieved by each parameter or combination of parameters. To do that, the highest ranked discourse marker candidates of each of the lists obtained for each parameter or parameter combination would have had to be revised manually. That is why only the best combinations of parameters were evaluated as to the enhancement of the lexicon they provided.

5.5.3.3 Results with combined parameters

In Table 5.22 the results of the evaluation of X-Tractor and the mentioned baseline are presented. From the sample of 350,000 words, the baseline obtained a list of 60,155 discourse marker candidates, while X-Tractor proposed 269,824. Obviously, not all of these were actual discourse markers, but both systems present an 88% coverage of the discourse markers in the lexicon that are present in this corpus sample, which were 372.

Concerning the goodness of markerhood assessment, it can be seen that 43% of the 100 discourse marker candidates ranked highest by the baseline were or contained actual discourse markers, while X-Tractor achieved 68%. Out of these, the baseline succeeded in identifying 9% of discourse markers that were not in the lexicon, while X-Tractor identified 15%. Moreover, X-Tractor identified 8% of temporal expressions. The fact that they are identified by the same features characterising discourse markers indicates that they are very likely to be treated in the same way, in spite of their heterogeneous discursive content.

In general terms, it can be said that, for this task, X-Tractor outperformed the baseline, succeeded in enlarging an initial discourse marker lexicon, and obtained quality results with low noise. It seems clear, however, that the dummy baseline is useful for locating discourse markers in text, although it provides a limited number of them.

Through this application of X-Tractor to a discourse marker extraction task for Spanish, we have shown that bootstrap-based lexical acquisition is a valid method for enhancing a lexicon of discourse markers, thus improving the limited coverage of the starting resource. The resulting lexicon exploits the properties of the input corpus, so it is highly portable to restricted domains.

This high portability can be understood as an equivalent of domain independence. The use of this empirical methodology circumvents the bias of human judges, and elicits the contribution of a number of parameters to the identification and characterization of discourse markers. Therefore, it can be considered a data-driven delimitation of the concept of discourse marker.

Future improvements of this tool include applying techniques for the interpolation of variables, so that the tuning of the parameters for markerhood assessment can be carried out automatically. The process of inducing rules from the lexicon for the rule module can also be automated, given classes of discourse marker-constituting words and classes of discourse markers. Moreover, the tool has to be evaluated on bigger corpora. Another line of work consists in exploiting other kinds of knowledge for discourse marker extraction and weighting. For example, annotated corpora could be used as input, tagged with morphological, syntactic, semantic or even discursive information.

5.6 Discussion

In this chapter we have provided empirical data that support our theoretical claims, both from human judgements and from automatic procedures.

A corpus of texts summarized by different human judges has been studied, providing support to the claim that the proposed representation of discourse organization is useful to model the behaviour of human judges in fine-grained (intra-clausal) text summarization. Judges show a higher degree of agreement in discourse segments that occur in highly marked discourse relations, like cause, revision or equality. The probability that a word is removed, however, seems to be better modelled by shallow cues, like syntactic form or the presence of discourse markers. In a preliminary study we have shown that a combination of these two aspects, semantics and form, provides a good modelling of discourse structure for text summarization, identifying parts of text that judges tend to consider relevant or irrelevant, and parts where they tend to disagree.

A bigger corpus could give stronger statistical validity to the claim, and it could also be exploited to refine the conclusions about how different kinds of information (shallow cues, basic discursive semantics) influence the perception of relevance by human judges. To develop such a corpus, texts have to be annotated with information about their surface properties and about the meaning of the relations between discourse units. The first kind of annotation can be carried out semiautomatically with the automatic procedures that have resulted from this thesis, and then validated by human judges. The second kind of annotation, however, is much more costly, because it involves a serious interpretation of texts and even requires human judges to be instructed about the framework beforehand. As an alternative, existing corpora annotated with comparable discourse representations could also be exploited, like the RST corpus (Carlson, Marcu and Okurowski 2003). To do that, RST labels for relations would have to be mapped to our basic meanings.

We have also found that naive judges show a degree of agreement well beyond chance when identifying our proposed semantics for relations between discourse segments, indicating that this representation captures speakers' intuitions. However, the agreement between judges does not reach reliable degrees of stability or reproducibility. This is probably due to the fact that the judges are not trained professionals. As a result of this annotation, some conclusions about the nature of relations between minimal discourse units have been drawn. In the first place, we have found that sentential syntax is almost always monotonically percolated to discourse structure. We have seen that the kind of relations studied have inter- and intra-sentential scope, but cannot capture relations between paragraphs. Finally, we have seen that discourse markers are highly informative of discourse structure, because judges hardly ever underspecify their meaning, but they tend to disagree on the discourse semantics they convey.

On the other hand, we have shown that our theoretical concepts can be well addressed by shallow techniques. We have presented two applications, one to identify discourse segments and another to identify new discourse markers. Studying these two applications, we have shown which features are more useful to improve the performance of automatic tools. We have presented a preliminary implementation of an algorithm for the automatic segmentation of text, based on the theoretical concepts presented in Section 3.3. We have presented results exploiting heterogeneous information at varying degrees of reliability, showing that discourse markers and punctuation are the most informative clues to identify discourse segments. We have compared the results of automatic segmentation with two human-based gold standards, by precision and recall. Future work includes carrying out an extensive comparison between human and automatic procedures, identifying cases where human judges disagree, cases where judges agree but disagree with automatic procedures, and cases where both humans and automatic procedures agree.

We have also presented X-Tractor, a tool to create and, most of all, enhance discourse marker lexica based on the features characterizing discourse markers in huge amounts of text. We have shown that some features that are not strictly defining of discourse markers are nevertheless very useful to characterize them in corpus, for example the punctuation context, their discursive content, their internal structure, their lexical weight, etc. It would be interesting to characterize the discourse markers in our prototypical lexicon using X-Tractor, and to compare the performance of the tool with this smaller lexicon. It would also be interesting to compare the behaviour of parallel discourse markers in the three languages (Catalan, Spanish and English) when they are characterized by the features of their occurrences in corpus. Another interesting issue would be assessing the intuitions of judges with respect to what is and what is not a discourse marker. We could obtain measures of agreement for different human judges in the task of tagging the discourse marker candidates provided by X-Tractor with their degree of markerhood.

The future work that most naturally follows from what has been exposed up to here is the implementation and evaluation of our model of discourse organization in an independent (not application-oriented) discourse parser.
This implementation seems rather trivial given the hierarchy of markedness of discursive meanings presented in this thesis, and the attachment algorithm discussed in Appendix B.

CHAPTER 6

Conclusions

In this thesis we have addressed the problem of text summarization from a linguistic perspective. After reviewing some work in the area, we came to the conclusion that the approaches to text summarization that are satisfactory are precisely those that rely on general properties of language, and that these properties are reflected in the surface realization of texts. Based on this brief overview, our starting hypotheses were:

1. A representation of texts at discourse level is useful for AS systems to improve the quality of resulting summaries.

2. Such a representation can be built based on evidence found in the surface realization of texts.

(a) Systematic relations can be established between this evidence and the behaviour of human judges when summarizing texts.

(b) This evidence can be the basis to obtain (part of) the targeted representation of discourse by shallow NLP techniques, more concretely, by those techniques available for Catalan and Spanish.

In order to test the validity of these two main hypotheses, we have developed a framework to obtain a representation of texts at discourse level exploiting linguistic evidence at surface level. We have shown that this representation contributes to improving the quality of summaries in two different summarization approaches, and also that it is useful to model the behaviour of human judges in summarizing texts. The work presented in this thesis does not represent a giant step in the understanding of essential questions in the areas of automatic text summarization or discourse analysis. The main aim of this work is to constitute solid ground whereupon deeper, more insightful theories can be built, by systematizing and clarifying some properties of text that have been used in various approaches to text summarization in a rather unprincipled way.


6.1 Contributions

Laura, tú serás la generadora de obviedades.

Horacio Rodríguez, playing agent system game, Jaén 2001

We have provided a systematization of basic mechanisms of discourse organization that are useful for text summarization and can be captured with state-of-the-art NLP techniques for Catalan and Spanish. The aim of this systematization is to explicitly relate surface features of texts that are indicative of discourse structure (shallow cues) with a theory of how discourse is organized, always oriented towards obtaining a representation of texts that can improve the quality of automatic summaries. We have argued that this systematization is useful to overcome those limitations of shallow cues that lie only in the lack of principledness in the way they are exploited. Moreover, it is also useful to ground a theory of discourse organization that, because of its strong empirical basis, overcomes some of the controversies found in the literature.

In order to carry out this systematization, we have determined which shallow cues provide discursive information within the scope of shallow NLP techniques, more concretely, within the capacities of the NLP techniques currently available for Catalan and Spanish. Punctuation, some syntactic structures and, most of all, discourse markers are especially useful to delimit minimal discourse units and the relations between them.

We determined that the information provided by this kind of cue is only reliable at short scope, more concretely, at inter- and intra-sentential scope. Moreover, we also found that this information is often underspecified. We then specified a formal representation of the organization of discourse that captures the idiosyncrasies of the information obtained from shallow cues. We propose that discourse can be represented as a set of independent sequences of hierarchical trees. Each of these sequences captures a different kind of relation between the same set of discursive units, so that heterogeneous meanings are explicitly separated, which provides an elegant way of capturing some configurations of relations that cannot be captured by strict tree-like representations. The fact that the structures are not a single tree, but a sequence of trees, is determined by the limited scope of the information provided by the shallow cues we are exploiting. When a shallow cue is found, content-rich relations can be established between discourse units. But there are some parts of text where no such cues can be found and that are beyond the scope of surrounding shallow cues; in these cases, default relations are established between units. These default relations are expressed as precedence in a list.
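A hedged sketch of what this representation might look like as a data structure follows; the class and field names are ours, not part of the thesis, and the two dimensions shown are the structural and semantic dimensions introduced above:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class DiscourseNode:
        # A discourse segment; children are attached by content-rich relations
        # signalled by shallow cues, each labelled with a basic meaning.
        segment: str
        relation_to_parent: Optional[str] = None
        children: List["DiscourseNode"] = field(default_factory=list)

    @dataclass
    class DiscourseRepresentation:
        # One sequence of trees per dimension of meaning; within a sequence,
        # adjacent trees are related only by default precedence.
        structural: List[DiscourseNode] = field(default_factory=list)
        semantic: List[DiscourseNode] = field(default_factory=list)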

Minimal discourse units have been defined by explicitly relating their theoretical and empirical properties. We have distinguished two main kinds of units: those conveying propositional content (segments) and those that signal how discourse should be organized (markers). We have developed automatic procedures to identify and characterize both kinds of units. On the one hand, we have evaluated the performance of different algorithms to automatically identify discourse segments in text, in terms of precision and recall as compared to two different gold standards produced by human judges. We have shown that punctuation and, most of all, discourse markers are the most useful evidence to identify discourse segments in text. We have then applied lexical acquisition techniques to enhance a starting lexicon of discourse markers by exploiting surface features that characterize them in text. We have shown that punctuation and co-occurrence with other discourse markers are also useful to identify previously unseen discourse markers in text.

An inventory of basic discursive meanings has been induced from the evidence provided by highly grammaticalized discourse markers, and a method for inducing it has been established. The meaning of relations between discourse units is described as a conglomerate of basic meanings. These basic meanings are organized in dimensions grouping heterogeneous meanings, which are in turn organized in a range of markedness that makes it possible to assign an unmarked meaning to unmarked cases by default, thus guaranteeing that all relations between discourse units can be described within the presented framework, even without the presence of a shallow cue. We have induced a set of basic discursive meanings based on the evidence provided by discourse markers. In contrast to previous work pursuing this method (Knott 1996), we only exploit highly grammaticalized discourse markers to identify basic distinctions in discursive meaning. Thus, we only provide a coarse-grained description of the semantics of discourse relations, but one with two interesting properties for NLP applications: the set of basic meanings is much less controversial than in earlier approaches, and it can be captured by shallow NLP techniques with a reliable degree of certainty.

We have carried out an experiment where three judges, two naive judges and a linguist, applied the proposed set of meanings to describe relations between discourse units. The level of agreement between judges is high enough to assert that this set of basic meanings seems to capture speakers' intuitions about the organization of discourse. Moreover, a closer study of the patterns of annotation shows that (a) discourse markers are highly informative of discourse structure, but their semantics is unclear, (b) the kind of relations studied have inter- and intra-sentential scope, but cannot capture relations between paragraphs, and (c) sentential syntax is percolated to discourse structure.

Finally, we have carried out some experiments showing that the representation of discourse proposed in this thesis improves the quality of automatic summaries and is a good model of some aspects of human summarization. We have integrated the proposed representation of discourse within two approaches to automatic summarization: combined with a lexical chain summarizer (Section 2.4.2.1) and as one of the analysis modules of an e-mail summarizer (Section 2.4.2.2). In both cases we found that the discursive analysis contributed to the improvement of the automatic summaries.
On the other hand, we have applied this analysis of discourse to model the behaviour of human judges in a task where they assessed the relevance of words in a text by removing those that they considered unnecessary.

We have shown that the behaviour of judges does not significantly differ from random if it is considered as it is, but when texts are represented with the proposed analysis of discourse, their behaviour can be described with much more accuracy: it is possible to identify parts of text where judges tend to agree and, most of all, it is possible to identify parts of text that judges tend to consider irrelevant.

As a result of the work presented here, a number of resources have been created: annotated corpora, a lexicon of prototypical discourse markers and systematic procedures for the identification of minimal discourse units and their relations, formalized as algorithms and, in some cases, also implemented as automatic procedures.

annotated corpora: we have developed corpora for the fine-grained evaluation of summarization, a small corpus of journalistic articles (described in Section 5.2, available at http://lalonso.sdf-eu.org/compression/) and a bigger corpus of e-mails (described in Section 2.4.2.2, available at http://www.lsi.upc.es/~bcasas/carpanta/). In another small corpus, relations between discourse units have been identified and their meaning has been described (described in Section 5.3, available at http://lalonso.sdf-eu.org/discor). These three corpora are small, but they provide empirical evidence for our theoretical claims, and have served as a testbed to establish annotation guidelines that can be used to enhance them in the future.

a lexicon of prototypical discourse markers: containing 84 discourse markers with a near-parallel in Catalan, Spanish and English, characterized by the discursive meaning that they prototypically convey and by their syntactic behaviour (described in Appendix A). Being small, this lexicon has a very limited recall, but it can be enhanced by applying X-Tractor, a lexical acquisition tool targeted at obtaining previously unseen discourse markers from large amounts of corpus with a good level of performance (described in Section 5.5).

systematic procedures for the analysis of discourse organization: we have developed an algorithm to identify discourse segments exploiting punctuation, some syntactic structures and discourse markers. Two different implementations have been evaluated, showing good performance. We have also specified an algorithm to determine the attachment point of a segment and its topography, which, together with the organization of meaning in a range of markedness, makes it trivial to implement an automatic procedure to identify and characterize relations between discourse units that have been previously identified by the segmentation algorithm.

6.2 Future Work

just for today...

Maria Fuentes, quoting Laura Alonso, research stay, Girona 2003

The work presented in this thesis is only a small step forward in the investigation of the systematic relations between shallow cues and theoretical approaches to the organization of texts. It builds mostly upon the work of Knott (1996), Marcu (1997b) and Forbes, Miltsakaki, Prasad, Sarkar, Joshi and Webber (2003), so most of the interesting lines of research are already suggested there. The final goal of all these approaches is to automatically obtain a comprehensive analysis of the discursive organization of texts, aiming at full natural language understanding. Of course we are still far from reaching that point, so the lines of future research we discuss here are much less ambitious.

First of all, it would be interesting to provide a stronger empirical basis for the theoretical proposals of this thesis, enhancing the evaluation efforts presented in Chapter 5. In Section 5.2, a corpus of texts summarized by different human judges was studied, providing support to the claim that the proposed representation of discourse organization is useful to model the behaviour of human judges in fine-grained, extractive text summarization. A bigger corpus could give stronger statistical validity to the claim, and it could also be exploited to refine the conclusions about how different kinds of information (shallow cues, basic discursive semantics) influence the perception of relevance by human judges.

However, in order to evaluate the kind of phenomena that we have studied in Section 5.2, texts have to be annotated with information about their surface properties and about the meaning of the relations between discourse units. The first kind of annotation can be carried out semiautomatically with the automatic procedures that have resulted from this thesis, and then validated by human judges. The second kind of annotation, however, is much more costly, because it involves a serious interpretation of texts and even requires human judges to be instructed about the framework beforehand. As an alternative, existing corpora annotated with comparable discourse representations could also be exploited, like the RST corpus (Carlson, Marcu and Okurowski 2003). To do that, RST labels for relations would have to be mapped to our basic meanings. But then, we also require that the texts have been summarized in such a way that the kind of phenomena that can be modelled by our representation can be properly evaluated, that is, corpora where summarization has been carried out at a finer-grained (intra-clausal) level, thus excluding corpora summarized at sentence level. A corpus of <text, abstract> pairs cannot be directly used, because we evaluate the relevance assigned to the effective words that can be found in a text, which is not straightforward if we cannot establish a direct mapping between words in the source text and in the summary.

Some techniques have been proposed to establish this mapping (Knight and Marcu 2000; Jing 2001), but the results are approximate and would probably bias the result of the evaluation. So, besides the corpus presented in Section 5.2, the corpus that to our knowledge is most adequate for this kind of evaluation is the one produced for the evaluation of Carpanta (see Section 2.4.2.2). This corpus can be very useful to evaluate the adequacy of our model, as long as it is properly enriched with the kind of annotations presented in the previous paragraph.

In general, it would be interesting to compare the performance of our analysis of discourse and its contribution to the improvement of automatic summaries with similar approaches, for example Marcu (1997b). It would also be interesting to investigate how the representation presented here interacts with summarization procedures other than the two that have been presented here.

Some of the automatic procedures presented in this thesis have been evaluated, as is the case of the segmentation algorithms and the tool to identify previously unseen discourse markers. But some other procedures have not been implemented, and therefore they have not been evaluated, as is the case of the algorithm to determine the attachment point of a given discourse segment and its topographical configuration, that is, the segment with which a discourse segment is related and the structural properties of this relation. The procedure to label discourse relations with their semantic meaning has not been formalized in an algorithm, but its formalization and implementation are arguably trivial (a sketch of the conflict-resolution step follows the list below), given:

– the identification of discourse units,
– the identification of attachment points and their topographical configuration,
– the relation between surface cues (most of all discourse markers) and basic discursive meanings, and
– the organization of these meanings in a hierarchy of markedness, so that whenever two meanings are in conflict, the most marked is the one that remains.
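The last point of the list can be illustrated with a few lines of code. The ordering below, from unmarked to most marked, is only an illustrative rendering of the markedness range for the semantic dimension, and the candidate meanings are assumed to belong to the inventory presented in this thesis:

    # Illustrative markedness range for the semantic dimension,
    # from the unmarked default to the most marked meaning.
    SEMANTIC_MARKEDNESS = ["context", "equality", "cause", "revision"]

    def label_relation(evidence):
        # `evidence` is the list of basic meanings signalled by the shallow cues
        # found for one relation; with no evidence, the unmarked default applies,
        # and in case of conflict the most marked meaning wins.
        if not evidence:
            return SEMANTIC_MARKEDNESS[0]
        return max(evidence, key=SEMANTIC_MARKEDNESS.index)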

Only cases of conflict between different amounts of evidence remain to be solved. The implementation and evaluation of stand-alone computer programs applying the procedures to determine attachment points and assign meaning to relations is left for future work.

Another interesting line of research would be to apply the presented methodology for inducing basic discursive meanings to other languages, to test whether the proposed inventory of meanings can be considered basic cross-linguistically. Applying the methodology of semantic maps and taking into account only highly grammaticalized discourse markers, evidence from other languages can be easily integrated with what has been presented in this thesis, and corrected or refined if necessary. Once an inventory is well established for a set of languages, basic lexica of discourse markers can be created for these languages, and then enhanced using X-Tractor. These lexica would be the core of tools for the automated identification of discourse units and their relations. It would then be possible to test whether the algorithms presented here for the identification of segments and the configuration of attachment points hold cross-linguistically, that is, whether they are language-independent, as we suspect. To assess the performance of these automatic procedures, however, annotated corpora would have to be created.

The effort required to build these corpora would be significantly reduced by pre-annotating them with the automatic procedures, and also because annotation guidelines are already well established.

Then, a more ambitious, and quite obvious, line of future research consists in extending the scope of the presented theory of discourse organization, which is currently only able to reliably identify intra-sentential and inter-sentential discourse relations. The shallow cues that constitute the empirical basis of our theory of discourse do not seem to provide reliable information at long scope; it seems clear, then, that if the scope of the relations is to be extended in the representation of discourse, other kinds of information will have to be integrated. We made an initial exploration of how to combine the representation of discourse provided by our shallow cues with information provided by lexical chains, and found that the integration was successful, but it has to be pursued further.

APPENDIX A

Lexicon of discourse markers

In this appendix we present the seminal discourse marker lexicon that we have used in this thesis. The discourse markers listed here were the primary source of evidence to draw the semantic maps used to obtain an inventory of basic discursive meanings, presented in Section 4.2.3. This lexicon is also the basis for the implementations of a discourse segmenter (Section 5.4) and for the discourse analysis exploited by the e-mail summarizer Carpanta (Alonso, Casas, Castellón and Padró 2004b).

The lexicon is parallel in three languages: Catalan, Spanish and English. Therefore, in this starting version of the lexicon we have only included those discourse markers that have a near-synonym in one of the other languages. Those that do not have a near-synonym have been included in the extended version of the lexicon created by the bootstrapping techniques applied to this starting lexicon (see Section 5.5). As explained in Section 3.4, the discourse markers that constitute the prototypical lexicon were obtained from previous work, mostly Knott (1996) and Marcu (1997b), with the restriction that they are highly grammaticalized. We have also included in the lexicon some closed class words, obtained from the FreeLing morphosyntactic analyzer (FreeLing). We have discarded closed class words that are very vague (Table A.7) and highly ambiguous discourse markers (Table A.6).

In this lexicon, discourse markers are characterized by their structural (continuation or elaboration) and semantic (revision, cause, equality, context) meanings, and they are also associated to a morphosyntactic class (part of speech, PoS), one of adverbial (A), phrasal (P) or conjunctive (C) (see Section 3.4.3.2 for a description of these morphosyntactic classes). No information has been encoded about the reliability of discourse markers with respect to their discursive (vs. sentential) function. The only information of this kind that we provide is that discourse markers that are highly ambiguous with respect to their function (see Table A.7) are not included in the lexicon.

Sometimes a discourse marker is underspecified with respect to a meaning. We encode this with a hash. This tends to happen with structural meanings, because these meanings can well be established by discursive mechanisms other than discourse markers, and the presence of the discourse marker just reinforces the relation, whichever it may be. Sometimes a discourse marker is ambiguous with respect to two meanings. In these cases, we write the predominant meaning in italics and the secondary meaning in parentheses, or both of them in italics if no predominant meaning can be determined.


                 revision   cause   equality   context   total
elaboration          4        9        10        22        41
continuation         9        9         6         4        28
underspecified       1        –        10         4        15
total               14       18        26        32        84

Table A.1: Distribution of the number of discourse markers across the different meanings. Some discourse markers have been assigned to more or fewer than one meaning per dimension, because they are ambiguous or underspecified, respectively.

Resolving such ambiguities normally requires information about the context of occurrence, but we have not associated discourse markers with the contextual features that can be of aid to disambiguate them. Nevertheless, it seems that determining the adequate meaning associated to a particular instance of a discourse marker can be well addressed by general procedures, directly implemented in the algorithms that exploit the information stored in a lexicon (segmentation algorithms, discourse parsers, etc.).

All in all, the lexicon is formed by 84 discourse markers, representing different discursive meanings as can be seen in Table A.1. We present the lexicon of prototypical discourse markers in Tables A.2 to A.6. Additionally, we also present some items that are not included in the seminal lexicon, but that have some relation to the items that are included: very vague closed class words are presented in Table A.7, and discourse markers without a parallel in some of the other languages are presented in Table A.8. Interesting aspects of the discourse markers in this lexicon are discussed after the corresponding tables. The lexicon can be found in electronic format, enriched with illustrative examples for each of the discourse markers, at http://lalonso.sdf-eu.org/discmar/.
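For illustration, an entry of this lexicon could be encoded roughly as follows; the field names are ours, underspecified values (a hash in the electronic lexicon, a dash in the tables below) become None, and ambiguous meanings are given as a tuple with the predominant one first. The two example entries are taken from Tables A.2 and A.4:

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class DiscourseMarkerEntry:
        catalan: str
        english: str
        spanish: str
        structural: Optional[str]   # "continuation", "elaboration" or None if underspecified
        semantic: Tuple[str, ...]   # predominant meaning first when ambiguous
        pos: str                    # "A" adverbial, "P" phrasal, "C" conjunctive

    PERO = DiscourseMarkerEntry("però", "but", "pero", "continuation", ("revision",), "P")
    A_MES = DiscourseMarkerEntry("a més", "moreover", "además", None, ("equality",), "A")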

Catalan            English            Spanish            structural     semantic   PoS
a pesar de         despite            a pesar de         elaboration    revision   P
encara que         although           aunque             elaboration    revision   P
excepte            except             excepto            elaboration    revision   P
malgrat            in spite of        pese a             elaboration    revision   P
no obstant         however            no obstante        continuation   revision   A
nogensmenys        nevertheless       sin embargo        continuation   revision   A
en realitat        actually           en realidad        continuation   revision   A
de fet             in fact            de hecho           continuation   revision   A
al contrari        on the contrary    al contrario       continuation   revision   A
el fet és que      the fact is        el hecho es que    continuation   revision   P
és cert que        it is true that    es cierto que      continuation   revision   P
però               but                pero               continuation   revision   P
tot i això         even though        con todo           continuation   revision   A
ara bé             well now           ahora bien         continuation   revision   A
de tota manera     anyway             de todos modos     –              revision   A

Table A.2: Seminal discourse marker lexicon: discourse markers signalling revision.

however: differs from although in its values for continuation or elaboration. Although each of them can be used to rephrase the other in some contexts, however is attached to the segment that indicates continuation, while although is attached to the segment that indicates elaboration.

actually / in fact: their primary meaning is marking evidentiality (1), but they tend to be structurally equivalent to however, as we have shown using multiple alignment techniques (Alonso, Castellón, Escribano, Messeguer and Padró 2004c). In English their evidentiality meaning is more predominant than the revision meaning, and so their contribution as discourse markers of revision is only reliable when they co-occur with other discourse markers also signalling revision (2-a) or in certain punctuation contexts (2-b), although they can also signal revision without any of this further evidence (3-a). In Spanish and Catalan their primary meaning is revision ((3-b) and (3-c), respectively), comparable to it is true that. The kind of revision that these discourse markers tend to convey in Spanish and Catalan is correction^1. We can speculate that the reason why the revision meaning of these discourse markers is more primary in Spanish and Catalan than in English is that in these languages the correction meaning tends to be expressed by discourse markers, as can be seen in the fact that it is lexicalized (sinó, sino), while in English it is covered by the all-purpose revision discourse marker but, and correction is only distinguished from other kinds of revision by other linguistic features.

1 A prototypical example of correction would be: "This is not black, but white."

(1) They could also help themselves by thinking through a problem before phoning the support desk. In many cases a user will actually solve his or her own problem while on the phone to Neptune!

(2) a. Standardisation has never been the IT industry's strong point, and the answer is "probably not". However, they don't actually all do the same job.
    b. He then argues that "it is not sufficient (for me) to tell the conference that there will be no return to mass picketing". Actually, I never mentioned picketing, mass picketing or otherwise, in my speech, but let that pass.

(3) a. Amnesty warmly welcomed the release of prisoners of conscience and the repeal of certain articles, but has urged that the legislation be extended to include reform or repeal of further articles of the Turkish Penal Code, under which POCs may be held. The new law may in fact increase the already serious risk of torture facing political detainees.
    b. La idea inicial de Maragall fue celebrar una exposición internacional, pero ese propósito falló cuando alguien de su gabinete descubrió que habían llegado tarde para obtener el reconocimiento internacional para un acontecimiento de este tipo. En realidad poco importaba qué se hiciera. Tanto Clos como Maragall perseguían en esencia poner una nueva fecha al futuro de la ciudad.
    c. Tot va començar, com en les novel·les policíaques, amb un fiscal, entestat a treure a la llum el taló d'Aquil·les del president demòcrata. L'ham: una becària de 22 anys, grassoneta –usa la talla 46–, de pits exuberants i boca àmplia, una mica esbojarrada ja que creia tenir una relació sentimental quan en realitat va mantenir 10 trobades sexuals servides a domicili amb el senyor Clinton, qui, durant set mesos, es va obstinar a negar haver mantingut contacte físic amb ella.

it is true that: in contrast with actually or in fact, its primary meaning is revision, like en realidad, en realitat, de fet, de hecho in Spanish and Catalan.

Catalan                English             Spanish              structural     semantic   PoS
donat que              given that          dado que             elaboration    cause      P
perquè                 because             porque               elaboration    cause      P
degut a                due to              debido a             elaboration    cause      P
gràcies a              thanks to           gracias a            elaboration    cause      P
per si                 in case             por si               elaboration    cause      P
per                    because of          por                  elaboration    cause      P
per això               that's why          por eso              continuation   cause      A
en conclusió           in conclusion       en conclusión        continuation   cause      A
així que               thus                así que              continuation   cause      P/A/P
com a conseqüència     as a consequence    como consecuencia    continuation   cause      A
per                    in order to         para                 continuation   cause      P
perquè                 so that             para que             continuation   cause      P
per aquesta raó        for this reason     por esta razón       continuation   cause      A
per tant               so                  por tanto            continuation   cause      A/C/A
en efecte              in effect           en efecto            continuation   cause      A

Table A.3: Seminal discourse marker lexicon: discourse markers signalling cause.

in conclusion: while it looks similar to in sum, this discourse marker tends to convey new information, not to rephrase it; compare the following example with the example for in sum. With respect to the effects on coherence and relevance, it is comparable to consecutive discourse markers like that's why or so then, which can also signal relations that are not motivated by a causal relation in the real world but have the same rhetorical strength as those that are motivated by a real causal relation. It is comparable to in effect.

(4) The European Court further ruled in this case that Arts 48 and 59 of the EC Treaty do not prevent a member state from requiring that the exercise of the profession of auditor in that state by a person qualified to carry on that profession in another member state be conditions which are objectively necessary to guarantee observation of professional rules concerning the permanence of the infrastructure in place for the completion of the work, the effective presence in the member state and assurance of the observation of professional ethics, unless respect for such rules and conditions is already guaranteed by a reviseur d'entreprises, whether a natural person or a firm, established and recognised in the state, and in whose service is placed, for the duration of the work, the person who intends to exercise the profession of auditor. In conclusion, one has to wonder whether the borders are in fact open.

perquè / per: in Catalan these discourse markers are underspecified with respect to structural meaning; they can be equivalent to so that / to (5-a) or to because / because of (5-b).

(5) a. Avui sento por perquè han declarat impunes tots els caps d'Estat.
       Today I feel frightened because all heads of State have been declared impune.
    b. La Generalitat ha fet una crida a la solidaritat perquè s'ocupin aquestes cases.
       The Generalitat has made a call to solidarity so that these houses are occupied.

Catalan              English               Spanish              structural     semantic   PoS
en resum             in sum                en resumen           elaboration    equality   A
concretament         specifically          concretamente        elaboration    equality   A
en essència          essentially           en esencia           elaboration    equality   A
en comparació        in comparison         en comparación       elaboration    equality   A
en altres paraules   in other words        en otras palabras    elaboration    equality   A
en particular        in particular         en particular        elaboration    equality   A
és a dir             that is to say        es decir             elaboration    equality   C
per exemple          for example           por ejemplo          elaboration    equality   A
precisament          precisely             precisamente         elaboration    equality   A
tal com              such as               tal como             elaboration    equality   P
en darrer lloc       lastly                por último           continuation   equality   A
per una banda        on the one hand       por un lado          continuation   equality   A
per altra banda      on the other hand     por otro lado        continuation   equality   A
a propòsit           by the way            a propósito          continuation   equality   A
no només             not only              no sólo              continuation   equality   P
sinó també           but also              sino también         continuation   equality   P
en dues paraules     in short              en dos palabras      –              equality   A
a més                moreover              además               –              equality   A
també                also                  también              –              equality   A
a banda              besides               aparte               –              equality   A
encara més           what's more           aún es más           –              equality   A
fins i tot           even                  incluso              –              equality   P
especialment         specially             especialmente        –              equality   A
sobretot             above all             sobretodo            –              equality   A

Table A.4: Seminal discourse marker lexicon: discourse markers signalling equality.

not only ... but also

lastly: unlike first of all or to begin with, and like secondly or thirdly, this discourse marker is not ambiguous, because it requires a context of sequence to be felicitous.

on the one hand / on the other hand: like lastly, they require a sequence context to be felicitous, so they are not ambiguous with respect to their structural or semantic meaning, but their ambiguity with respect to scope varies greatly. If they co-occur (6), their scope can be determined if we consider that the scope of on the one hand reaches until the point of occurrence of on the other hand, and that the latter has a scope of an equivalent size. However, if on the other hand occurs alone, its scope is very hard to determine automatically, and probably also by human judges.

(6) It does occur to Fukuyama that religion might have some sort of unease to express with all this, but he appears to conceive of religion under only two modes. On the one hand, there is fundamentalist counter-ideology, the

Islamic theocratic state. This, it is to be assumed, his liberal readers may take seriously as a threat, but hardly as an option. And on the other hand, there are "less organised religious impulses", religion as individual preference. This he knows can readily be accommodated as another sort of consumer commodity, "within the sphere of personal life permitted in liberal societies".
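The co-occurrence heuristic described above can be stated very compactly. The following fragment is only an illustration under assumed data structures (segment indices over a list of discourse segments; it is not the thesis implementation): on the one hand is taken to scope from its own segment up to the segment introduced by on the other hand, and the latter is assigned a span of the same size, clipped at the end of the text.

```python
def scope_of_hands(segments, one_hand_idx, other_hand_idx):
    """Illustrative scope heuristic for co-occurring 'on the one hand' /
    'on the other hand'. 'segments' is a list of discourse segments;
    the two indices point at the segments containing each marker."""
    # 'on the one hand' scopes from its segment up to (but not including)
    # the segment introduced by 'on the other hand'.
    first_scope = list(range(one_hand_idx, other_hand_idx))
    # 'on the other hand' gets a scope of equivalent size, clipped at text end.
    size = len(first_scope)
    second_scope = list(range(other_hand_idx,
                              min(other_hand_idx + size, len(segments))))
    return first_scope, second_scope

# Hypothetical run on a six-segment text where segment 2 holds 'on the one
# hand' and segment 4 holds 'and on the other hand':
# scope_of_hands(["s0", "s1", "s2", "s3", "s4", "s5"], 2, 4)
# returns ([2, 3], [4, 5])
```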

in short is ambiguous with respect to continuation or elaboration, because the discourse unit to which the discourse marker is attached can sometimes contribute new information, as in the following example.

(7) The authors maintain that the role of women in the Tigrayan society is still closely linked to their status in the feudal system. 1975 women were treated as children. They were not allowed to own land nor speak. In short women were at the bottom of the hierarchy of oppression with no rights of any kind.

in sum / essentially convey an elaborative relation because they repeat information that has already been given, even if this information is given in a shorter form. The utility of these discourse markers for automatic summarization is an ad-hoc property, dependent on the task and not on their effects with respect to coherence and relevance assessment. Therefore, it has to be handled by manually creating a special rule that overrides the general discursive rules.

Catalan | English | Spanish | structural | semantic | PoS
considerant | considering | teniendo en cuenta | elaboration | context | P
després | after | después | elaboration | context | P
abans | before | antes | elaboration | context | A
originalment | originally | originalmente | elaboration | context | A
a condició de | provided that | a condición de | elaboration | context | P
durant | during | durante | elaboration | context | P
mentre | while | mientras | elaboration | context | P
a no ser que | unless | a no ser que | elaboration | context | P
quan | when | cuando | elaboration | context | P
on | where | donde | elaboration | context | P
d'acord amb | in accordance with | de acuerdo con | elaboration | context | P
lluny de | far from | lejos de | elaboration | context | P
tan aviat com | as soon as | tan pronto como | elaboration | context | P
de moment | for the moment | por el momento | elaboration | context | A
entre | between | entre | elaboration | context | P
cap a | towards | hacia | elaboration | context | P
fins a | until | hasta | elaboration | context | P
mitjançant | by means of | mediante | elaboration | context | P
segons | following | según | elaboration | context | P
en qualsevol cas | in any case | en cualquier caso | continuation | context | A
aleshores | then | entonces | continuation | context | A
respecte de | with respect to | respecto a | continuation | context | P
en aquest cas | in that case | en ese caso | continuation | context | A
si | if | si | – | context | P
sempre que | whenever | siempre que | – | context | P
sens dubte | no doubt | sin duda | – | context | A
alhora | at the same time | a la vez | – | context | A

Table A.5: Seminal discourse marker lexicon: discourse markers signalling context.
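The entries in Tables A.4 to A.6 all share the same fields, so a very small data structure suffices to hold the seminal lexicon. The following sketch is only illustrative (the field names, the PoS codes left uninterpreted, and the lookup helper are assumptions, not the representation actually used in the thesis): each entry records the Catalan, English and Spanish forms, the structural and semantic meanings (possibly underspecified, written here as None), and the part-of-speech code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MarkerEntry:
    """One row of the seminal discourse marker lexicon (illustrative)."""
    catalan: str
    english: str
    spanish: str
    structural: Optional[str]  # 'continuation', 'elaboration' or None (underspecified)
    semantic: Optional[str]    # 'revision', 'cause', 'equality', 'context' or None
    pos: str                   # PoS code as given in the tables ('A', 'C' or 'P')

# A few rows of Table A.5, written out by hand for illustration.
SEMINAL_LEXICON = [
    MarkerEntry("després", "after", "después", "elaboration", "context", "P"),
    MarkerEntry("en qualsevol cas", "in any case", "en cualquier caso",
                "continuation", "context", "A"),
    MarkerEntry("si", "if", "si", None, "context", "P"),
]

def lookup(form, lang="catalan"):
    """Return the entries whose surface form in the given language matches."""
    return [entry for entry in SEMINAL_LEXICON if getattr(entry, lang) == form]
```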

first of all / to begin with  Like many discourse markers, these are lexically underspecified with respect to elaboration or continuation: they can reinforce progressive and elaborative relations that are actually signalled by means other than the discourse marker itself. They are also ambiguous between context and equality. If the marker is part of a sequence, as in example (8), it signals equality; if not, as in example (9), it signals context. By default, we ascribe it to context, and only if there is enough evidence is it ascribed to equality.

(8) a. Police say that cars are being stolen to be resold in car-starved east European countries. To begin with, thieves went for the likes of Golf GTis and BMWs, but now bread-and-butter cars are also being taken.

Catalan | English | Spanish | structural | semantic
com | like | como | elaboration | equality, context, cause
com | since | como | elaboration | cause, context
des de | since | desde | elaboration | cause, context
sobre | about | sobre | continuation, elaboration | context
sobre | over | sobre | continuation, elaboration | context
abans de res | first of all | antes que nada | – | context (or equality)
per començar | to begin with | para empezar | – | context (or equality)

Table A.6: Highly polysemic discourse markers.

b. In Four Saints Thomson's informality was given free reign since he first of all improvised the music at the piano, then, when it stuck, wrote it down to a figured bass.

(9) a. But there never was a threat to a new German-American special relationship, since there never was such a special relationship to begin with.
    b. "We are a bit of a way from that. But I certainly believe, first of all, we have to give what help we can," said Mr Hurd.

in any case indicates continuation and context. It seems to have effects comparable to revision, but it is hard to find what is denied. It seems to have contrastive functions, which are better attributed to the properties of continuation than to any possible revision. It is comparable to topic-based but, but in that case there seems to be more correlation with items signalling negative polarity, which supports an interpretation as revision. In this respect, in any case is different from anyway, which always conveys revision.

(10) In truth, however, humble photocopying has been overtaken by the wonders of the fax and personal computers complete with printers. Whatever the price of these latter (20 times their cost in the West) and reinforced customs procedures for their import, they are finding their way in. The controls in any case are surely doomed to fail.

then is characterized by the least marked meaning in each of the two dimensions, which makes it very close to narration.

unless  Even though it has inherent negative polarity, it does not convey revision but context, comparable to if or in case.

Catalan | Spanish | English
i | y/e | and
ni | ni | nor/neither
o | o/u | or
que | que | that
amb | con | with
sense | sin | without
contra | contra | against
en | en | in
a | a | at/to

Table A.7: Closed class words with very vague meaning.

Catalan | Spanish | English
sinó | sino | correction

Table A.8: Closed class words without a near synonym in some of the other two languages.

APPENDIX B
Heuristics to determine the structural configuration of discourse

As explained in Section 4.2.3.1.1, to obtain the topographical configuration of discourse structure, we have to determine, for each discourse unit:

a. its location in the structure of discourse, by determining the node in the structure (that is, the discourse unit) where it is attached

b. the topographical relation with the node where it is attached, that is, the shape of the arc linking the two nodes (discourse units)

In this Appendix we present the heuristics to address these two aspects of the configuration of discourse structure.

B.1 Heuristics to determine the most adequate attachment point

The problem of determining the attachment point of a discourse unit in a graph-like structure of discourse is a typical problem of formal discourse analysis. Polanyi (1988) and Webber (1988) modelled this problem considering discourse as an incremental hierarchical tree. In a strict tree-like structure, crossing branches are forbidden, so a discourse unit can only be legally attached to the set of nodes that are found on a path from the root to the last node that has been attached to the structure. This set of nodes is known as the right frontier of the discourse tree, and its nodes constitute the set of possible attachment points for incoming discourse units. As explained in Section 3.2, we do not model discourse as a single tree, but as a sequence of trees. In this case, the right frontier contains the nodes that constitute the path from the root of the last local tree to the last node that has been attached to this tree. The only difference from the traditional concept of right frontier is that the number of possible attachment points will probably be smaller than for a discourse tree covering the whole text.

Figure B.1: Boxed nodes are those at the right frontier of discourse, underlined nodes are those at distance +2 from the right frontier, which we also take into consideration as possible attachment points. Alphabetic order indicates the order of occurrence of the segments in linear text.

But even if the set is smaller, there will very often be more than one possible attachment point for a discourse unit, and it will be necessary to determine the most adequate one. Since the information provided by shallow techniques is error-prone, it is likely that a given representation of discourse does not adequately reflect the underlying structure. In an erroneous discourse structure, a discourse unit may not be attached to the most adequate node in the structure because that node is not placed on the right frontier. Just like in garden path phenomena, an incoming discourse unit may provide information that indicates how to remake an inadequate representation so that it reflects the actual structure underlying the text. Taking this into account, we enhance the set of possible attachment points to include nodes that are within a distance of +2 nodes from the right frontier, as illustrated in Figure B.1. These nodes are also evaluated as possible attachment points, weighing their position in the structure as a negative factor for the likelihood that the node is the most adequate attachment point for a given discourse unit. As a result, if a node has erroneously been left off the right frontier and there is enough evidence that it is the attachment point for a discourse unit, the structure of discourse is remade so that this node becomes part of the right frontier. This typically involves transforming a continuative relation (probably established by default) into an elaborative relation, as displayed in Figure B.2, where node E, having a continuative relation with node D, establishes an elaborative relation to allow F to be legally attached to D.

Many kinds of information can be of help to determine the best attachment point for a unit. Polanyi et al. (2004) claim that it can be satisfactorily determined by exploiting sentential syntactic structures and lexical similarity. In our implementation, we determine the attachment point with a set of very simple heuristics. In order to select the most adequate attachment, we first determine the subset of

Figure B.2: Reconfiguration of a structure to accommodate the attachment of a discourse unit to a node that is not at the right frontier of discourse, without crossing branches.

discourse units A1, A2, ..., An that are possible attachment points for a discourse unit DU. This subset is constituted by the nodes in the right frontier and those at distance +2 of the right frontier, as displayed in Figure B.1. Then, we rank the set of possible attachment points. For each A in A1, A2, ..., An, we consider:

structural proximity of A to DU, so that nodes that are closer to the right frontier are ranked higher (example nodes refer to Figure B.1):
  A is at the right frontier (A, L, R, U): +10
  A is at distance +1 of the right frontier (B, F, M, S, T): +1
  A is at distance +2 of the right frontier (C, G, I, N, O): 0

syntactical subordination of DU to A, so that if DU is subordinated to A, A is ranked higher:
  DU is a subordinate clause (relative, completive or non-personal) or phrase of A: +10
  DU is a subordinate clause of A dominated by a discourse marker: +1

sequential proximity of A to DU, so that discourse units that have come recently before DU in the text are ranked higher:
  A is in the same sentence as DU, with 0 discourse units in between: +10
  A is in the same sentence as DU, with 1 discourse unit in between: +5
  A is in the same sentence as DU, with > 1 discourse units in between: +1
  A is in the sentence preceding DU, with 0 discourse units in between: +5
  A is in the sentence preceding DU, with 1 discourse unit in between: +1

referential cohesion between A and DU, so that if DU contains some anaphoric expression whose referent is likely to be found in A, A is ranked higher:
  DU contains a relative pronoun and A is its matrix clause: +10
  DU contains a personal pronoun and A is the preceding clause: +1

lexical similarity of A with respect to DU, so that discourse units that share more words with DU are ranked higher. The values for this feature are continuous, normalized to a scale ranging from 5 (sharing most words) to 0 (not sharing any word).

As a result of ranking all possible attachment points, we obtain an ordered list where the topmost A is the node to which DU is most likely attached.

As explained above, if this topmost A is not in the right frontier, the topographical configuration of the structure of discourse must be changed, in order to allow DU to be attached to this A without crossing branches.
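The ranking just described can be read as a simple additive scoring scheme. The following sketch is only an illustration of that scheme under assumed data structures (a feature dictionary per candidate pair; the feature extraction itself is not shown and is not the actual implementation): each candidate A receives the sum of the scores listed above, and candidates are returned in decreasing order of score.

```python
def score_candidate(features):
    """Additive score for one candidate attachment point A of a discourse unit DU.
    'features' is assumed to be a dict of precomputed shallow features of (A, DU)."""
    score = 0
    # structural proximity: distance of A from the right frontier (0, 1 or 2)
    score += {0: 10, 1: 1, 2: 0}.get(features.get("frontier_distance"), 0)
    # syntactic subordination of DU to A
    if features.get("du_is_subordinate_of_A"):
        score += 10
    elif features.get("du_is_subordinate_with_marker"):
        score += 1
    # sequential proximity of A to DU
    in_between = features.get("units_in_between", 99)
    if features.get("same_sentence"):
        score += {0: 10, 1: 5}.get(in_between, 1)
    elif features.get("preceding_sentence"):
        score += {0: 5, 1: 1}.get(in_between, 0)
    # referential cohesion between A and DU
    if features.get("relative_pronoun_with_A_as_matrix"):
        score += 10
    elif features.get("personal_pronoun_and_A_preceding"):
        score += 1
    # lexical similarity, already normalized to the range 0..5
    score += features.get("lexical_similarity", 0)
    return score

def rank_attachment_points(candidates, extract_features):
    """Return the candidate attachment points ordered from most to least adequate."""
    return sorted(candidates,
                  key=lambda A: score_candidate(extract_features(A)),
                  reverse=True)
```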

B.2 Heuristics to determine the topography of the attachment

Once the attachment point A for a discourse unit DU has been determined, the topography of the attachment of DU to A must be determined, configuring the shape of the resulting structure. This has implications with respect to the availability of nodes as attachment points for other discourse units. As seen in Figure B.3, elaborative relations increase the number of nodes that are available for attachment of subsequent discourse units, that is, they enhance the right frontier. Continuative relations never add new nodes (B.4-d), and they may even reduce the number of nodes that are available for attachment, if the attachment point is at a high level of the structure (B.4-e). These effects also apply to a restricted graph representation such as the one we propose.

These heuristics are the implementation of the semantics of the structural meaning of discourse relations, whose linguistic aspects are discussed in Section 4.2.3.1.1. The meaning of relations in the structural dimension determines the shape of the edge linking the nodes (discourse units). Elaborative relations link a discourse unit DU at a lower level than its attachment point A, while continuative relations link them at the same level. In case there is no shallow cue to determine the structural semantics of a relation, the default structural meaning, continuation, is applied. As a result, discourse units are attached in linear order, without any hierarchical structure, as seen in Figure B.5. If a syntactical analysis is available, the default is to map sentential structure to discourse structure, as displayed in Figure B.6. When evidence from discourse markers is available, these default heuristics are enriched as shown in Figure B.7. These heuristics can be enhanced to include cohesive links (lexical relatedness, co-reference), document and genre structures, etc.
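The effect of the two structural meanings on the right frontier can be summarized in a few lines. This is only an illustrative sketch under an assumed representation of the frontier as a root-first list of nodes (not the representation actually used): an elaborative attachment keeps both A and DU available for later attachments, while a continuative attachment places DU at the same level as A and closes off A and everything below it.

```python
def update_right_frontier(frontier, A, DU, relation):
    """Update the right frontier (path from the root of the current local tree
    to the last attached node, ordered root-first) after attaching DU to A.
    Illustrative sketch only; 'relation' is 'elaboration' or 'continuation'."""
    i = frontier.index(A)          # A must already be on the frontier
    if relation == "elaboration":
        # DU hangs below A: A stays available and DU is added below it.
        return frontier[: i + 1] + [DU]
    else:
        # continuation (the default): DU is placed at the same level as A,
        # so A and everything below it stop being available.
        return frontier[:i] + [DU]

# Hypothetical run of Asher and Vieu's example (Figures B.3 and B.4):
frontier = ["top", "a"]
frontier = update_right_frontier(frontier, "a", "b", "elaboration")   # ['top', 'a', 'b']
frontier = update_right_frontier(frontier, "b", "c", "elaboration")   # ['top', 'a', 'b', 'c']
frontier = update_right_frontier(frontier, "c", "d", "continuation")  # ['top', 'a', 'b', 'd']
frontier = update_right_frontier(frontier, "b", "e", "continuation")  # ['top', 'a', 'e']
```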

(1) (a) John had a great evening last night. (b) He had a great meal. (c) He ate salmon. (d) He devoured lots of cheese. (e) He then won a dancing competition.

adapted from Asher and Vieu (2001)

(a)  top
       a. John had a great evening last night.

(b)  top
       a. John had a great evening last night.
         b. He had a great meal.

(c)  top
       a. John had a great evening last night.
         b. He had a great meal.
           c. He ate salmon.

Figure B.3: Contribution of elaborative relations to shape the right frontier of a discourse structure, enhancing the number of nodes available for attachment (boxed nodes).

(2) (a) John had a great evening last night. (b) He had a great meal. (c) He ate salmon. (d) He devoured lots of cheese. (e) He then won a dancing competition.

adapted from Asher and Vieu (2001)

(d)  top
       a. John had a great evening last night.
         b. He had a great meal.
           c. He ate salmon.    d. He devoured lots of cheese.

(e)  top
       a. John had a great evening last night.
         b. He had a great meal.    e. He then won a dancing competition.
           c. He ate salmon.    d. He devoured lots of cheese.

Figure B.4: Contribution of continuative relations to shape the right frontier of a discourse structure, not enhancing the number of nodes available for attachment (boxed nodes) or even reducing them.

(3) a. Roses are red, b. violets are blue, c. you love me d. and I love you.

  Roses are red    violets are blue    you love me    and I love you

Figure B.5: Default representation of discourse as a sequence of discourse units related by the default structural relation, continuation, which is established by mere linear precedence of the units. This representation is obtained when no information about the configuration of discourse is available.

if DU is a main clause,
    DU is attached to A; A is not at the right frontier.
else
    DU is attached to A; both A and DU are at the right frontier.

Figure B.6: Heuristics to determine the structural meaning (or shape) of the discursive relation between a discourse unit DU and the node A where it is attached in the preceding structure of discourse. These heuristics are based on the syntactical structure of the sentence.

if DU is a main clause,
    if DU is dominated by a discourse marker indicating continuation,
        DU is attached to A; A is not at the right frontier.
    else if DU is dominated by a discourse marker indicating elaboration,
        DU is attached to A; both A and DU are at the right frontier.
    else
        DU is attached to A; A is not at the right frontier.
else
    if DU is dominated by a discourse marker indicating continuation,
        DU is attached to A; both A and DU are at the right frontier.
    else if DU is dominated by a discourse marker indicating elaboration,
        if DU precedes its main clause,
            DU is attached to A; DU is not at the right frontier.
        else
            DU is attached to A; both DU and A are at the right frontier.
    else
        DU is attached to A; A is not at the right frontier.

Figure B.7: Heuristics to determine the structural meaning (or shape) of the discursive relation between a discourse unit DU and the node A where it is attached in the preceding structure of discourse. These heuristics incorporate information provided by discourse markers.

Bibliography

Alonso, Laura (2001). Aproximaci´oal Resum Autom`atic per Marcadors Discursius. Mas- ter’s thesis, Departament de Ling¨u´ıstica General, Universitat de Barcelona. Alonso, Laura, Daniel Alonso, Ezequiel Andujar´ and Robert Sola (2004a). Anotaci´on discursiva de corpus en espa˜nol. Tech. Rep. Grial report series 2004-2, Departament de Ling¨u´ıstica General, Universitat de Barcelona. Alonso, Laura, Bernardino Casas, Irene Castellon´ , Salvador Climent and Llu´ıs Padro´ (2003a). Carpanta eats words you don’t need from e-mail. In SEPLN, XIX Congreso Anual de la Sociedad Espa˜nola para el Procesamiento del Lenguaje Natural. Alonso, Laura, Bernardino Casas, Irene Castellon´ , Salvador Climent and Llu´ıs Padro´ (2003b). Combining heterogeneous knowledge sources in e-mail summa- rization. In Recent Advances in Natural Language Processing (RANLP 2003). Borovets, Bulgaria. Alonso, Laura, Bernardino Casas, Irene Castellon´ and Llu´ıs Padro´ (2004b). Knowledge-intensive automatic e-mail summarization in carpanta. In Proceedings of the 42nd meeting of the Association for Computational Linguistics. ACL’04. Demo session. Alonso, Laura and Irene Castellon´ (2001). Towards a delimitation of discursive seg- ment for natural language processing applications. In First International Workshop on Semantics, Pragmatics and Rhetoric. Donostia - San Sebasti`an. Alonso, Laura, Irene Castellon´ , Salvador Climent, Maria Fuentes, Llu´ıs Padro´ and Horacio Rodr´ıguez (2003c). Approaches to text summarization: Ques- tions and answers. In Revista Iberoamericana de Inteligencia Artificial, (20):pp. 34–52. Alonso, Laura, Irene Castellon´ , Jordi Escribano, Xavier Messeguer and Llu´ıs Padro´ (2004c). Discovering discourse patterns by multiple sequence alignment. In 4th International Conference on Language Resources and Evaluation (LREC 2004). Alonso, Laura, Irene Castellon´ , Karina Gibert and Llu´ıs Padro´ (2002a). An empirical approach to discourse markers by clustering. In CCIA, Congr´es Catal`a d’Intel·lig`encia Artificial. Springer-Verlag, Castell´o. Alonso, Laura, Irene Castellon´ , Llu´ıs Padro´ and Karina Gibert (2002b). Dis- course marker characterisation via clustering: extrapolation from supervised to unsu- pervised corpora. In SEPLN. Valladolid. Alonso, Laura and Maria Fuentes (2002). Collaborating discourse for text summari- sation. In Proceedings of the Seventh ESSLLI Student Session.


Alonso, Laura and Maria Fuentes (2003). Integrating cohesion and coherence for text summarization. In Proceedings of the EACL’03 Student Session, pp. 1–8. Budapest. Alonso, Laura, Maria Fuentes, Horacio Rodr´ıguez and Marc Massot (2004d). Re-using high-quality resources for continued evaluation of automated summarization systems. In 4th International Conference on Language Resources and Evaluation (LREC 2004). Alonso, Laura, Jennafer Shih, Irene Castellon´ and Llu´ıs Padro´ (2003d). An analytic account of discourse markers for shallow NLP. In Manfred Stede and Henk Zeevat, eds., The Meaning and Implementation of Discourse Particles, whorkshop at ESSLLI’03. Amigo,´ Enrique, Julio Gonzalo, Victor Peinado, Anselmo Pe nas and Felisa Verdejo (2004). An empirical study of information synthesis task. In Proceedings of the 42th Annual Meeting of the Association for Computational Linguistics. Arevalo,´ Montse, Xavi Carreras, Llu´ıs Marquez` , M.Antonia` Mart´ı, Llu´ıs Padro´ and M.Jose´ Simon´ (2002). A proposal for wide-coverage spanish named en- tity recognition. In Procesamiento del Lenguaje Natural, vol. 1(3). Asher, N. (1993). Reference to abstract objects in discourse. Kluwer Academic Publishers. Asher, Nicholas and Alex Lascarides (2003). Logics of Conversation. Cambridge University Press. Asher, Nicholas and Laure Vieu (2001). Subordinating and coordinating discourse re- lations. In Intl. Wkshp. on Semantic, Pragmatics and Rhetorics. San Sebasti´an. Atserias, Jordi, Josep Carmona, Sergi Cervell, Llu´ıs Marquez` , M. Antonia` Mart´ı, Llu´ıs Padro´, Roberto Placer, Horacio Rodr´ıguez, Mariona Taule´ and Jordi Turmo (1998a). An environment for morphosyntactic processing of un- restricted spanish text. In First International Conference on Language Resources and Evaluation (LREC’98). Granada, Spain. Atserias, Jordi, Irene Castellon´ and Montse Civit (1998b). Syntactic parsing of unrestricted spanish text. In First International Conference on Language Resources and Evaluation. LREC, Granada. Baldwin, Breck, Robert L. Donaway, Eduard H. Hovy, Elizabeth D. Liddy, Inderjeet Mani, Daniel Marcu, Kathleen R. McKeown, Vibhu O. Mittal, Marc Moens, Dragomir R. Radev, Karen Sparck Jones, Beth Sundheim, Simone Teufel, Ralph Weischedel and Michael White (2000). An evaluation road map for summarization research. TIDES. Ballard, D., R. Conrad and R. Longacre (1971). The deep and surface grammar of interclausal relations. In Foundations of Language, (4):pp. 70–118. Barzilay, Regina (1997). Lexical Chains for Summarization. Master’s thesis, Ben-Gurion University of the Negev. Barzilay, Regina, Noemie Elhadad and Kathy McKeown (2001). Sentence ordering in multidocument summarization. In HLT’01. Bibliography 213

Barzilay, Regina, Kathy McKeown and Michel Elhadad (1999). Information fusion in the context of multi-document summarization. In Proceedings of ACL 1999. Benitez, A. B. and S.-F. Chang (2002). Multimedia knowledge integration, summariza- tion and evaluation. In Proceedings of the 2002 International Workshop On Multimedia Data Mining in conjuction with the International Conference on Knowledge Discovery and Data Mining (MDM/KDD-2002). Edmonton, Alberta. Boguraev, Branimir K., Rachel Bellamy and Calvin Swart (2001). Summarisation miniaturisation: Delivery of news to hand-helds. In NAACL’01. Boguraev, Branimir K. and Mary S. Neff (2000). Lexical cohesion, discourse seg- mentation and document summarization. In RIAO. Paris. Border´ıa, Salvador Pons (1998). Conexi´on y Conectores: Estudio de su relaci´on en el registro informal de la lengua. Cuadernos de Filolog´ıa, Universitat de Val`encia. Brunn, Meru, Yllias Chali and Christopher J. Pinchak (2001). Text Summariza- tion using lexical chains. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001. New Orleans, Louisiana. Buyukkokten, Orkut, Hector Garcia-Molina and Andreas Paepcke (2001). Text summarization of web pages on handheld devices. In NAACL’01. Carletta, Jean (1996). Assessing agreement on classification tasks: the kappa statistic. In Computational Linguistics, vol. 22(2):pp. 249–254. Carletta, Jean, Amy Isard, Stephen Isard, Jacqueline Kowtko, Gwnyeth Doherty-Sneddon and Anne Anderson (1996). Hcrc dialogue structure coding man- ual. Tech. Rep. HCRC/TR-82, HCRC. Carlson, Lynn and Daniel Marcu (2001). Discourse tagging manual. Tech. Rep. ISI- TR-545. Carlson, Lynn, Daniel Marcu and Mary Ellen Okurowski (2001). Building a discourse-tagged corpus in the framework of rhetorical structure theory. In 2nd SIG- DIAL Workshop on Discourse and Dialogue, Eurospeech 2001. SIGDIAL. Carlson, Lynn, Daniel Marcu and Mary Ellen Okurowski (2003). Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Jan van Kuppevelt and Ronnie Smith, eds., Current Directions in Discourse and Dialogue. Kluwer Academic Publishers. Chen, Hsin-Hsi (2002). Multilingual summarization and question answering. In Workshop on Multilingual Summarization and Question Answering (COLING’2002). Chen, Hsin-Hsi, June-Jei Kuo and Tsei-Chun Su (2003). Clustering and visualization in a multi-lingual multi-document summarization system. In Proceedings of the 25th European Conference on IR Research, pp. 266–280. Climent, Salvador, P. Gispert-Sauch¨ , J. More´, A. Oliver, M. Salvatierra, I. Sanchez` , M. Taule´ and Ll. Vallmanya (2003). Machine translation of news- groups at the uoc. evaluation and settings for language control. In Journal of Computer- Mediated Communication. In press. 214 Bibliography

Cohen, J. (1960). A coefficient of agreement for nominal scales. In Educational & Psyco- logical Measure, vol. 20:pp. 37–46. Cooper, Robin, Staffan Larsson, Colin Matheson, Massimo Poesio and David Traum (1999). Coding instructional dialogue for information states. Tech. Rep. D1.1, The TRINDI Consortium. Core, Mark G. and James F. Allen (1997). Coding dialogs with the damsl annotation scheme. In Working Notes: AAAI Fall Symposium on Communicative Action in Humans and Machines. Corston-Oliver, Simon H. (1998). Computing representations of the structure of writ- ten discourse. Ph.D. thesis, University of California, Santa Barbara. Corston-Oliver, Simon H. (2001). Text compaction for display on very small screens. In NAACL’01. Corston-Oliver, Simon H. and W. Dolan (1999). Less is more: Eliminating index terms from subordinate clauses. In 37th Annual Meeting of the Association for Compu- tational Linguistics (ACL’99), pp. 348 – 356. Cresswell, Cassandre, Katherine Forbes, Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi and Bonnie Webber (2002). The discourse anaphoric properties of connectives. In 4th Discourse Anaphora and Anaphor Resolution Colloquium. Croft, W. A. (2001). Radical Construction Grammar. Syntactic Theory in Typological Perspective. Oxford University Press. Daume´ III, H., Abdessamad Echihabi, Daniel Marcu, D.S. Munteanu and Radu Soricut (2002). GLEANS: A generator of logical extracts and abstracts for nice sum- maries. In Workshop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization). Philadelphia. Di Eugenio, Barbara, Johanna D. Moore and Massimo Paolucci (1997). Learn- ing features that predict cue usage. In ACL-EACL97, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pp. 80–87. Madrid, Spain. DiEugenio, B. and M. Glass (2004). The kappa statistic: A second look. In Computa- tional Linguistics, vol. 30(1):pp. 95–101. Discourse Resource Initiative (1997). Standards for dialogue coding in natural lan- guage processing. Tech. Rep. 167, Dagstuhl-Seminar. DUC (2005). http://duc.nist.gov. Edmunson, H. P. (1969). New methods in automatic extracting. In Journal of the Associ- ation for Computing Machinery, vol. 16(2):pp. 264 – 285. Elhadad, Michael and Kathleen R. McKeown (1988). What do you need to pro- duce a ‘but’? Tech. Rep. CUCS-334-88, Department of Computer Science, Columbia University, New York. Elhadad, Michael and Kathleen R. McKeown (1990). Generating connectives. In COLING 1990, vol. 3, pp. 97–102. Bibliography 215

Elhadad, Noemie and Kathleen R. McKeown (2001). Towards generating patient specific summaries of medical articles. In NAACL’01 Automatic Summarization Work- shop. Endres-Niggemeyer, Brigitte (1998). Summarizing information. Springer. Endres-Niggemeyer, Brigitte, Jerry Hobbs and Karen Sparck-Jones (1993). Summarizing Text for Intelligent Communication. Schloss Dagstuhl, Wadern, Germany. Dagstuhl Seminar Report IBFI GmbH. Engehard, C. and L. Pantera (1994). Automatic natural acquisition of a terminology. In Journal of Quantitative Linguistics, vol. 2(1):pp. 27–32. Fais, L. and K. Ogura (2001). Discourse issues in the translation of japanese e-mail. In Proceedings of the Pacific Association for Computational Linguistics, PACLING 2001. Ferrara, K., H. Brunner and G. Whittemore (1990). Interactive written discourse as an emergent register. In Written Communication, vol. 8:pp. 8–34. Forbes, K., C. Creswell, E. Miltsakaki, R. Prasad, Aravind K. Joshi and Bonnie Lynn Webber (2002). The discourse anaphoric properties of connectives. In DAARC. Forbes, K., E. Miltsakaki, R. Prasad, A. Sarkar, Aravind K. Joshi and Bon- nie Lynn Webber (2003). D-LTAG system - discourse parsing with a lexicalized tree- adjoining grammar. In Journal of Language, Logic and Information. FreeLing (). http://www.lsi.upc.es/~nlp/freeling/. Fuentes, Maria, Marc Massot, Horacio Rodr´ıguez and Laura Alonso (2003). Headline extraction combining statistic and symbolic techniques. In DUC03. Association for Computational Linguistics, Edmonton, Alberta, Canada. Fuentes, Maria and Horacio Rodr´ıguez (2002). Using cohesive properties of text for automatic summarization. In JOTRI’02. Givon,´ Talmy, ed. (1983). Topic continuity in discourse: a quantitative cross-language study. John Benjamins. Gladwin, Philip, Stephen Pulman and Karen Sparck-Jones (1991). Shallow Pro- cessing and Automatic Summarizing: A First Study. Tech. Rep. Technical Report No. 223, University of Cambridge Computer Laboratory. Goldstein, Jade, Vibhu O. Mittal, Mark Kantrowitz and Jaime G. Carbonell (1999). Summarizing text documents: Sentence selection and evaluation metrics. In SIGIR-99. Grice, H. Paul (1969). Utterer’s meaning and intentions. In Philosophical Review, vol. 68(2):pp. 147–177. Grimes, Joseph E. (1975). The Thread of Discourse. In Jangua Linguarum, Series Minor, (207). Grosz, Barbara J. and Candace L. Sidner (1986). Attention, Intention, and the Struc- ture of Discourse. In Computational Linguistics, vol. 12(3):pp. 175–204. Hahn, Udo (1990). Topic Parsing: Accounting for Text Macro Structures in Full-Text Analysis. In Information Processing & Management, vol. 26(1):pp. 135–170. 216 Bibliography

Hahn, Udo and Inderjeet Mani (2000). The cahllenges of automatic summarization. In IEEE Computer, vol. 33(11):pp. 29–36. Halliday, M. A. K. and Ruqaiya Hasan (1976). Cohesion in English. English Language Series. Longman Group Ltd. Harabagiu, S.M. and Finley Lacatusu (2002). Generating single and multi-document summaries with GISTEXTER. In Workshop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization). Philadelphia. Haspelmath, Martin (2003). The geometry of grammatical meaning: semantic maps and cross-linguistic comparison. In Michael Tomasello, ed., The new psychology of language, vol. II, pp. 211–243. Lawrence Erlbaum, New York. Hatzivassiloglou, Vassileios, Judith L. Klavans and Eleazar Eskin (1999). De- tecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In EMNLP/VLC’99. Maryland. Hatzivassiloglou, Vassileios, Judith L. Klavans, M. Holcombe, Regina Barzi- lay, M.Y. Kan and K.R. McKeown (2001). Simfinder: A flexible clustering tool for summarization. In NAACL’01 Automatic Summarization Workshop. Hauptmann, A. G. and M. J. Witbrock (1997). Informedia: News-on-demand multi- media information acquisition and retrieval. In Mark T. Maybury, ed., Intelligent Multimedia Information Retrieval, pp. 215–239. AAAI/MIT Press. Hearst, Marti A. (1994). Multi-paragraph segmentation of expository text. In 32nd An- nual Meeting of Association for Computational Linguistics. Herring, S. (1999). Interactional coherence in cmc. In Journal of Computer-Mediated Comunication, vol. 4(4). Special issue on Persistent Conversation. Hirschberg, Julia and Barbara Grosz (). Intonational features of local and global discourse structure. In Speech and Natural Language Workshop. Morgan Kaufmann. Hirschberg, Julia and Diane J. Litman (1993). Empirical studies on the disambigua- tion of cue phrases. In Computational Linguistics, vol. 19(3):pp. 501–529. Hobbs, J. R. (1978). Why is discourse coherent. Tech. Rep. technical note 176, SRI Inter- national, Artificial Intelligence Center. Hobbs, Jerry R. (1985). On the coherence and structure of discourse. Tech. Rep. CSLI– 85–37, Center for the Study of Language and Information, Stanford University, Calif., USA. Hovy, Eduard H. (1988). Planning coherent multisentential text. In ACL 1988, pp. 163– 169. Hovy, Eduard H. (2001). Handbook of Computational Linguistics, chap. 28: Text Sum- marization. Oxford University Press. Hovy, Eduard H. and Chin-Yew Lin (1999). Automated Text Summarization in SUM- MARIST. In Mani and Maybury, eds., Advances in Automatic Text Summarization. Hovy, Eduard H. and Elisabeth A. Maier (1995). Parsimonious or profligate: How many and which discourse structure relations? Unpublished ms. Bibliography 217

Hovy, Eduard H. and Daniel Marcu (1998). Automated Text Summarization. COLING-ACL. Tutorial. Hutchinson, Ben (2004). Acquiring the meaning of discourse markers. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), pp. 685–692. Jing, Hongyan (2001). Cut-and-Paste Text Summarization. Ph.D. thesis, Graduate School of Arts and Sciences, Columbia University. Jing, Hongyan and Kathleen R. McKeown (2000). Cut and paste based text sum- marization. In 1st Conference of the North American Chapter of the Association for Computational Linguistics. Kageura, Kyo and Bin Umino (1996). Methods of automatic term recognition: A review. In Terminolgy, vol. 3(2):pp. 259–289. Kan, Min-Yen (2003). Automatic text summarization as applied to information retrieval: Using indicative and informative summaries. Ph.D. thesis, Columbia University. Kan, Min-Yen, Judith L. Klavans and Kathleen R. McKeown (2001). Domain- specific informative and indicative summarization for information retrieval. In Work- shop on Text Summarization in conjunction with the ACM SIGIR Conference 2001. New Orleans. Kan, Min-Yen and Kathleen R. McKeown (1999). Information extraction and sum- marization: Domain independence through focus types. Tech. rep., Computer Science Department, Columbia University, New York. Kehler, Andrew (2002). Coherence, Reference, and the Theory of Grammar. CSLI Pub- lications. Kintsch, Walter (1988). The role of knowledge in discourse comprehension: A construction-integration model. In Psychological Review, vol. 2(95):pp. 163–182. Kintsch, Walter and Teun A. van Dijk (1983). Strategies of Discourse Comprehension. Academic Press. Knight, Kevin and Daniel Marcu (2000). Statistics-based summarization - step one: Sentence compression. In The 17th National Conference of the American Association for Artificial Intelligence AAAI’2000. Austin, Texas. Knott, Alastair and Robert Dale (1996). Choosing a Set of Coherence Relations for Text Generation: a Data-Driven Approach. In . Knott, Alistair (1996). A Data-Driven Methodology for Motivating a Set of Coherence Relations. Ph.D. thesis, University of Edinburgh, Edinburgh. Knott, Alistair, Jon Oberlander, Mick O’Donnell and Chris Mellish (2001). Beyond elaboration: The interaction of relations and focus in coherent text. In Ted J. M. Sanders, Joost Schilperoord and Wilbert P. M. Spooren, eds., Text representation: linguistic and psycholinguistic aspects, pp. 181–196. Benjamins. Kortmann, B. (1997). Adverbial subordination: a typology and history of adverbial sub- ordination based on European languages. Berlin. 218 Bibliography

Kozima, Hideki (1993). based on similarity between words. In Proceed- ings of the 31th Annual Meeting of the Association for Computational Linguistics, pp. 286–288. Kraaij, Wessel, Martin Spitters and Anette Hulth (2002). Headline extraction based on a combination of uni- and multidocument summarization techniques. In Work- shop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization). Philadelphia. Krippendorf, Klaus (1980). Content analysis: an introduction. Sage, Beverly Hills, California. Kupiec, Julian, Jan O. Pedersen and Francine Chen (1995). A trainable document summarizer. In Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 68–73. ACM Press. Lagerwerf, Luuk (1998). Causal Connectives Have Presuppositions; Effects on Coher- ence and Discourse Structure. Den Haag, Holland Academic Graphics. Lal, P. and S. Rueger (2002). Extract-based summarization with simplification. In Work- shop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization). Philadelphia. Landis, J. Richard and Gary G. Koch (1977). The measurement of observer agreement for categorical data. In Biometrics, vol. 33:pp. 159–174. Langacker, Ronald W. (1987). Foundations of Cognitive Grammar, vol. I: Theoretical Prerequisites. University Press, Stanford. Lascarides, Alex and Nicholas Asher (1993). Temporal interpretation, discourse re- lations and commonsense entailment. In Linguistics and Philosophy, vol. 16(5):pp. 437– 493. Lasswell, Steven (1996). The alternation of verbal mood in Yup’ik Eskimo narrative. In Marianne Mithun, ed., Prosody, grammar and discourse in Central Alaskan Yup’ik, pp. 64–97. University of California, Department of Linguistics, Santa Barbara. Laura Alonso i Alemany, Ezequiel Andujar´ Hinojosa and Robert Sola Sal- vatierra (2004). A framework for feature-based description of low level discourse in discourse annotation. In Bonnie Webber and Donna Byron, eds., Workshop on Discourse Annotation. ACL’04. Lehmam, Abderrafih and Philippe Bouvet (2001). Evaluation,´ rectification et perti- nence du r´esum´eautomatique de texte pour une utilisation en r´eseau. In S. Chaudiron and C. Fluhr, eds., III Colloque d’ISKO-France: Filtrage et r´esum´eautomatique de l’information sur les r´eseaux. Leuski, Anton, Chin-Yew Lin and Eduard H. Hovy (2003). iNeATS: Interactive multi-document summarization. In ACL’03. Lin, Chin-Yew (1998). Assembly of Topic Extraction Modules in SUMMARIST. In Ed- uard H. Hovy and Dragomir R. Radev, eds., Proceedings of the AAAI Symposium on Intelligent Text Summarization, pp. 34–43. The AAAI Press, Stanford, California, USA. Bibliography 219

Lin, Chin-Yew (2001). Summary Evaluation Environment. Http://www.isi.edu/˜cyl/SEE. Lin, Chin-Yew (2004). http://www.isi.edu/˜ cyl/ROUGE/. Lin, Chin-Yew and Eduard H.Hovy (1997). Identifying topics by position. In Proceed- ings of the 5th Conference on Applied Natural Language Processing (ANLP97). Lin, Chin-Yew and Eduard Hovy (2002a). Manual and Automatic Evaluation of Sum- maries. In Proceedings of the Workshop on Multi-Document Summarization Evaluation of the 2nd Document Understanding Conference at the 4Oth Meeting of the Association for Computational Linguistics. Philadelphia, PA. Lin, Chin-Yew and Eduard Hovy (2002b). NeATS in DUC 2002. In Proceedings of the Workshop on Multi-Document Summarization Evaluation of the 2nd Document Understanding Conference at the 4Oth Meeting of the Association for Computational Linguistics. Philadelphia, PA. Lin, Chin-Yew and Eduard Hovy (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Marti A. Hearst and Mari Ostendorf, eds., HLT-NAACL 2003: Main Proceedings, pp. 150–157. Association for Computational Linguistics, Edmonton, Alberta, Canada. Lin, Chin-Yew and Eduard H. Hovy (2002c). NeATS in DUC 2002. In Work- shop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization). Philadelphia. Litman, Diane J. (1996). Cue phrase classification using machine learning. In Journal of Artificial Intelligence Research, vol. 5:pp. 53–94. Longacre, Robert E. (1983). The Grammar of Discourse: Notional and Surface Struc- tures. Plenum Press, New York. Luhn, H. P. (1958a). The automatic creation of literature abstracts. In IBM Journal of research and development, vol. 2(2):pp. 159 – 165. Luhn, H. P. (1958b). The Automatic Creation of Literature Abstracts. In IBM Journal of Research Development, vol. 2(2):pp. 159–165. Malchukov, Andrej (2004). Towards a semantic typology of adversative and contrast marking. In Journal of Semantics, vol. 21(2):pp. 177–198. Mani, Inderjeet (2001). Automatic Summarization. Nautral Language Processing. John Benjamins Publishing Company. Mani, Inderjeet and Eric Bloedorn (1999). Summarizing similarities and differences among related documents. In Information Retrieval, vol. 1(1-2):pp. 35–67. Mani, Inderjeet and Mark T. Maybury, eds. (1999). Advances in automatic text summarization. MIT Press. Mann, William C. and Sandra A. Thompson (1988). Rhetorical structure theory: To- ward a functional theory of text organisation. In Text, vol. 3(8):pp. 234–281. Marcu, Daniel (1997a). From discourse structures to text summaries. In Mani and May- bury, eds., Advances in Automatic Text Summarization, pp. 82 – 88. 220 Bibliography

Marcu, Daniel (1997b). The Rhetorical Parsing, Summarization and Generation of Nat- ural Language Texts. Ph.D. thesis, Department of Computer Science, University of Toronto, Toronto, Canada. Marcu, Daniel (2000). The rhetorical parsing of unrestricted texts: A surface-based ap- proach. In Computational Linguistics, vol. 26(3):pp. 395–448. Marcu, Daniel, Hal Daume´, Abdessamad Echihabi, Dragos Stefan Munteanu and Radu Soricut (2002). GLEANS: A Generator of Logical Extracts and Abstracts for Nice Summaries. In Proceedings of the Workshop on Multi-Document Summariza- tion Evaluation of the 2nd Document Understanding Conference at the 4Oth Meeting of the Association for Computational Linguistics. Philadelphia, PA. Martin, J. (1992). English Text: System and Structure. Benjamin, Amsterdam. Mart´ın Zorraquino, M. Antonia and Jose´ Portoles´ (1999). Los marcadores del discurso. In Ignacio Bosque and Violeta Demonte, eds., Gram´atica Descriptiva de la Lengua Espa˜nola, vol. III, pp. 4051–4213. Espasa Calpe, Madrid. Maybury, Mark T. and Inderjeet Mani (2001). Automatic summarization. ACL/EACL’01. Tutorial. Maybury, Mark T. and Andrew E. Merlino (1997). Multimedia summaries of broad- cast news. In International Conference on Intelligent Information Systems. McKeown, Kathleen R., David Kirk Evans, A. Nenkova, Regina Barzilay, Vas- sileios Hatzivassiloglou, Barry Schiffman, Sasha Blair-Goldensohn, Ju- dith L. Klavans and S. Sigelman (2002). The columbia multi-document summarizer for DUC 2002. In Workshop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summa- rization). Philadelphia. McKeown, Kathleen R., Judith L. Klavans, Vassileios Hatzivassiloglou, Regina Barzilay and Eleazar Eskin (1999). Towards multidocument summariza- tion by reformulation: Progress and prospects. In AAAI 99. McKeown, Kathleen R. and Dragomir R. Radev (1995). Generating summaries of multiple news articles. In ACM Conference on Research and Development in Information Retrieval SIGIR’95. Seattle, WA. MEAD (). Meadeval. http://perun.si.umich.edu/clair/meadeval/. Merchant, Jason (2003). Fragments and ellipsis. In Small structures: sentential and nonsentential analyses. Wayne University. Fall 2003 Colloquium Series. Miller, G. A., R. Beckwith, Christiane Fellbaum, D. Gross, K. Miller and R. Tengi (1991). Five papers on . In Special Issue of the International Journal of Lexicography, vol. 3(4):pp. 235–312. Mimouni, Naila K. (2003). Segmentation for rhetorical representation purposes. In RANLP’03. Minipar (). www.cs.ualberta.ca/~lindek/minipar.htm. Moore, Johanna D. and Martha E. Pollack (1992). A problem for RST: the need for multi-level discourse analysis. In Computational Linguistics, vol. 18(4):pp. 537–544. Bibliography 221

Morris, Jane and Graeme Hirst (1991). Lexical cohesion, the thesaurus, and the struc- ture of text. In Computational linguistics, vol. 17(1):pp. 21–48. Moser, M. G., Johanna D. Moore and E. Glendening (1996). Instructions for coding explanations: Identifying segments, relations and minimal units. Tech. Rep. Technical Report 96-17, University of Pittsburgh, Department of Computer Science. Muresan, S., E. Tzoukermann and Judith L. Klavans (2001). Combining linguistic and machine learning techniques for email summarization. In ACL-EACL’01 CoNLL Workshop. Murray, D. E. (2000). Protean communication: the language of computer-mediated com- munication. In Tesol Quarterly, vol. 34(3):pp. 397–421. Nenkova, Ani and Rebecca Passonneau (2004). Evaluating content selection in sum- marization: the pyramid method. In NAACL-HLT 2004. Oates, Sarah Louise (2001). A listing of discourse markers. Tech. Rep. ITRI-01-26, Information Technology Research Institute, University of Brighton. Ono, Kenji, Kazuo Sumita and Seiji Miike (1994a). Abstract generation based on rhetorical structure extraction. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pp. 344 – 348. Kyoto, Japan. Ono, Kenji, Kazuo Sumita and Seiji Miike (1994b). Abstract generation based on rhetorical structure extraction. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94), pp. 344 – 348. Kyoto, Japan. Paice, Chris D. (1990). Constructing literature abstracts by computer. In Information Processing & Management, vol. 26(1):pp. 171 – 186. Palomar, M., A. Ferrandez´ , L. Moreno, P. Mart´ınez-Barco, J. Peral, M. Saiz- Noeda and R. Munoz˜ (2001). An algorithm for anaphora resolution in spanish texts. In Computational Linguistics, vol. 27(4). Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu (2001). BLEU: A Method for Automatic Evaluation of Machine Translation. Research Report RC22176, IBM. Pardo, T.A.S., L.H.M. Rino and M.G.V. Nunes (2003). GistSumm: A summarization tool based on a new extractive method. In N.J. Mamede, J. Baptista, I. Tran- coso and M.G.V. Nunes, eds., 6th Workshop on Computational Processing of the Portuguese Language - Written and Spoken, no. 2721 in Lecture Notes in Artificial Intelligence, pp. 210–218. Springer-Verlag. Passonneau, Rebecca J. and Diane J. Litman (1997a). Discourse segmentation by human and automated means. In Computational Linguistics, vol. 23(1):pp. 103 – 140. Passonneau, Rebecca J. and Diane J. Litman (1997b). Empirical analysis of three dimensions of spoken discourse: Segmentation, coherence, and linguistic devices. In Eduard Hovy and Donia Scott, eds., Interdisciplinary Perspectives on Discourse. Springer Verlag. Polanyi, Livia (1988). A formal model of the structure of discourse. In Journal of Prag- matics, vol. 12:pp. 601–638. 222 Bibliography

Polanyi, Livia (1996). The lingustic structure of discourse. Tech. Rep. 96-200, CSLI. Polanyi, Livia, Chris Culy, Martin van den Berg, Gian Lorenzo Thione and David Ahn (2004). A rule based approach to discourse parsing. In SIGDIAL 2004. Portoles,´ Jose´ (1998). Marcadores del discurso. Ariel. Potsdam Corpus (2004). The Potsdam commentary corpus. http://www.ling. uni-potsdam.de/cl/cl/res/forsch_pcc.en.html. Quinlan, John R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Mateo, CA. Radev, Dragomir R. (2000). Text Summarization. ACM SIGIR. Tutorial. Radev, Dragomir R., Weiguo Fan and Zhu Zhang (2001). Webinessence: A personal- ized web-based multi-document summarization and recommendation system. In NAACL Workshop on Automatic Summarization. Pittsburgh. Radev, Dragomir R., Jahna Otterbacher, Hong Qi and Daniel Tam (2003). MEAD ReDUCs: Michigan at DUC 2003. In DUC03. Association for Computational Linguistics, Edmonton, Alberta, Canada. Reitter, David (2003a). Rhetorical analysis with rich-feature support vector models. Master’s thesis, University of Potsdam. Reitter, David (2003b). Simple signals for complex rhetorics: On rhetorical analysis with rich-feature support vector models. In Uta Seewald-Heeg, ed., Sprachtechnologie f¨ur die multilinguale Kommunikation. St. Augustin, Germany: Gardez. URL http: //www.reitter-it-media.de/compling/papers/reitter_comple%x-rst_2003.pdf. Salton, Gerard, Amit Singhal, Mandar Mitra and Chris Buckley (1997). Auto- matic text structuring and summarization. In Information Processing and Management, vol. 33(3):pp. 193 – 207. Sanders, Ted J. M. and Wilbert P. M. Spooren (2001). Text representation as an interface between language and its users. In T. Sanders, J. Schilperoord and W. Spooren, eds., Text representation: Linguistic and psycholinguistic aspects. Ben- jamins, Amsterdam. Sanders, Ted J. M., Wilbert P. M. Spooren and Leo G. M. Noordman (1992). Toward a taxonomy of coherence relations. In Discourse Processes, vol. 15:pp. 1–35. Santos R´ıo, Luis (2003). Diccionario de Part´ıculas. Luso-Espa˜nola de Ediciones. Schank, Roger C. and Robert Abelson (1977). Scripts, Plans, Goals, and Under- standing. Lawrence Erlbaum, Hillsdale, NJ. Schauer, Holger and Udo Hahn (2000). Phrases as carriers of coherence relations. In Lila R. Gleitman and Aravind K. Joshi, eds., Proceedings of the 22nd An- nual Meeting of the Cognitive Science Society, pp. 429–434. Cognitive Science Soci- ety, Lawrence Erlenbaum Associates, Mahwah, New Jersey, Philadelphia, Pennsylvania, USA. Schiffman, Barry, Inderjeet Mani and Kristian J. Concepcion (2001). Produc- ing biographical summaries: Combining linguistic knowledge with corpus statistics. In EACL’01. Bibliography 223

Schiffrin, Deborah (1987). Discourse Markers. Cambridge University Press. Schilder, Frank (2002). Robust discourse parsing via discourse markers, topicality and position. In Natural Language Engineering, vol. 8(2&3). Special issue on robust methods in analysis of natural language data. Scott, Donia R. and Clarisse Sieckenius de Souza (1990). Getting the message across in RST-based text generation. In Robert Dale, Chris Mellish and Michael Zock, eds., Current Research in Natural Language Generation, pp. 47–73. Academic Press, New York. Soricut, Radu and Daniel Marcu (2003). Sentence level discourse parsing using syn- tactic and lexical information. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL). Edmonton, Canada. Sparck-Jones, Karen (1993). What Might Be In A Summary. Information Retrieval 93: Von der Modellierung zur Anwendung, 9–26. Sparck-Jones, Karen (1997). Summarising: Where are we now? where should we go? In Inderjeet Mani and Mark T. Maybury, eds., Proceedings of the Workshop on Intelligent Scalable Text Summarization at the 35th Meeting of the Association for Computational Linguistics, and the 8th Conference of the European Chapter of the Assocation for Computational Linguistics. Madrid, Spain. Sparck-Jones, Karen (1999a). Automatic summarising: factors and directions. In In- derjeet Mani and Mark Maybury, eds., Advances in Automatic Text Summariza- tion. MIT Press. Sparck-Jones, Karen (1999b). Automatic Summarizing: Factors and Directions. In In- derjeet Mani and Mark T. Maybury, eds., Advances in Automatic Text Summa- rization, pp. 1–13. The MIT Press. Sparck-Jones, Karen (2001). Factorial Summary Evaluation. In Proceedings of the 1st Document Understanding Conference. New Orleans, LA. Sparck-Jones,¨ Karen (2001). Factorial summary evaluation. In Workshop on Text Summarization in conjunction with the ACM SIGIR Conference 2001. New Orleans, Louisiana. Sparck-Jones,¨ Karen (2004). Language and information processing: numbers that count. evening lecture, ESSLLI 2004. Stede, Manfred and C. Umbach (1998). DiMLex: A lexicon of discourse markers for text generation and understanding. In Proceedings of COLING-ACL ’98. Montreal. Sundaram, H. (2002). Segmentation, Structure Detection and Summarization of Multime- dia Sequences. Ph.D. thesis, Graduate School of Arts and Sciences, Columbia University. Sweetser, E. (1990). From Etymology to Pragmatics. Metaphorical and Cultural Aspects of Semantic Structure. Cambridge University Press. SweSum (2002). http://www.nada.kth.se/~xmartin/swesum/index-eng.html. 224 Bibliography

Teufel, Simone and Marc Moens (1998). and rhetorical classifica- tion for flexible abstracts. In AAAI Spring Symposium on Intelligent Text Summarisa- tion, pp. 16 – 25. Teufel, Simone and Marc Moens (2002a). Summarising Scientific Articles - Experi- ments with Relevance and Rhetorical Status. In Computational Linguistics, vol. 28(4). Teufel, Simone and Marc Moens (2002b). Summarizing scientific articles – experiments with relevance and rhetorical status. In Computational Linguistics, vol. 28(4). Special Issue on Automatic Summarization. The R Project for Statistical Computing (2004). R version 2.0.1. http://www. r-project.org/. Tucker, Richard (1999). Automatic Summarising and the CLASP system. Ph.D. thesis, University of Cambridge. Tzoukermann, E., S. Muresan and Judith L. Klavans (2001). Gist-it: Summarizing email using linguistic knowledge and machine learning. In ACL-EACL’01 HLT/KM Workshop. Umbach, Carla (2004). Contrast and information structure: A focus-based analysis of ’but’. In Journal of Semantics, vol. 21(2):pp. 155–175. van Halteren, Hans and Simone Teufel (2003). Examining the consensus between hu- man summaries: initial experiments with factoid analysis. In Dragomir R. Radev and Simone Teufel, eds., HLT-NAACL 2003 Workshop: Text Summarization (DUC03), pp. 57–64. Association for Computational Linguistics, Edmonton, Alberta, Canada. van Halteren, Hans and Simone Teufel (2004). Evaluating information content by factoid analysis: human annotation and stability. In EMNLP’04. Verhagen, Arie (2001). Subordination and discourse segmentation revisited, or: Why matrix clauses may be more dependent than complements. In Ted J. M. Sanders, Joost Schilperoord and Wilbert P. M. Spooren, eds., Text Representation. Linguistic and psychological aspects, pp. 337–357. John Benjamins. Vivaldi, Jorge (2001). Extracci´on de candidatos a t´ermino mediante combinaci´on de estrategias heterog ´eneas. Ph.D. thesis, Departament de Llenguatges i Sistemes In- form`atics, Universitat Polit`ecnica de Catalunya. Vossen, Piek, ed. (1998). Euro WordNet: a multilingual database with lexical semantic networks. Kluwer Academic Publishers. Wactlar, H. (2001). Multi-document summarization and visualization in the informedia digital video library. Webber, Bonnie Lynn (1988). Discourse deixis: Reference to discourse segments. In Proceedings of the 26th Annual Meeting of the ACL, pp. 113–122. State University of New York at Buffalo. Webber, Bonnie Lynn, Matthew Stone, Aravind K. Joshi and Alistair Knott (2003). Anaphora and discourse structure. In Computational Linguistics. Witbrock, Michael and Vibhu O. Mittal (1999). Ultra-summarization: A statistical approach to generating highly condensed nonextractive summaries. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR-99). Yaari, Yaakov (1997). Segmentation of Expository Texts by Hierarchical Agglomerative Clustering. Technical report also available as cmp-lg/9709015, Bar-Ilan University, Is- rael. Yates, J.A. and W.J. Orlikowski (1993). Knee-jerk anti-loopism and other e-mail phe- nomena: Oral, written, and electronic patterns in computer-mediated communication. Working Paper 3578-93, MIT Sloan School. Center for Coordination Science Technical Report 150. Zajic, David, B. Door and Richard Schwartz (2002). Automatic headline generation for newspaper stories. In Workshop on Text Summarization (In Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization). 
Philadelphia. Zechner, Klaus (1997). A literature survey on information extraction and Text Summa- rization. term paper, Carnegie Mellon University. Zechner, Klaus (2001). Automatic Summarisation of Spoken Dialogues in Unrestricted Domains. Ph.D. thesis, Carnegie Mellon University.