
Signaling of Discourse Relations: Anchoring Discourse Signals across Genres

A Thesis submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree of Master of Science in Linguistics

By Yang Liu, B.A.

Washington, DC
April 1, 2019

Copyright © 2019 by Yang Liu
All Rights Reserved

Thesis Advisor: Amir Zeldes, Ph.D.

Abstract

Discourse Relations, also known as coherence or rhetorical relations, characterize the semantic or pragmatic relationships between clauses or sentences in discourse. Such relations are established in order to facilitate effective communication. In addition to the inventory of relations, previous research has also investigated how discourse relations are established or signaled. Discourse markers (DMs) are considered to be the most typical signals in discourse; however, focusing merely on DMs is inadequate as they can only account for a small number of relations in discourse. Thus, researchers have been exploring textual signals beyond DMs such as the Penn Discourse Treebank 2.0 (PDTB, Prasad et al. [22]) and the Rhetorical Structure Theory Signalling Corpus (RST-SC, Das and Taboada [5]). Despite their different theoretical groundings and approaches to relation signaling, both corpora annotated the Wall Street Journal (WSJ) section of the Penn Treebank (PTB, Marcus et al. [19]), i.e. the news articles. Nevertheless, previous work has suggested that signaling information is indicative of genres (e.g. Taboada and Lavid [28]; Zeldes [34]). Therefore, this project aims to anchor signaling devices on a more diverse corpus to demonstrate the inadequacy of signaling by DMs only, the abundance of open-class signals, and more importantly, the distribution of signaling devices across genres.
Index words: discourse relations, relation signaling, signaling devices, genre-specific signals, corpus linguistics, linguistic annotation

Dedication

TO MOM & DAD

I would like to dedicate this thesis to my beloved parents for nursing me with love, encouragement, and unconditional support. Thank you for always having faith in me.

Acknowledgments

I would like to express my great gratitude to Dr. Amir Zeldes for guiding me through the fascinating world of discourse relations signaling, inspiring me all through my research, and providing me the opportunity to work with him. Without his insightful guidance and tremendous help, I would not have been able to accomplish this thesis.

I would also like to thank my faculty advisor Dr. Nathan Schneider for his invaluable and insightful feedback on my research proposal, which helped me navigate and comb through my ideas on discourse-related research.

Special thanks go to Dr. Maite Taboada and Dr. Debopam Das, who patiently answered my questions regarding their research and encouraged me to follow up with this line of research.

I would also like to thank Luke D. Gessler, a first-year Ph.D. student in computational linguistics at Georgetown University, for his generous help with the annotation interface. Without his prompt assistance, I would not have been able to process my data conveniently and efficiently.

Last but not least, I would also like to thank the members of the Department of Linguistics, Georgetown University, including Erin Esch Pereira and Benjamin Croner, for their coordination through the process of this thesis.

To my parents, I owe a debt of gratitude. I would like to thank my parents for their unconditional love and support and having always been there for me.

Table of Contents

Chapter
1 Introduction
    1.1 Motivation
    1.2 Methodology
    1.3 Organization of the Thesis
2 Background
    2.1 Discourse Relations
    2.2 Relation Signaling
    2.3 Rhetorical Structure Theory (RST)
    2.4 The RST Signalling Corpus (RST-SC)
    2.5 The Signal Anchoring Mechanism
3 Methodology
    3.1 The Georgetown University Multilayer (GUM) Corpus
    3.2 Annotation Tool
    3.3 Annotation Procedure
    3.4 Annotation Scheme
    3.5 Annotation Reliability
    3.6 The Taxonomy of Discourse Signals
    3.7 Examples of Signal Anchoring
4 Results and Analysis
    4.1 Overview
    4.2 Distribution of Signals regarding Relations
    4.3 Distribution of Signals across Genres
    4.4 Interim Conclusion
5 Conclusion
Bibliography

List of Figures

2.1 A Graphical Representation of an RST Analysis
2.2 Hierarchical Taxonomy of Signals in RST-SC (Fragment)
2.3 A Visualization of the Signaling Annotation Scheme
3.1 A Visualization of How Strongly Each Genre Signals in the GUM Corpus
3.2 A Nucleus-Satellite Restatement in GUM
3.3 A Multinuclear Restatement in GUM
3.4 Signal Annotation from RST-SC in the UAM Tool
3.5 A Visualization of the Annotation Interface in rstWeb
3.6 An Instance of Signal Annotation in rstWeb
3.7 Signals in a CCDT View
3.8 Signals in a Hierarchical View
3.9 A Visualization of Example (6)
4.1 Evaluation: Distribution of Signals
4.2 Justify: Distribution of Signals
4.3 An Example of Signaling Sequence Using World Knowledge

List of Tables

2.1 Different Approaches to Discourse Relations Summarized by Stede [26]
2.2 Classification of Subject Matter and Presentational Relations in RST
3.1 RST Relations used in the GUM Corpus
4.1 Distribution of Unanchored Relations
4.2 Distribution of Signal Types and its Comparison to RST-SC
4.3 Distribution of Most Common Signals regarding Relations
4.4 Examples of Anchored Tokens across Relations
4.5 Distribution of Signaled Relations across Genres
4.6 Examples of Anchored Tokens across Genres
Chapter 1

Introduction

Sentences do not exist in isolation, and the meaning of a text or a conversation is not merely the sum of all the sentences involved. In other words, an informative text consists of sentences whose meanings are relevant to each other rather than a random sequence of utterances. Moreover, some of the information in texts is not contained in any one sentence but in their arrangement. Therefore, a high-level analysis is required in order to facilitate effective communication in discourse, which could benefit both linguistic study and NLP applications. For instance, an automatic discourse parser that successfully captures how sentences are connected in texts could serve tasks such as information extraction and text summarization.

To be specific, a discourse is delineated in terms of relevance between textual elements. Linguists usually categorize such relevance into cohesion and coherence. Cohesion refers to linguistic means that link one element to another in a local discourse environment, such as connectives (e.g. ‘thus’, ‘but’, ‘afterwards’), related words (e.g. synonymy, meronymy, and hyponymy), and pronouns (i.e. identifying a pronoun’s antecedent). Coherence, on the other hand, refers to semantic or pragmatic linkages that hold between larger textual units in a discourse, such as Cause, Contrast, and Elaboration. Moreover, there are certain linguistic devices that systematically signal certain discourse relations: some are generic signals across the board while others are indicative of particular relations in certain domains.

For instance, consider the following example from the Georgetown University Multilayer (GUM) corpus [32] (footnote 1), in which the two textual units connected by the discourse marker but form a Contrast relation, meaning that the contents of the two textual units are comparable yet not identical.
(1) Related cross-cultural studies have resulted in insufficient statistical power, but interesting trends ( e.g., Nedwick, 2014 ). – Contrast [academic_implicature]

Generic signals are also ambiguous as they do not indicate strong associations with the relations they signal. For instance, there are three similar relations that can express adversativity: Contrast, Concession, and Antithesis. The relation Concession means that the writer acknowledges the claim presented in one textual unit but still claims the proposition presented in the other discourse unit, while Antithesis dismisses the former claim in order to establish or reinforce the latter. In spite of the differences in their pragmatic functions, these three relations can all frequently be signaled by the coordinating conjunction but: symmetrical Contrast as in (1), Concession as in (2), and Antithesis as in (3).

(2) The cardinal numbers included in this study were only one through five in order to avoid additional item variability, but larger numbers should be included in future research. – Concession [academic_implicature]

(3) NATO had never rescinded it, but they had and started some remilitarization. – Antithesis [interview_chomsky]

Footnote 1: The square brackets at the end of each example contain the document ID from which this example is extracted. Each ID consists of its genre type and one keyword assigned by the annotator at the beginning of the annotation task.

This introductory chapter provides an overview of the current project including its motivation, goals, and potential contribution to this line of research.

1.1 Motivation

Understanding discourse relations and their signaling information is a rewarding task since it can provide valuable linguistic insights into discourse parsing as well as language production and comprehension for psycholinguistic studies. The inventory of discourse relations varies within and across frameworks, as thoroughly demonstrated in Hovy and Maier [10], and researchers have hardly reached agreement on a standard set of relations. That said, a more fundamental question is how these discourse relations are established in the first place. The answer to this question can in turn contribute to the classification of discourse relations. For instance, Knott and Dale [13] suggested a bottom-up approach for determining a set of relations by identifying the cue phrases associated with them.
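
The general idea of working upward from cue phrases to relations can be illustrated with a minimal sketch. It is not Knott and Dale's procedure itself; it only indexes relation labels by the cue that signals them, using the three annotated instances of but from examples (1) to (3). The names observations, relations_by_cue, and ambiguous_cues are hypothetical and introduced here purely for illustration.

```python
from collections import defaultdict

# Toy data (an assumption for illustration): (cue phrase, relation) pairs
# taken from examples (1)-(3) in this chapter.
observations = [
    ("but", "Contrast"),    # example (1)
    ("but", "Concession"),  # example (2)
    ("but", "Antithesis"),  # example (3)
]


def relations_by_cue(pairs):
    """Group relation labels under the cue phrase that signals them."""
    index = defaultdict(set)
    for cue, relation in pairs:
        index[cue].add(relation)
    # Sort the labels so the result is deterministic.
    return {cue: sorted(relations) for cue, relations in index.items()}


def ambiguous_cues(index):
    """Return the cues that are associated with more than one relation."""
    return sorted(cue for cue, relations in index.items() if len(relations) > 1)


cue_index = relations_by_cue(observations)
print(cue_index)                  # relations attested for each cue
print(ambiguous_cues(cue_index))  # cues that signal several relations
```

On this toy input, but is indexed under all three adversative relations and is therefore reported as ambiguous, which mirrors the observation above that generic signals do not indicate strong associations with the relations they signal.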
