Signaling of Discourse Relations: Anchoring Discourse Signals across Genres

A Thesis submitted to the Faculty of the Graduate School of Arts and Sciences of Georgetown University in partial fulfillment of the requirements for the degree of Master of Science in Linguistics

By

Yang Liu, B.A.

Washington, DC
April 1, 2019

Copyright © 2019 by Yang Liu
All Rights Reserved

Signaling of Discourse Relations: Anchoring Discourse Signals across Genres

Yang Liu, B.A.

Thesis Advisor: Amir Zeldes, Ph.D.

Abstract

Discourse relations, also known as coherence or rhetorical relations, characterize the semantic or pragmatic relationships between clauses or sentences in discourse. Such relations are established in order to facilitate effective communication. In addition to the inventory of relations, previous research has also investigated how discourse relations are established or signaled. Discourse markers (DMs) are considered to be the most typical signals in discourse; however, focusing merely on DMs is inadequate as they can only account for a small number of relations in discourse. Thus, researchers have been exploring textual signals beyond DMs, as in the Penn Discourse Treebank 2.0 (PDTB, Prasad et al. [22]) and the Rhetorical Structure Theory Signalling Corpus (RST-SC, Das and Taboada [5]). Despite their different theoretical groundings and approaches to relation signaling, both corpora annotated the Wall Street Journal (WSJ) section of the Penn Treebank (PTB, Marcus et al. [19]), i.e. news articles. Nevertheless, previous work has suggested that signaling information is indicative of genres (e.g. Taboada and Lavid [28]; Zeldes [34]). Therefore, this project aims to anchor signaling devices in a more diverse corpus to demonstrate the inadequacy of signaling by DMs only, the abundance of open-class signals, and, more importantly, the distribution of signaling devices across genres.

Index words: discourse relations, relation signaling, signaling devices, genre-specific signals, corpus linguistics, linguistic annotation

Dedication

TO MOM & DAD I would like to dedicate this thesis to my beloved parents for nursing me with love, encouragement, and unconditional support. Thank you for always having faith in me.

Acknowledgments

I would like to express my great gratitude to Dr. Amir Zeldes for guiding me through the fascinating world of discourse relations signaling, inspiring me all through my research, and providing me the opportunity to work with him. Without his insightful guidance and tremendous help, I would not have been able to accomplish this thesis. I would also like to thank my faculty advisor Dr. Nathan Schneider for his invaluable and insightful feedback on my research proposal, which helped me navigate and comb through my ideas on discourse-related research. Special thanks go to Dr. Maite Taboada and Dr. Debopam Das, who patiently answered my questions regarding their research and encouraged me to follow up with this line of research. I would also like to thank Luke D. Gessler, a first-year Ph.D. student in computational linguistics at Georgetown University, for his generous help with the annotation interface. Without his prompt assistance, I would not have been able to process my data conveniently and efficiently. Last but not least, I would also like to thank the members of the Department of Linguistics, Georgetown University, including Erin Esch Pereira and Benjamin Croner, for their coordination through the process of this thesis. To my parents, I owe a debt of gratitude. I would like to thank my parents for their unconditional love and support and for always being there for me.

Table of Contents

1 Introduction
    1.1 Motivation
    1.2 Methodology
    1.3 Organization of the Thesis
2 Background
    2.1 Discourse Relations
    2.2 Relation Signaling
    2.3 Rhetorical Structure Theory (RST)
    2.4 The RST Signalling Corpus (RST-SC)
    2.5 The Signal Anchoring Mechanism
3 Methodology
    3.1 The Georgetown University Multilayer (GUM) Corpus
    3.2 Annotation Tool
    3.3 Annotation Procedure
    3.4 Annotation Scheme
    3.5 Annotation Reliability
    3.6 The Taxonomy of Discourse Signals
    3.7 Examples of Signal Anchoring
4 Results and Analysis
    4.1 Overview
    4.2 Distribution of Signals regarding Relations
    4.3 Distribution of Signals across Genres
    4.4 Interim Conclusion
5 Conclusion
Bibliography

List of Figures

2.1 A Graphical Representation of an RST Analysis.
2.2 Hierarchical Taxonomy of Signals in RST-SC (Fragment).
2.3 A Visualization of the Signaling Annotation Scheme.
3.1 A Visualization of How Strongly Each Genre Signals in the GUM Corpus.
3.2 A Nucleus-Satellite Restatement in GUM.
3.3 A Multinuclear Restatement in GUM.
3.4 Signal Annotation from RST-SC in the UAM Tool.
3.5 A Visualization of the Annotation Interface in rstWeb.
3.6 An Instance of Signal Annotation in rstWeb.
3.7 Signals in a CCDT View.
3.8 Signals in a Hierarchical View.
3.9 A Visualization of Example (6).
4.1 Evaluation: Distribution of Signals.
4.2 Justify: Distribution of Signals.
4.3 An Example of Signaling Sequence Using World Knowledge.

List of Tables

2.1 Different Approaches to Discourse Relations Summarized by Stede [26].
2.2 Classification of Subject Matter and Presentational Relations in RST.
3.1 RST Relations used in the GUM Corpus.
4.1 Distribution of Unanchored Relations.
4.2 Distribution of Signal Types and its Comparison to RST-SC.
4.3 Distribution of Most Common Signals regarding Relations.
4.4 Examples of Anchored Tokens across Relations.
4.5 Distribution of Signaled Relations across Genres.
4.6 Examples of Anchored Tokens across Genres.

Chapter 1

Introduction

Sentences do not exist in isolation, and the meaning of a text or a conversation is not merely the sum of all the sentences involved. In other words, an informative text contains sentences whose meanings are relevant to each other rather than a random sequence of utterances. Moreover, some of the information in texts is not included in any one sentence but in their arrangement. Therefore, a high-level analysis is required in order to facilitate effective communication in discourse, which could benefit both linguistic study and NLP applications. For instance, an automatic discourse parser that successfully captures how sentences are connected in texts could serve tasks such as information extraction and text summarization. To be specific, a discourse is delineated in terms of relevance between textual elements. Linguists usually categorize such relevance into cohesion and coherence. Cohesion refers to linguistic means that link one element to another in a local discourse environment, such as connectives (e.g. ‘thus’, ‘but’, ‘afterwards’), related words (e.g. synonymy, meronymy and hyponymy), and pronouns (i.e. identifying a pronoun’s antecedent). Coherence, on the other hand, refers to semantic or pragmatic linkages that hold between larger textual units in a discourse, such as

Cause and Elaboration. Moreover, there are certain linguistic devices that systematically signal certain discourse relations: some are generic signals across the board while others are indicative of particular relations in certain domains.

For instance, consider the following example from the Georgetown University Multilayer (GUM) corpus [32],¹ in which the two textual units connected by the discourse marker but form a Contrast relation, meaning that the contents of the two textual units are comparable yet not identical.

(1) Related cross-cultural studies have resulted in insufficient statistical power, but interesting trends (e.g., Nedwick, 2014). – Contrast [academic_implicature]

Generic signals are also ambiguous, as they do not indicate strong associations with the relations they signal. For instance, there are three similar relations that can express adversativity: Contrast, Concession, and Antithesis. The relation Concession means that the writer acknowledges the claim presented in one textual unit but still maintains the claim presented in the other discourse unit, whereas Antithesis dismisses the former claim in order to establish or reinforce the latter. In spite of the differences in their pragmatic functions, these three relations can all be frequently signaled by the coordinating conjunction but: symmetrical Contrast as in (1), Concession as in (2), and Antithesis as in (3).

(2) The cardinal numbers included in this study were only one through five in order to avoid additional item variability, but larger numbers should be included in future research. – Concession [academic_implicature]

(3) NATO had never rescinded it, but they had and started some remilitarization. – Antithesis [interview_chomsky]

¹ The square brackets at the end of each example contain the document ID from which the example is extracted. Each ID consists of its genre type and one keyword assigned by the annotator at the beginning of the annotation task.

This introductory chapter provides an overview of the current project, including its motivation, goals, and potential contribution to this line of research.

1.1 Motivation

Understanding discourse relations and their signaling information is a rewarding task since it can provide valuable linguistic insights into discourse parsing as well as language production and comprehension for psycholinguistic studies. The inventory of discourse relations varies within and across frameworks, as thoroughly demonstrated in Hovy and Maier [10], and researchers have hardly reached an agreement on defining a standard set of relations. That said, a more fundamental and underlying question to ask is how these discourse relations are established in the first place. The answer to this question can in turn contribute to the classification of discourse relations. For instance, Knott and Dale [13] suggested a bottom-up approach for determining a set of relations by identifying the cue phrases associated with them. Likewise, Knott and Sanders [14] conducted a study examining linguistic devices that are used to signal relations explicitly. However, what they identified as cue phrases is limited to what we usually refer to as “discourse markers” today, namely coordinating conjunctions (e.g. ‘but’), subordinating conjunctions (e.g. ‘because’), and adverbials (e.g. ‘instead’). Due to the lack of adequate research on the signaling of discourse relations beyond DMs, Taboada and Das [27] undertook a pilot study exploring other textual signals besides DMs and provided a complete hierarchical taxonomy of discourse signals used in the RST Signalling Corpus (RST-SC, Das and Taboada [5]), which is built over the RST Discourse Treebank (RST-DT, Carlson et al. [2]). Similarly, the Penn Discourse Treebank 2.0 (PDTB, Prasad et al. [22]) also attempts to represent other types of signals besides DMs, called Alternative Lexicalizations. Despite their different theoretical groundings and approaches to relation signaling, both corpora annotated only the Wall Street Journal (WSJ) section of the Penn Treebank (PTB, Marcus et al. [19]), which contains only WSJ news articles. Moreover, as suggested by Taboada and Lavid [28] and Zeldes [34], some discourse signals are indicative of certain relations and genres. For instance, Taboada and Lavid [28] showed how to characterize appointment-scheduling dialogues using their rhetorical and thematic patterns as linguistic evidence and suggested that the rhetorical and thematic analysis of their data can be interpreted functionally as indicative of this type of task-oriented conversation. Furthermore, the study of the classification of discourse signals can serve as valuable evidence for investigating their role in discourse as well as the relations they signal. One limitation of the RST Signalling Corpus is that no information about the location of signaling devices was provided (see Section 2.4 for details). As a result, Liu and Zeldes [15] presented an annotation effort to anchor discourse signals for both elementary and complex units on a small set of documents in RST-SC (see Section 2.5 for details). Therefore, the current project aims to fully uncover discourse signals at all levels in more genres to provide more richly annotated data for discourse parsing and empirical evidence for psycholinguistic research. More importantly, the current research aims to investigate the distribution of signaling information across different genres and provide analyses of genre-specific signals as well as generic ones.

1.2 Methodology

In order to substantiate the claims regarding genre-specific signaling devices, a corpus study is conducted on the Georgetown University Multilayer (GUM) corpus [32]. Then, a set of quantitative analyses of the annotated data is conducted from different angles. First, each relation attested in this corpus study is examined with its associated signaling devices. Second, the distribution of signals and their corresponding discourse relations within the same genre is analyzed. Third, the distribution of signals across genres is studied to examine the variation of signals in each genre. Finally, a comparison of the distributions of signals between the RST Signalling Corpus and the GUM corpus is also provided.

1.3 Organization of the Thesis

The rest of this thesis is organized as follows: Chapter 2 provides an overview of the concepts and theoretical framework underlying the current study. Chapter 3 gives a detailed illustration of the corpus, the annotation tool, procedure, and reliability, and the adaptability of the taxonomy of discourse signals, and provides several concrete examples of signal anchoring for different types. Chapter 4 provides both qualitative and quantitative analyses regarding the distribution of signals across relations and genres respectively. The thesis concludes with Chapter 5 by summarizing and pointing out the limitations of the current study and addressing some future directions.

Chapter 2

Background

This chapter provides an overview of the essential concepts associated with this study as well as the underlying theoretical grounding, Rhetorical Structure Theory (RST, Mann and Thompson [17]). The chapter is organized as follows: Section 2.1 illustrates the aspects of discourse relations. Section 2.2 discusses current research on relation signaling. Section 2.3 provides a brief overview of the theoretical framework used in this study. Finally, this chapter concludes with the introduction of the RST Signalling Corpus and the signal anchoring mechanism employed in this project.

2.1 Discourse Relations

Generally speaking, there is no debate on the existence of discourse relations. However, there is basically no agreement among researchers on how exactly such relations are defined [14]. In other words, no unified set of discourse relations exists, and it is difficult to justify one set of relations over another. In fact, researchers have adopted different perspectives on investigating discourse relations, which are not necessarily completely disjoint. Stede [26] characterized those different perspectives as six categories, shown in Table 2.1. For a more detailed account of each of these perspectives, interested readers can refer to the corresponding literature listed in Table 2.1. In particular, Hovy and Maier [10] conducted a comprehensive study on how many and which discourse structure relations we should use in analyzing discourse

Table 2.1: Different Approaches to Discourse Relations Summarized by Stede [26].

Approach to Recognizing Relations             Representative Work & Theory
Be skeptical and parsimonious                 Grosz and Sidner [9]
Resort to insights from philosophy            Kehler [11]
Be inspired by the lexicon of your language   Knott and Dale [13]; Martin [20]; PDTB, Prasad et al. [24]
Be motivated by syntax and semantics          SDRT, Asher and Lascarides [1]
Try to explain human cognition                Kintsch [12]; Sanders et al. [25]
Be inspired by authentic texts                RST, Mann and Thompson [17]

by collecting and merging relations from different sources and then proposing a set of relations organized into a taxonomy of specificity to ensure that more detailed or specific relations can be properly blended into this classification when needed. They also proposed a solution to the organization of these relations: a functionally motivated hierarchy, which is defined in terms of the primary function the relations perform in text and is composed of three metafunctions: ideational (i.e. semantic), interpersonal, and textual (i.e. presentational) [10]. Specifically, ideational relations such as Circumstance, Contrast, and Elaboration are used to express states of affairs in the world, not including the interlocutors; interpersonal relations such as Antithesis, Justification, and Motivation are expressed by various perlocutionary acts to affect readers’ attitudes and beliefs; and textual or presentational relations such as Joint and Sequence are characterized by the juxtaposition presented in texts [10], as shown in the example below, where the three juxtaposed segments are linked via the connectives First, Second and Finally. This taxonomy nicely represents the high-level discourse segment relations and is also the backbone of the relations used in the GUM corpus. The current project will use the relation inventory of the GUM corpus since a corpus with different genres is needed for the purpose of this study.

(1) First, summary statistics of the study variables and racial categories were produced. Second, we examined the relative proportions of the two discrimination experience measures across each racial category. Finally, we assessed the distribution of reported reasons for discrimination across the racial categories. – Sequence [academic_discrimination]

2.2 Relation Signaling

Relation signaling is a crucial and rewarding task since it can benefit both linguistics research and NLP applications. However, the first question to ask is what a signal is. In general, signals are the means by which humans identify the realization of discourse relations. The most typical signal type is discourse markers (DMs) (e.g. ‘although’, ‘nevertheless’), as they provide explicit and direct linking information between clauses and sentences. Moreover, several corpora have been exploring signaling information other than DMs, such as the Penn Discourse Treebank (PDTB, Prasad et al. [22]), in which the lexicalized annotations have led to the discovery of a wide range of expressions called Alternative Lexicalizations (AltLex) [23], and the Rhetorical Structure Theory Signalling Corpus (RST-SC, Das and Taboada [5]), which is built over the RST Discourse Treebank (RST-DT, Carlson et al. [2]). In addition, the presence or absence of DMs divides discourse relations into explicit and implicit ones (e.g. Knott and Dale [13]; Taboada and Mann [30]). For instance, the two textual units in (2) are coherently connected by the DM while, which signals a Concession relation. The two sentences in (3) form an Antithesis relation with no obvious or explicit DM present, and therefore this relation would be considered an implicit relation by convention.

(2) While the Museum of Flight was in the top running, I’m disappointed that NASA did not choose them. – Concession [news_nasa]

(3) At some point or another, most people realize that the world doesn’t revolve around them. Arrogant people counteract this by creating an atmosphere that revolves around them, and get angry if they’re reminded of the real world. – Antithesis [whow_arrogant]

Intuitively, DMs are the most obvious and direct linguistic means that signal discourse relations, and therefore a lot of research has been done on DMs. Nevertheless, focusing merely on DMs is inadequate, as DMs can only account for a small number of relations in discourse. To be specific, Das and Taboada [4] reported that among all the 19,847 signaled relations (92.74%) in RST-SC (i.e. 385 documents and all 21,400 annotated relations), relations exclusively signaled by DMs only account for 10.65% of the relations, whereas 74.54% of the relations are exclusively signaled by other signals, corresponding to the types they proposed, which will be discussed in detail in Section 2.4.

2.3 Rhetorical Structure Theory (RST)

Rhetorical Structure Theory (RST, Mann and Thompson [17]) is a well-known theoretical framework that extensively investigates discourse relations and is adopted by Das and Taboada [4] and the present study. As listed in Table 2.1, RST was motivated by the authenticity of texts. It is a functional theory of text organization that identifies hierarchic structure in text. The original goals of RST were discourse analysis and proposing a model for text generation; however, due to its popularity, it has been applied to several other areas such as theoretical linguistics, psycholinguistics, and computational linguistics [29].

RST identifies hierarchic structure and nuclearity in text, categorizing relations into two structural types: Nucleus-Satellite and Multinuclear. The Nucleus-Satellite structure reflects a hypotactic relation whereas the Multinuclear structure is a paratactic relation [27]. Moreover, RST identifies textual units as Elementary Discourse Units (EDUs), which are non-overlapping, contiguous spans of text that relate to other EDUs [32]. EDUs can also form hierarchical groups known as complex discourse units. For instance, example (4) below is an excerpt from the GUM corpus, segmented into EDUs. The graphical representation of the RST analysis of (4) is provided in Figure 2.1. It shows that the text consists of five spans, which are represented in the diagram in Figure 2.1 by the cardinal numbers 50, 51, 52, 53 and 54, respectively. In the diagram, a span being pointed to is referred to as the nucleus span, and a span being pointed from is referred to as the satellite span. Span 51 (satellite) is connected to Span 52 (nucleus) by a Purpose relation, and together they make the combined Span 51-52. Span 53 (nucleus) and Span 54 (nucleus) are in a multinuclear Joint relation, and together they make the combined Span 53-54. Span 51-52 (satellite) is connected to Span 53-54 (nucleus) by a Preparation relation, and together they make the combined Span 51-54, which is connected to Span 50, forming a Preparation relation.

(4) [50] Demographic variables. [51] To provide information on the analytical sample as a whole, [52] two additional demographic variables are included. [53] First, age is a continuous measure created by subtracting the year of the respondents’ birth (obtained from Wave 1) from the year of the interview at Wave 4. [54] Second, sex was dichotomously coded based on the self-reported sex of the respondent at Wave 4 (0 = female and 1 = male). [academic_discrimination]

Figure 2.1: A Graphical Representation of an RST Analysis.
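The bottom-up composition of spans described above can be sketched as a nested structure. Below is a minimal sketch in Python, assuming a simple dict encoding rather than the actual GUM/rstWeb serialization; the satellite role of Span 50 is an assumption here, inferred from its function as a preparatory heading.

```python
def span(label, children=None, relation=None, role=None):
    """An RST span: an EDU (no children) or a complex discourse unit."""
    return {"label": label, "children": children or [],
            "relation": relation, "role": role}

# Span 51 (satellite) is connected to Span 52 (nucleus) by Purpose.
s51_52 = span("51-52", [span("51", relation="Purpose", role="satellite"),
                        span("52", role="nucleus")])

# Spans 53 and 54 form a multinuclear Joint.
s53_54 = span("53-54", [span("53", relation="Joint", role="nucleus"),
                        span("54", relation="Joint", role="nucleus")])

# Span 51-52 (satellite) is connected to Span 53-54 (nucleus) by Preparation.
s51_54 = span("51-54", [dict(s51_52, relation="Preparation", role="satellite"),
                        dict(s53_54, role="nucleus")])

# Span 51-54 is in turn connected to Span 50 by another Preparation relation
# (Span 50, the heading, is assumed here to be the preparatory satellite).
tree = span("50-54", [span("50", relation="Preparation", role="satellite"),
                      dict(s51_54, role="nucleus")])

def edus(node):
    """Collect the leaf EDU labels of a tree, left to right."""
    if not node["children"]:
        return [node["label"]]
    return [label for child in node["children"] for label in edus(child)]

print(edus(tree))  # -> ['50', '51', '52', '53', '54']
```

Traversing the leaves recovers the linear order of the EDUs, mirroring the requirement that complex discourse units span contiguous, non-overlapping stretches of text.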

The inventory of relations used in the RST framework varies widely, and therefore the number of relations in an RST taxonomy is not fixed. The original set of relations defined by Mann and Thompson [17] included 23 relations, which fall into two types based on their rhetorical effect: subject matter relations (16) and presentational relations (7), nicely corresponding to the taxonomy proposed by Hovy and Maier [10]. Subject matter relations mean that the intended effect comes from the readers’ (or speakers’) recognition of the relation in question, whereas presentational relations realize the intended effect by altering the reader’s disposition. Table 2.2 below demonstrates the two types as well as their corresponding relations used in the original RST.

Table 2.2: Classification of Subject Matter and Presentational Relations in RST.

Subject Matter          Presentational
Circumstance            Background
Solutionhood            Enablement
Elaboration             Motivation
Volitional Cause        Evidence
Non-Volitional Cause    Justify
Volitional Result       Antithesis
Non-Volitional Result   Concession
Purpose
Condition
Otherwise
Interpretation
Evaluation
Restatement
Summary
Sequence
Contrast

The RST website (http://www.sfu.ca/rst) includes an extended set of 30 relations. Moreover, in building corpora with discourse relations, different inventories of relations have been implemented. For instance, the RST Discourse Treebank (RST-DT, Carlson et al. [2]) uses a large set of 78 relations, which are divided into 16 relation groups.

2.4 The RST Signalling Corpus (RST-SC)

As demonstrated in Section 2.2, DMs alone cannot provide a full picture of relation signaling, which motivated the investigation of other types of signals and led to the production of the RST Signalling Corpus by Das and Taboada [5]. According to their hierarchical taxonomy of discourse signals (see Figure 2.2 for an illustration, reproduced from Das and Taboada (2017:752) [4]), signals can be single, combined, multiple or unsure. A single signal indicates that the discourse relation is signaled by one and only one type of signal, such as discourse markers (DMs), reference, lexical, semantic, morphological, syntactic, graphical, genre, and numerical; a combined signal means that two or more single signals are combined with each other in order to jointly signal the relation, such as reference+syntactic, semantic+syntactic, and lexical+syntactic; the category multiple means that a discourse relation can be signaled by different kinds of signals independently; and the category unsure is used to indicate that no signals seem to signal the relation clearly. Based on their analysis, Das and Taboada [4] found that 92.74% of the relations (i.e. 19,847 out of 21,400) in the RST-DT corpus are signaled. By examining the distribution of the signaled relations (i.e. 19,847), they found that only 10.65% of the relations are exclusively signaled by DMs. On the other hand, 74.54% of the relations are solely signaled by signal types other than DMs. In addition, they observed that discourse relations are signaled by various signals. Since RST-SC does not support locating signal tokens, they were only able to find the association at the level of single and combined and their corresponding types and subtypes. For instance, they found that a Contrast relation is commonly signaled by two semantic signals such as antonym and lexical chain, but the positions of individual tokens corresponding to these signal annotations in each instance were not annotated.
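The top level of this taxonomy, together with the signaled-relation share reported above, can be summarized in a short sketch using plain Python containers; the type and combination labels follow the description above, and the many subtypes of each type are omitted here.

```python
# Top level of the RST-SC signal taxonomy (subtypes omitted).
SIGNAL_TAXONOMY = {
    "single": ["discourse marker", "reference", "lexical", "semantic",
               "morphological", "syntactic", "graphical", "genre", "numerical"],
    # "combined" pairs two or more single types; only a few are listed here.
    "combined": ["reference+syntactic", "semantic+syntactic",
                 "lexical+syntactic"],
    "multiple": "several kinds of signals independently signal one relation",
    "unsure": "no signal seems to indicate the relation clearly",
}

# Sanity check of the signaled-relation share reported by Das and Taboada [4]:
# 19,847 signaled relations out of 21,400 annotated relations.
signaled, total = 19_847, 21_400
print(f"{signaled / total:.2%}")  # -> 92.74%
```

The arithmetic confirms the 92.74% figure cited from Das and Taboada [4]; the DM-only (10.65%) and non-DM (74.54%) shares are reported as percentages of the 19,847 signaled relations.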

Figure 2.2: Hierarchical Taxonomy of Signals in RST-SC (Fragment).

2.5 The Signal Anchoring Mechanism

Figure 2.3: A Visualization of the Signaling Annotation Scheme.

As mentioned in Section 1.1, and unlike PDTB, RST-SC does not provide information about the location of discourse signals. Thus, Liu and Zeldes [15] presented an annotation effort to anchor signal tokens in the text, with six categories being annotated, as shown in Figure 2.3 (reproduced from Liu and Zeldes (2019:315)). Their results showed that, with 11 documents and 4,732 tokens, 923 instances of signals were anchored in the text, accounting for over 92% of discourse signals, with the signal type semantic representing the most cases (41.7% of signaling anchors), whereas discourse relations anchored by DMs accounted for only about 8.5% of anchor tokens in this study, unveiling the value of signal identification and anchoring.

Chapter 3

Methodology

3.1 The Georgetown University Multilayer (GUM) Corpus

Figure 3.1: A Visualization of How Strongly Each Genre Signals in the GUM Corpus.

The main goal of this project is to anchor and compare discourse signals across genres, which makes the Georgetown University Multilayer (GUM) corpus the optimal candidate, in that it consists of eight genres: interviews, news stories, travel guides, how-to guides, academic papers, biographies, fiction, and forum discussions. Each document is annotated on multiple layers including, but not limited to, dependencies (dep), entities and coreference (ref), and rhetorical structures (rst). For the purpose of this study, the rst layer is used, as it includes annotation of discourse relations; signaling information will be anchored to it in order to produce a new layer of annotation. However, it is worth noting that the other annotation layers are great resources for delving into discourse signals on other levels. Moreover, due to time limitations and the fact that this is the first attempt to apply the taxonomy of signals and the annotation scheme to genres outside RST-DT’s newswire texts, only four genres in the GUM corpus were selected: academic, how-to guides, interviews, and news, comprising a collection of 12 documents annotated for discourse relations. The rationale for choosing these genres is as follows. According to the neural approach of Zeldes [33] to discourse relation signaling on the GUM corpus, in which a Bi-LSTM/CRF architecture was employed to predict signals, how-to guides and academic articles in the GUM corpus signal most strongly, with interviews and

news articles slightly below the average and fiction and reddit texts the least signaled, as shown in Figure 3.1 (reproduced from Zeldes (2018:19)) [34]. Moreover, the genre interview was chosen because it is a type of spoken language, which distinguishes it from the other three written genres. As a result, these four genres are a good starting point for the topic under discussion. The annotation was done using a discourse signal annotation system built on rstWeb [31], a web application that allows collaborative, online annotation of RST trees. As for the inventory of relations used in this corpus, a smaller set than RST-DT’s was chosen because GUM is built in a course with limited annotation time, as illustrated in Zeldes [32]. However, this smaller set retains the higher-level clusters and includes relations frequently used in other corpora, approximating the coverage of a larger, finer-grained set. Specifically, 20 relations are used in the GUM corpus, grouped into two structural types: 16 Nucleus-Satellite and 4 Multinuclear relations, as shown in Table 3.1.

Table 3.1: RST Relations used in the GUM Corpus.

Satellite-Nucleus                        Multinuclear
Antithesis, Background, Cause,          Contrast, Joint,
Circumstance, Concession, Condition,    Restatement, Sequence
Elaboration, Evaluation, Evidence,
Justify, Motivation, Preparation,
Purpose, Restatement, Result,
Solutionhood

Note that the relation Restatement is considered distinct in these two structural types. A nucleus-satellite Restatement means that only partial content in the nucleus span is reiterated by the satellite span, as shown in Figure 3.2; a multinuclear Restatement is used when neither part of the restatement is clearly more prominent, as shown in Figure 3.3.

Figure 3.2: A Nucleus-Satellite Restatement in GUM.
Figure 3.3: A Multinuclear Restatement in GUM.

3.2 Annotation Tool

One of the reasons for the low inter-annotator agreement (IAA) in Liu and Zeldes [15] is the inefficient and error-prone annotation tools that were used: no designated tools were available for the signal anchoring task at the time. As a result, annotators had to use a tabular grid-based interface called GitDox [35], in addition to the UAM CorpusTool [21] used in the RST Signalling Corpus (see Figure 3.4), which shows signal type annotations applied to relations but without marking associated tokens in the text. We therefore developed a better tool tailored to the purpose of the annotation task: A Discourse Signal Annotation System for RST Trees [7]. Figure 3.5 provides a demonstration of the interface in rstWeb. Specifically, each relation is accompanied by an "S" button. Once it is clicked, a sidebar opens in which annotators can choose the type and subtype of the signal and highlight the corresponding tokens; the number of signals associated with the relation is also indicated on the button. For instance, in Figure 3.6, a Cause relation is signaled by the discourse marker because, and there is only one signal anchored for this relation.

3.3 Annotation Procedure

The annotation process consists of signal identification and anchoring. For each discourse relation, the annotator searches for possible signals. For each signal, the following categories are annotated: signal type and signal subtype. Once signal identification is done, signal anchoring is performed by highlighting associated tokens in the text.

Figure 3.4: Signal Annotation from RST-SC in the UAM Tool.

Figure 3.5: A Visualization of the Annotation Interface in rstWeb. Figure 3.6: An Instance of Signal Annotation in rstWeb.

3.4 Annotation Scheme

Since most of the signals anchored in texts are open-class, it is difficult to achieve high agreement between annotators. Many decisions must be made regarding which and how many tokens should be selected. Moreover, these decisions also take technical practices (e.g. computational complexity) into account. The rest of this section discusses the rationale behind this annotation scheme by delineating several particular cases.

3.4.1 Annotation of Syntactic Signals

For syntactic signals, one of the questions we are exploring is which of these are actually attributable to sequences of tokens, and which are not. For example, sequences of auxiliaries or constructions like imperative clauses might be identifiable, but more implicit and versatile syntactic constructions, such as ellipsis, are not. In addition, one of the objectives of the current project is to provide human-annotated data in order to see how the results produced by machine learning techniques compare to humans' judgments. In particular, we are interested in whether or not contemporary neural models have a chance to identify the constructions that humans use to recognize discourse relations in text based on individual sequences of word embeddings. The underlying idea of word embeddings, "You shall know a word by the company it keeps", was popularized by Firth (1957:11) [6]. Word embedding is a language modeling technique that converts words into vectors of real numbers used as the input representation to a neural network model, based on the idea that words appearing in similar environments should be represented as close in vector space. For instance, Zeldes [33] employed a Recurrent Neural Network (RNN) model to learn about connectives and other signaling devices by quantifying the importance of each word in a given

sequence for the task of relation classification. Specifically, a bi-LSTM (bidirectional Long Short-Term Memory) feeding into a CRF (Conditional Random Fields) decoder (i.e. bi-LSTM/CRF) was used to identify the words most attended to by the neural model (see Greff et al. [8] and Ma and Hovy [16] for technicalities of the model; see Zeldes (2018:181-183) [33] for some initial results). It is worth noting that the network used in Zeldes [33] does not learn to detect signals by training on signaling annotation, but rather learns to recognize relations and outputs signals as a by-product. Therefore, it is relatively easy to capture discourse markers such as 'then' or a relative pronoun 'which' signaling an Elaboration. The challenge is to figure out what features the network needs to know about beyond just word forms, such as meaningful repetitions and versatile syntactic constructions. With the human-annotated data from the current project, it is hoped that more insights into these aspects can help us engineer meaningful features in order to build a more informative computational model.
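The distributional intuition behind word embeddings can be illustrated with a minimal sketch. The toy corpus, the co-occurrence window, and the count-based vectors below are illustrative assumptions, not the GUM data or the bi-LSTM/CRF model discussed above; real embeddings are dense learned vectors, but the principle that words occurring in similar environments end up close in vector space is the same.

```python
from collections import Counter
from math import sqrt

# Toy corpus (invented for illustration): "because" and "since" occur in
# similar causal contexts, while "banana" does not.
corpus = [
    "he left because it rained",
    "he left since it rained",
    "she stayed because it snowed",
    "she stayed since it snowed",
    "he ate a banana for lunch",
]

def context_vector(word, sentences, window=2):
    """A crude sparse 'embedding': counts of co-occurring words within +/-window tokens."""
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        counts[toks[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

v_because = context_vector("because", corpus)
v_since = context_vector("since", corpus)
v_banana = context_vector("banana", corpus)

# "because" and "since" share all their contexts here, so they are maximally close;
# "banana" shares none of them.
print(cosine(v_because, v_since))   # 1.0 on this toy corpus
print(cosine(v_because, v_banana))  # 0.0: no shared contexts
```

A neural relation classifier consumes such vectors (in dense, learned form) rather than raw word forms, which is why distributionally similar connectives can be treated alike even when unseen in training.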

3.4.2 Annotation of Signals for Relations between Intermediate Spans

The current annotation scheme only marks the repetitive cases relevant to the instance of the relation in question. This is to ensure that the set of possible repetitive instances is arguably a closed set. Nonetheless, for higher-level relations such as Preparation and Elaboration, whether or not the tokens annotated for lower-level relations should be doubly annotated for relations corresponding to intermediate spans is a hard decision to make. Broadly speaking, there are two approaches to this problem: the Compositionality Criterion for Discourse Trees (CCDT) proposed by Marcu [18] and a hierarchical view, an essential characteristic of RST. The CCDT hypothesizes that if a relation holds between two blocks of EDUs, then it also holds between their

head EDUs. In our case, this means that the repetitive object or entity signaling the relation in question is only marked in the head EDUs. For instance, as shown in

Figure 3.7, one of the signals indicating the relation Elaboration is the repetition of the person Sarvis. A CCDT approach means that only the occurrences in the head EDUs are marked, that is, Robert Sarvis in Span 3 (because Span 3 is the head EDU of the complex unit 3-7) and Sarvis in Span 8 are annotated. On the other hand, if we take a hierarchical view, then all instances of Sarvis should be marked in addition to the one in the head EDU (i.e. Sarvis in Span 4 and Span 7), since they are constituents of Span 3-7, as shown in Figure 3.8. However, this could possibly result in overgenerating spurious signals, which might increase the complexity of computation. The current annotation scheme follows the CCDT.

3.4.3 Annotation of Multinuclear Relations

A dilemma that came up repeatedly during discussions about signal anchoring was whether or not to mark the first constituent of a multinuclear relation. For instance, as can be seen in Figure 3.9, which is the visualization of example (6), the first Joint relation is left unsignaled/unmarked, and the other Joint relations are signaled. The rationale is that when presented with a parallelism, the reader only notices it from the second instance. As a result, signals are first looked for between the first two spans, and then between the second and the third, etc. If there is no signal between the second and the third spans, signals are then sought between the first and the third spans. Because this is a multinuclear relation, transitivity does exist between spans. Moreover, the current approach is also supported by the fact that a multinuclear relation is often found in a structure like X, Y and Z, in which the discourse marker and occurs between the last two spans, and thus this and is only annotated for the relation between the

Figure 3.7: Signals in a CCDT View.

Figure 3.8: Signals in a Hierarchical View.

Figure 3.9: A Visualization of Example (6).

last two spans but not between the first two spans. However, the problem with this approach is that the original source for the parallelism cannot be located.

3.4.4 One type vs. Separate Types

It is likely that a relation is indicated by several identical signal types and subtypes associated with different tokens. This raises the question of whether all tokens should be annotated under one group of type and subtype, or whether every different instance of the same type should be enumerated and marked as a separate signal. The current annotation scheme employs the latter option from a technical point of view. However, one may argue that this approach could overgenerate signal types and subtypes.

3.5 Annotation Reliability

In order to assess the reliability of the scheme, a revised inter-annotator agreement study was conducted using the same metric and the new interface on three documents from RST-SC, containing 506 tokens with just over 90 signals. Specifically, agreement is measured based on token spans: for each token, whether the two annotators agree that it is signaled or not. The results demonstrate an improvement in Kappa, 0.77 as opposed to the previous Kappa of 0.52 in Liu and Zeldes [15].
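The token-span agreement measure described above can be sketched as Cohen's kappa over binary per-token decisions. The ten-token label sequences below are invented for illustration and are not the study's data; only the shape of the computation is intended.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary token labels
    (1 = token is part of a signal span, 0 = not)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    pa, pb = sum(labels_a) / n, sum(labels_b) / n              # marginal "signaled" rates
    p_e = pa * pb + (1 - pa) * (1 - pb)                        # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token-level decisions for a ten-token stretch of text
ann1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
ann2 = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]
print(round(cohen_kappa(ann1, ann2), 2))  # 0.78
```

Kappa corrects raw token agreement for chance, which matters here because most tokens are unsignaled, so two annotators agree on many tokens by default.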

3.6 The Taxonomy of Discourse Signals

The most crucial task in signaling annotation is the selection of signal types. The taxonomy of discourse signals used in this project follows the work developed by Das and Taboada [4], with additional types and subtypes proposed in order to better adapt the scheme to other genres. To be specific, two new types and four new subtypes of the existing types are proposed. The two new types are Visual and Textual, where the subtype of the former is Image and the subtypes of the latter are Title, Date, and Attribution. The four new subtypes of existing types are Modality under the type Morphological, and Academic article layout, Interview layout and Instructional text layout under the type Genre. Their definitions are listed below, followed by examples selected from the current project1.

1. Visual Features

(a) Image Definition: The first span is elaborated via image(s) present in the second span. E.g.,

1Types and subtypes in boldface are newly proposed.

• [File image of Space Shuttle Atlantis lifting off, which approaches its last mission before retirement.]N [Image: NASA.]S – Elaboration [news_nasa]

2. Textual Features Some people may argue that the textual features presented below should not be considered separate categories, as they more or less overlap with the subtypes of the genre features. However, the current study proposes that these categories are signals in their own right, in that they are not always indicative of genres. For instance, reading the title alone in example (2a) below does not guarantee that the text is an academic article; it could also be a news article. Likewise, a date alone in example (2b) cannot ensure that the text is a news article; it could also be a personal diary. Nevertheless, this is not to say that all titles, dates, and attribution information are irrelevant for genres. For instance, example (2c) is a borderline case. The email addresses as well as the institution information following the authors' names make this text more likely to be an academic article than a news article. However, if we only see the authors' names, it becomes harder to determine the genre. Conversely, when reading an academic paper, we know the convention of introducing authors and affiliations and we interpret the text accordingly.

(a) Title Definition: One EDU or a group of EDUs serving as the satellite span (simple or complex) contains the title of the text, which prepares readers for the rest of the text. E.g.,

2Conventions for interpreting the examples: The text within square brackets denotes an EDU. The subscript N stands for nucleus, and the subscript S stands for satellite. When a multinuclear relation is present, the subscripts n1 and n2 are used. A pair of EDUs is respectively followed by a dash and the name of the discourse relation that holds between the EDUs. As previously mentioned, the square brackets at the end of each example contain the document ID, which consists of its genre and one key word.

• [Digital Humanities Clinics - Leading Dutch Librarians into DH]S [... In 2015, an initiative was started to set up a Dutch speaking DH+Lib community in the Netherlands and Belgium, based on the example of the American communal space of librarians, archivists, LIS graduate students, and information specialists to discuss topics ‘Where the Digital Humanities and Libraries meet’. ...]N – Preparation [academic_librarians]

(b) Date Definition: A satellite span contains a date that specifies circumstances (i.e. time) of the nucleus span. E.g.,

• [Thursday, February 23, 2006]S [Almost half of all Australian primary school children are mild to moderately iodine deficient, researchers say. A new study ...]N – Circumstance [news_iodine]

(c) Attribution Definition: A satellite span contains information about the authors of the text, which helps readers understand the nucleus span. E.g.,

• [Michiel Cock [email protected] Vrije Universiteit Amsterdam, the Netherlands; Lotte Wilms [email protected] National Library of the Netherlands, the Netherlands]S [In 2015, an initiative was started to set up a Dutch speaking DH+Lib community in the Netherlands and Belgium, based on the example of the American communal space of librarians, archivists, LIS graduate students, and information specialists to discuss topics ‘Where the Digital Humanities and Libraries meet’. ...]N – Background [academic_librarians]

3The ellipsis used in the examples indicates the omission of text, as the EDUs containing the title are usually connected to the rest of the text, which cannot be presented in full here.

3. Morphological Features Morphological features contain only one subtype in the original taxonomy of discourse signals proposed by Das and Taboada [4], the tense feature, which refers to a change of tense, aspect or mood between the relevant clauses in the respective spans, according to the RST-SC Annotation Manual [3]. However, the nomenclature seems misleading in that, linguistically speaking, tense, aspect, and mood are three distinct categories. Moreover, as shown in this project, modal verbs themselves are useful signals that do not necessarily involve a change between EDUs in respective spans. As a result, the modality feature is added and separated from the tense feature to better demonstrate its effect in discourse.

(a) Modality Definition: The modality feature refers to the use of a modal verb in one span that expresses the necessity or possibility of the proposition(s) in the other span(s). E.g.,

• [You can write your jokes down on index cards to keep them handy or use a document file on your computer.]N [The latter option may allow for easier revision.]S – Motivation [whow_joke]

4. Genre Features There are three new subtypes added to the genre features, as the taxonomy is adapted to three new genres, all of which have their own distinct characteristics. In addition, as can be seen from the following examples, some genre-related features can be anchored to tokens while some cannot.

(a) Academic article layout Definition: Textual or visual features that help to understand the organization of an academic article, such as the section headings, body of text, and information about the author(s). E.g.,

• [5. Conclusion.]S [Albeit limited, these results provide valuable insight into SI interpretation by Chitonga-speaking children and demonstrate that pragmatic inference acquisition likely follows the order identified in previous research, but appears to be completed at a later age in this language. ...]N – Preparation [academic_implicature]

(b) Interview layout Definition: Textual, visual or conversational features that help to identify the structure of an interview (e.g. the turns between speakers in a dialogue). E.g.,

• [We’ve seen mass strikes all around the world, in countries that we wouldn’t expect it. Do you think this is a revival of the Left in the West? Or do you think it’s nothing?]S [It’s really hard to tell. I mean there’s certainly signs of it, and in the United States too, in fact we had a sit down strike in the United States not long ago, which is a very militant labor action. ...]N – Solutionhood [interview_chomsky]

(c) Instructional text layout Definition: Textual or visual features that help to understand the organi- zation of an instructional text.

• [Steps]S [Wash alone or with "like" clothing. It’s best to wash adults’ overalls alone, especially men’s. However, it is okay to wash just a few items with them, like blue jeans. ...]N – Preparation [whow_overalls]

3.7 Examples of Signal Anchoring

This section provides several examples of anchoring discourse signals in different relations that correspond to different types of signals, as well as instances of unsignaled relations and unanchored signals. Certain decisions made here have been explained in detail in Section 3.4, though they are open for discussion.

3.7.1 Reference Features

According to the RST-SC Annotation Manual [3], the type Reference is defined as pronouns and other referential expressions serving as signals. There are four subtypes of reference: personal reference, which includes pronouns, determiners, and possessive pronouns; demonstrative reference, such as demonstrative determiners, demonstrative pronouns, and demonstrative adverbs (i.e. here, there, now, then); comparative reference; and propositional reference. For instance, in the following example (1), one of the signals indicating the relation Justify is the pronouns them and they in the satellite span, resolved by the antecedent arrogant people in the nucleus span. Thus, these three token spans are anchored.

(1) [Arrogant people have an extremely strong need to look good.]N [When you make them look bad - even if it is the slightest offense - they will usually be very mad at you.]S – Justify [whow_arrogant]

3.7.2 Lexical Features

According to the RST-SC Annotation Manual [3], lexical features include the use of indicative words or phrases (i.e. indicative word) and short tensed clauses (i.e. alternate expression) such as that is, that means, the result is that, etc. For instance, in the following example, the relation Justify between the nucleus and satellite spans is indicated by three lexical signals, all belonging to the subtype indicative word and corresponding to the tokens noteworthy, legacy, and outstanding respectively.

(2) [In the end, these choices provide the greatest number of people with the best opportunity to share in the history and accomplishments of NASA’s remarkable Space Shuttle Program.]N [These facilities we’ve chosen have a noteworthy legacy of preserving space artifacts and providing outstanding access to U.S. and international visitors.]S – Justify [news_nasa]

3.7.3 Semantic Features

Semantic features are complex in nature due to their various subtypes and their variability. According to the RST-SC Annotation Manual [3], if an entity is introduced in one span and the entity (or an equivalent referring expression) is repeated in the other span, then this is a case of repetition, and therefore all instances of the entity relevant to the relation in question are anchored in the text. On the other hand, lexical chain is more complicated. In general, lexical chains are annotated for words with the same lemma or for words or phrases that are semantically related. Moreover, objects or entities annotated for the feature repetition are excluded from lexical chains. Another characteristic of lexical chains is that the words or phrases annotated as lexical chains are open to different syntactic categories. For instance, the following example shows that the relation Restatement is signaled by two different signals: one is the

repetition of the pronoun they in both spans, and the other is a lexical chain item corresponding to the phrase a lot of in the nucleus span and quantity in the satellite span respectively.

(3) [They compensate for this by creating the impression that they have a lot of friends –]N [they have a ‘quantity, not quality’ mentality.]S – Restatement [whow_arrogant]

Another frequently anchored semantic feature is meronymy, which means that words or phrases in the respective spans are in a meronymy relationship, according to the RST-SC Annotation Manual [3]. That is, a collective noun in one span is instantiated by one or more concrete instances in the other span. For instance, the noun phrase The effects in the satellite span is represented by four concrete instances in the nucleus span: loss of IQ, learning difficulties, hearing difficulties, and other neurological problems.

(4) ["The effects of iodine deficiency are dependent upon how severe it is and when it occurs.]S [So if we go to the pregnant woman, she doesn’t get enough iodine, she won’t make enough thyroid hormone, and the foetus won’t get the amount of thyroid hormone it needs for adequate and proper development of the brain, so you’ll then see consequences being loss of IQ, learning difficulties, hearing difficulties and other neurological problems," Professor Eastman said.]N – Background [news_iodine]

3.7.4 Morphological Features

Morphological features consist of the existing feature tense and the newly proposed feature modality. It is not necessary to automatically annotate every verb in the source and target spans; only occurrences of tenses or modal verbs that matter for the relations being signaled should be annotated. In particular, tenses and modal verbs in relative clauses are often not relevant to tense signals affecting the main clause. Moreover, for all tense/aspect/mood signals, the entire verbal complex should be annotated, creating parity between simple lexical verbs, periphrastic tenses, and passives. As shown in the following example, the phrase will take the place of, indicating a future tense, is expressed by the auxiliary will and a light verb construction take the place of in the nucleus span, whereas the phrase has been retired in the satellite span involves a periphrastic tense as well as a passive, and therefore both the auxiliary verbs has and been and the lexical verb retired are annotated.

(5) [Space Shuttle Discovery will take the place of Enterprise at the Udvar-Hazy Center.]N [Discovery has already been retired following the completion of STS-133 last month, its 39th mission.]S – Background [news_nasa]

3.7.5 Syntactic Features

According to the RST-SC Annotation Manual [3], syntactic features include the following subtypes: relative clause, infinitival clause, present participial clause, past participial clause, imperative clause, interrupted matrix clause, parallel syntactic construction, reported speech, subject auxiliary inversion, and nominal modifier. These subtypes are typically straightforward and thus easy to annotate. For instance, relative pronouns (i.e. which, that, who, etc.) are annotated for instances of relative clause. However, when a relative pronoun is absent, the signal remains unanchored. Gerund and past participial forms of verbs are annotated for instances of present/past participial clause. The subtype parallel syntactic construction is usually used in a combined signal parallel syntactic construction + lexical chain. As such, identical or semantically related spans of tokens and the following lexical item(s) are annotated. For instance,

there is a Joint relation between each span in the following example. The first three tokens in each span form a parallel construction and are semantically related at the same time.

(6) [5] Sociologists have explored the adverse consequences of discrimination; [6] psychologists have examined the mental processes that underpin conscious and unconscious biases; [7] neuroscientists have examined the neurobiological underpinnings of discrimination; and [8] evolutionary theorists have explored the various ways that in-group/out-group biases emerged across the history of our species. – Joint [academic_discrimination]

3.7.6 Genre Features

So far we have been looking at signaled and anchored signals. However, certain signals may not be anchored to tokens in texts, such as the genre features. This is not to claim, however, that genre features are never anchorable, as shown and mentioned in Section 3.6.

3.7.7 Graphical Features

Graphical features refer to punctuation marks that are considered useful signals, such as colons, semicolons, dashes, and parentheses, as well as numbered lists or bullet items occurring in sequential order [3]. For instance, in addition to the lexical signals of the relation Restatement mentioned in example (3) above, another signal is the dash (–) at the end of the nucleus span, as shown in example (7) below.

(7) [They compensate for this by creating the impression that they have a lot of friends –]N [they have a ‘quantity, not quality’ mentality.]S – Restatement [whow_arrogant]

3.7.8 Numerical Features

There is only one subtype under this category, called same count, meaning that the number of objects, entities, or propositions represented by a word (e.g., five, two) in one span is equal to the numerical count of those objects, entities, or propositions present in the other span. For instance, in the following example, the word Two in the satellite span is equal to the numerical count of the propositions in the nucleus span, which contains two nuclei marked as n1 and n2 in a multinuclear relation.

(8) [Two Parts:]S [[Getting the Material Right]n1 [Getting the Delivery Right]n2]N – Preparation [whow_joke]

Chapter 4

Results and Analysis

4.1 Overview

This pilot study annotated 12 documents with 11,145 tokens across four different genres selected from the GUM corpus. Academic articles, how-to guides and news are written texts, while interviews are spoken language. Generally speaking, all 20 relations used in the GUM corpus are signaled and anchored. However, this does not mean that all occurrences of these relations are signaled and anchored. For instance, the first occurrence of a multinuclear relation is not signaled and anchored due to the design of the annotation scheme (see Section 3.4 for more details). Additionally, there are several signaled but unanchored relations, as shown in Table 4.1. In particular, the 5 unsignaled instances of the relation Joint result from the fact that they are the first occurrences of the multinuclear relation Joint and are therefore not annotated, as explained in Section 3.4.3. Additionally, the unanchored signals are usually associated with high-level discourse relations and usually correspond to genre features such as interview layout in interviews, where the conversation is constructed as a question-answer scheme and thus rarely anchored to tokens. With regard to the distribution of the signal types found in these 12 documents, the 16 distinct signal types amounted to 1263 signal instances, as shown in Table 4.2. There are only 204 instances of DMs out of all 1263 annotated signal types (16.15%) as opposed to 1059 instances (83.85%) of other signal types. In the RST Signalling

Table 4.1: Distribution of Unanchored Relations.

unanchored relations    frequency    percentage (%)
Preparation             28           22.2
Solutionhood            11           32.35
Joint                   5            1.92
Background              3            2.68
Cause                   1            4
Evidence                1            4.2
Motivation              1            4.76

Corpus, DM accounts for 13.34% of the annotated signal instances as opposed to 81.36%1 of other signal types [4]. The last column in Table 4.2 shows how the distribution of each signal type found in this pilot study compares to RST-SC. The reason why the total percentage of the last column does not amount to 100% is that not all the signal types found in RST-SC are present in this study, such as the combined signal type Graphical + syntactic. Moreover, since Textual and Visual are first proposed in this study, no corresponding results can be found in RST-SC. Finally, the category Unsure2 used in RST-SC is excluded from this project.
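The proportions above are simple arithmetic over the counts reported in Table 4.2. As a sanity-check sketch, the snippet below reproduces the DM share (16.15%) from this study's counts; only the four largest types are listed individually, with the remainder folded into an "Other" bucket so that the total matches the reported 1263 instances.

```python
from collections import Counter

# Signal-instance counts from Table 4.2 of this study (top types shown;
# the remaining types are folded into "Other" so the total equals 1263)
counts = Counter({"Semantic": 563, "DM": 204, "Lexical": 156, "Reference": 71})
counts["Other"] = 1263 - sum(counts.values())

total = sum(counts.values())
for sig_type, freq in counts.most_common():
    print(f"{sig_type:10s} {freq:5d}  {100 * freq / total:6.2f}%")

dm_pct = 100 * counts["DM"] / total          # share of discourse markers
non_dm_pct = 100 * (total - counts["DM"]) / total  # share of all other signal types
```

The DM/non-DM split (16.15% vs. 83.85%) follows directly, underscoring how little of the signal inventory explicit discourse markers cover.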

4.2 Distribution of Signals regarding Relations

Table 4.3 provides the distribution of discourse signals with regard to the relations they signal. The first column lists all the relations used in the GUM corpus. The second column shows the number of signal instances associated with each relation. The third and fourth columns list the most signaled and anchored type and subtype

1This result excludes the category Unsure used in RST-SC.
2The category Unsure means that no potential signals were found or specified.

Table 4.2: Distribution of Signal Types and Its Comparison to RST-SC.

signal_type               frequency    percentage (%)    RST-SC (%)
Semantic                  563          44.58             24.8
DM                        204          16.15             13.34
Lexical                   156          12.35             3.89
Reference                 71           5.62              2.00
Semantic + syntactic      51           4.04              7.36
Graphical                 46           3.64              3.46
Syntactic                 44           3.48              29.77
Genre                     30           2.38              3.22
Morphological             26           2.06              1.07
Syntactic + semantic      25           1.98              1.40
Textual                   24           1.90              N/A
Numerical                 8            0.63              0.09
Visual                    7            0.55              N/A
Reference + syntactic     3            0.24              1.86
Lexical + syntactic       3            0.24              0.41
Syntactic + positional    2            0.16              0.23
Total                     1263         100.00            92.9

respectively. DMs are the most frequent signals for five of the relations – Condition, Concession, Antithesis, Cause, and Circumstance – which only account for a small portion of the relations. The results also show a very strong dichotomy between relations signaled by DMs and those signaled by semantic-related signals: the rest of the relations are all most frequently signaled by the type Semantic or Lexical, which, broadly speaking, are associated with open-class words as opposed to the functional or syntactic categories of DMs. Furthermore, the type Lexical and its subtype indicative word seem to be indicative of the relations Justify and Evaluation, as shown in Figures 4.1 and 4.2. This makes sense given the nature of these relations, which require the writer's or speaker's opinions or inclinations regarding the subject under discussion; these are usually expressed

Table 4.3: Distribution of Most Common Signals regarding Relations.

relations       signaled instances    signal type       signal subtype
Joint           260                   Semantic (147)    lexical chain (96)
Elaboration     243                   Semantic (140)    lexical chain (96)
Preparation     129                   Semantic (54)     lexical chain (30)
Background      112                   Semantic (62)     lexical chain (42)
Contrast        68                    Semantic (39)     lexical chain (31)
Restatement     60                    Semantic (34)     lexical chain (28)
Concession      49                    DM (23)           DM (23)
Justify         49                    Lexical (25)      indicative word (23)
Evaluation      42                    Lexical (31)      indicative word (31)
Solutionhood    34                    Semantic (12)     lexical chain (5)
Condition       31                    DM (25)           DM (25)
Antithesis      31                    DM (12)           DM (12)
Sequence        26                    Semantic (7)      lexical chain (6)
Cause           25                    DM (12)           DM (12)
Evidence        24                    Semantic (8)      lexical chain (7)
Result          21                    Semantic (8)      lexical chain (7)
Motivation      21                    Semantic (8)      lexical chain (7)
Purpose         21                    Syntactic (9)     infinitival clause (7)
Circumstance    20                    DM (11)           DM (11)

through positive or negative adjectives (e.g. serious, outstanding, disappointed) and other syntactic categories such as nouns/noun phrases (e.g. legacy, excitement, an unending war) and verb phrases (e.g. make sure, stand for). Likewise, words like Tips, Steps, and Warnings are indicative items addressing communicative needs specific to a genre, in this case the how-to guides. It is also worth pointing out that

Evaluation is the only discourse relation that is not signaled by any DMs in this dataset. Another way of seeing these signals is to examine their associated tokens in texts, regardless of the signal types and subtypes. Table 4.4 lists some representative,

Figure 4.1: Evaluation: Distribution of Signals.

Figure 4.2: Justify: Distribution of Signals.

generic/ambiguous (in boldface), and coincidental (in italic) tokens that correspond to the relations they signal. Each item is delimited by a comma; the & symbol between tokens in one item means that the signal consists of a word pair in the respective spans. The number in parentheses is the count of that item attested in this project; if no number is indicated, then that token span only occurs once. The selection of these single-occurrence items is random in order to better reflect their relevance in

contexts. For instance, lexical items like Professor Eastman in Joint, NASA, IE6, Professor Chomsky in Elaboration, Bob McDonnell in Background, and NATO in Restatement appear to be coincidental because they are the topics or subjects being discussed in the articles. These results parallel those demonstrated in Zeldes (2018:180), which employed a frequency-based approach to show the most distinctive lexemes for some relations in GUM [33].
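A frequency-based notion of "distinctive lexemes" along these lines can be sketched as a simple frequency-ratio ranking. The relation names below come from this study, but the token counts are invented for illustration, and the smoothed ratio is only one of several possible measures (the method actually used in Zeldes [33] may differ).

```python
from collections import Counter

# Hypothetical token counts per relation (illustrative, not the real GUM figures)
tokens_by_relation = {
    "Condition": Counter({"if": 12, "the": 30, "unless": 2}),
    "Cause": Counter({"because": 8, "the": 25, "since": 3}),
}

def distinctive(relation, tables, smoothing=1.0):
    """Rank lexemes by how much more frequent they are in one relation
    than in all others (a smoothed frequency-ratio measure)."""
    inside = tables[relation]
    outside = Counter()
    for rel, tab in tables.items():
        if rel != relation:
            outside.update(tab)
    n_in, n_out = sum(inside.values()), sum(outside.values())

    def ratio(w):
        p_in = (inside[w] + smoothing) / (n_in + smoothing)
        p_out = (outside[w] + smoothing) / (n_out + smoothing)
        return p_in / p_out

    return sorted(inside, key=ratio, reverse=True)

# Generic function words like "the" sink; relation-typical markers rise to the top
print(distinctive("Condition", tokens_by_relation)[0])  # 'if'
print(distinctive("Cause", tokens_by_relation)[0])      # 'because'
```

The ranking separates relation-typical items (if, because) from tokens that are merely frequent everywhere, mirroring the distinction drawn above between indicative and coincidental tokens.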

Even though some relations are frequently signaled by DMs, such as Condition, Antithesis, and Joint, most of the signals are highly lexicalized and indicative of the relations they signal. For instance, signal tokens associated with the relation

Restatement tend to be repetitions or paraphrases of tokens. Likewise, most of the tokens associated with the relation Evaluation are strong positive or negative expressions. As for the relation Sequence, in addition to indicative tokens such as First & Second and temporal expressions such as later, an indicative word pair can also suggest a sequential relationship, as in example (1). More interestingly, world knowledge such as the order of the presidents of the United States of America (i.e. that Bush served as president of the United States before Obama) is also an indicative signal for Sequence, as shown in Figure 4.3.

Table 4.4: Examples of Anchored Tokens across Relations.

Joint: ; (16), and (15), also (10), Professor Eastman (3), he (3), they (2)
Elaboration: Image (6), based on (3), – (3), NASA (3), IE6 (3), More specifically (2), Additionally (2), also (2), they (2), it (2), Professor Chomsky (2)
Preparation: : (6), How to (2), Know (2), Steps (2), Getting (2)
Background: Therefore, Indeed, build on, previous, Bob McDonnell, Looking back
Contrast: but (9)/But (4), or (2), Plastic-type twist ties & paper-type twist ties, in 2009 & today, deteriorate & hold up, however, bad & nice, yet
Restatement: They & they (2), NATO (2), In other words, realistic & real, and, rehashed & retell, it means that, Microsoft & Microsoft
Concession: but (10), However (3), The problem is (2), though (2), at least, While, It is (also) possible that, however, best & okay, Albeit, despite, if, still
Justify: because (2), an affront & disappointed deeply, excitement, share, the straps, The is that, any reason, so, since, confirm, inspire
Evaluation: very serious, nationally representative, a frightening idea, a true win, an important addition, issue, This study & It, misguided, pain
Solutionhood: Well (2), arrogant, :, So, why, and, Darfur, How, I think, Determine
Condition: If (12)/if (10), even if, unless, depends on, –, once, when, until
Antithesis: but (5)/But, instead of (2), In fact, counteract, won’t, rather than, Or, not, the Arabs, however, better & worst
Sequence: and (3), First & Second, examined & assessed, later, Bush & Obama, initial, digital humanities, A year later, stop & update
Cause: because (3), suggests, due to, compensate for, as, since/Since, arrogant people, in turn, given, brain damage, as such
Evidence: ( ) (2), see (2), According to, because, as, –, and, Arabs & Turkey, Because of, discrimination, biases, The report states that, Thus
Result: so (3), and (2), meaning (2), so that, capturing, thus, putting, the χ2 statistic, make
Motivation: will (2), easier, the pockets, All it takes is, so, last longer
Purpose: to (6)/To, in order to (3)/In order to (2), so (2), enable, The aim
Circumstance: when (4)/When (2), On March 13, Whether, As/as, With, in his MIT office, the bigger & the harsher

Figure 4.3: An Example of Signaling Sequence Using World Knowledge.

(1) [On a new website, "The Internet Explorer 6 Countdown", Microsoft has launched an aggressive campaign to persuade users to stop using IE6]n1 [and update to a newer IE.]n2 – Sequence [news_ie9]

In addition, it is worth noting that some tokens are frequent signals of several relations, which makes their use highly ambiguous. For instance, the coordinating conjunction and appears in Joint, Restatement, Sequence, and Result; similarly, the coordinating conjunction but appears in Contrast, Concession, and Antithesis. The subordinating conjunctions since and because serve as signals of Justify, Cause, and Evidence. Such tokens pose difficulties for the

validity of discourse signals. As pointed out by Zeldes [33], a word like and is extremely ambiguous overall, since it appears very frequently in general and is attested in all discourse functions. However, some ‘and’s are more useful as signals than others: adnominal ‘and’ (example (2)) is usually less interesting than intersentential ‘and’ (example (3)) and sentence-initial ‘and’ (example (4)).

(2) The owners, [William and Margie Hammack], are luckier than any others.3 – Elaboration-Additional

(3) [Germany alone had virtually destroyed Russia, twice,]n1 [and Germany backed by a hostile military alliance, centered in the most phenomenal military power in history, that’s a real threat.]n2 – Joint [interview_chomsky]

(4) [Take say on Obama, Obama’s national security advisor James Jones former Marine commandant is on record of favoring expansion of NATO to the south and the east, further expansion of NATO, and also making it an intervention force.]n1 [And the head of NATO, Hoop Scheffer, he has explained that NATO must take on responsibility for ensuring the security of pipelines and sea lanes, that is NATO must be a guarantor of energy supplies for the West.]n2 – Joint [interview_chomsky]

Hence, it would be beneficial to develop computational models that score and rank signal words based not just on how often they proportionally occur with a relation, but also on how (un)ambiguous they are in context. In other words, if there are clues

3 The GUM Corpus’s guidelines do not segment adnominal ‘and’; as a result, no such examples can be found in GUM. Although RST-DT does not segment adnominal ‘and’ either, this example is chosen from the RST-DT corpus [2] for illustration due to the apposition. Note that the relation inventory also differs.

in the environment that allow us to safely exclude some occurrences of a word, then those instances should not be taken into consideration when measuring its ‘signalyness’.
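Concretely, one way to operationalize such a ‘signalyness’ score is to discount a token’s conditional probability of marking a relation by its normalized entropy across relations. The sketch below is only an illustration of this idea, not part of the annotation pipeline described in this thesis; the function name and the toy counts (loosely modeled on Table 4.4) are hypothetical.

```python
from collections import Counter
from math import log2

def signalyness(relation_counts: Counter) -> dict:
    """Score = P(relation | token) * (1 - normalized entropy of the token's
    relation distribution). Unambiguous signals keep their full probability;
    tokens spread evenly across relations are discounted toward zero."""
    total = sum(relation_counts.values())
    probs = {rel: n / total for rel, n in relation_counts.items()}
    if len(probs) > 1:
        h = -sum(p * log2(p) for p in probs.values()) / log2(len(probs))
    else:
        h = 0.0  # a token attested with only one relation is maximally clear
    return {rel: p * (1 - h) for rel, p in probs.items()}

# 'and' is spread across several relations; 'unless' marks only Condition.
and_counts = Counter({"Joint": 15, "Restatement": 1, "Sequence": 3, "Result": 2})
unless_counts = Counter({"Condition": 1})

print(signalyness(and_counts))     # every score is heavily discounted
print(signalyness(unless_counts))  # {'Condition': 1.0}
```

On these toy counts, and receives a low score for every relation it co-occurs with, while unless keeps its full score for Condition, matching the intuition that raw co-occurrence frequency alone overstates the usefulness of highly ambiguous DMs.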

4.3 Distribution of Signals across Genres

One of the main objectives of the current project is to investigate the distribution of discourse signals across genres. Thus, this section tries to identify genre-specific signals and tokens. First of all, Table 4.5 shows the distribution of signaled relations in different genres: the number preceding the vertical line is the count of signals indicating the relation, and the percentage following the vertical line is the corresponding proportional frequency. The label N/A indicates that no such relation is present in the sample from that genre. As can be seen from Table 4.5, how-to guides involve the most signals (407 instances), followed by interviews, academic articles, and news. It is surprising that the news articles selected from the GUM corpus are not as frequently signaled as those in the RST Signalling Corpus, which could be attributed to two reasons. Firstly, the source data is different: the news articles in the GUM corpus are from Wikinews, while those in RST-SC are Wall Street Journal articles. Secondly, RST-DT has finer-grained relations (78 relations as opposed to the 20 used in GUM) and segmentation guidelines, thereby offering more opportunities for signaled relations. Moreover, it is clear

that Joint and Elaboration are the most frequently signaled relations across all four genres, followed by Preparation in how-to guides and interviews and Background in academic articles and news. This is expected in that these four relations typically operate at high levels of the discourse structure and thus involve more text spans with more potential signals.

Table 4.5: Distribution of Signaled Relations across Genres.

relations      academic      how-to guides  news          interview
Joint          65 | 23.13%   76 | 18.67%    65 | 25.39%   54 | 16.77%
Elaboration    61 | 21.71%   79 | 19.41%    53 | 20.70%   50 | 15.53%
Preparation    25 | 8.90%    55 | 13.51%    15 | 5.86%    34 | 10.56%
Background     33 | 11.74%   24 | 5.90%     28 | 10.94%   27 | 8.39%
Contrast       17 | 6.05%    21 | 5.16%     19 | 7.42%    11 | 3.42%
Restatement    N/A           20 | 4.91%     11 | 4.30%    29 | 9.01%
Concession     17 | 6.05%    13 | 3.19%     10 | 3.91%    9 | 2.80%
Justify        1 | 0.36%     11 | 2.70%     15 | 5.86%    22 | 6.83%
Evaluation     10 | 3.56%    12 | 2.95%     7 | 2.73%     13 | 4.04%
Solutionhood   2 | 0.71%     8 | 1.97%      N/A           24 | 7.45%
Condition      N/A           25 | 6.14%     3 | 1.17%     3 | 0.93%
Antithesis     3 | 1.07%     10 | 2.46%     1 | 0.39%     17 | 5.28%
Sequence       12 | 4.27%    4 | 0.98%      5 | 1.95%     5 | 1.55%
Cause          6 | 2.14%     12 | 2.95%     6 | 2.34%     1 | 0.31%
Evidence       10 | 3.56%    N/A            5 | 1.95%     9 | 2.80%
Result         3 | 1.07%     6 | 1.47%      6 | 2.34%     6 | 1.86%
Motivation     N/A           21 | 5.16%     N/A           N/A
Purpose        14 | 4.98%    5 | 1.23%      N/A           2 | 0.62%
Circumstance   2 | 0.36%     5 | 1.23%      7 | 2.73%     6 | 1.86%
Total          281 | 100%    407 | 100%     256 | 100%    322 | 100%
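The proportional frequencies in Table 4.5 follow directly from the raw counts: each percentage is a relation's signal count divided by the genre's total. A minimal check in Python (the counts are copied from the table's academic column; the variable names are hypothetical):

```python
# Reproducing the proportional frequencies in Table 4.5: each relation's
# percentage is its signal count divided by the genre's total signal count.
# Only a few rows of the academic column are shown here.
academic_counts = {"Joint": 65, "Elaboration": 61, "Preparation": 25}
ACADEMIC_TOTAL = 281  # total signals in the academic sample (Table 4.5)

for relation, count in academic_counts.items():
    print(f"{relation}: {count} | {count / ACADEMIC_TOTAL:.2%}")
# Joint: 65 | 23.13%
# Elaboration: 61 | 21.71%
# Preparation: 25 | 8.90%
```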

Table 4.6: Examples of Anchored Tokens across Genres.

academic: discrimination (16), ; (11), and (8), : (5), to (5), but (5), also (5), though (3), hypothesized (3), based on (3), First & Second (3), however (3), because (2), More specifically (2), in/In order to (2), as (2), () (2), see (2), when (2), posited, expected, capturing, Albeit
how-to guides: but (10), If (9)/if (7), ; (5), and (4), also (4), arrogant people (9), How (7), : (3), so (3), – (3), But (3), Know (3), Steps (2), Move, Challenge, Warnings, In other words, Empty, Fasten, Tips, Wash
news: IE6 (9), NASA (5), and (4), but (4)/But (2), Image (4), market (4), However (2), the major source, the Udvar-Hazy Center, in 2009
interviews: Sarvis (14), What (12), Why (11), and (8), Noam Chomsky (8), but (5), Wikinews (4), because (3), interview (2), – (2), Well (2), So (2), Which

Secondly, Table 4.6 lists some signal tokens that are indicative of genres (in boldface) as well as generic and coincidental ones (in italic). Each item is separated by a comma; the number in parentheses is the count of that item attested in this dataset; if no number is indicated, that item occurs only once. These single-occurrence items were selected at random to reflect their relevance in context. Even though discourse markers (e.g. ‘and’, ‘but’) are present in all four genres, no associations can be established between these DMs and the genres they appear in. Moreover, as can be seen from Table 4.6, graphical features such as semicolons, colons, dashes, and parentheses play an important role in relation signaling. Although these punctuation marks do not seem to be indicative of any particular genre, academic articles tend to use them more than the other genres do. Even though some words or phrases are highly frequent, such as discrimination in academic articles, arrogant people in how-to guides, IE6 in news, and Sarvis in interviews, they appear to be coincidental, as these are the subjects or topics discussed in the articles.

Academic articles. Academic writing is typically formal, making the annotation more straightforward. The results from this dataset suggest that academic articles contain signals from diverse categories. As shown in Table 4.6, in addition to the typical discourse markers and the graphical features mentioned above, several lexical items are strong indicators of the genre. For instance, the verb hypothesized and its synonym posited are indicative in that researchers and scholars tend to use them in research papers to present their hypotheses. Similarly, the word expected is frequently used by researchers to convey the expectations that follow from their hypotheses. The phrase based on is frequently used to elaborate on the subject matter. Furthermore, Table 4.6 also demonstrates that academic articles tend to use ordinal expressions such as First and Second to structure the

text. Last but not least, the word Albeit is an indicative signal of Concession that, of the four genres examined here, is used only in academic writing, suggesting that it could also be indicative of the genre due to the register it is associated with.

How-to Guides. As demonstrated in Table 4.5, how-to guides are the most heavily signaled genre in this study. This is because instructional texts are highly organized and their cue phrases are usually easy to identify. As shown in Table 4.6, there are several indicative signal tokens such as the wh-word How, an essential element in instructional texts. Besides, words like Steps, Tips, and Warnings are strongly associated with the genre due to its communicative needs. Another distinct feature of how-to guides is the use of imperative clauses, which correspond to capitalized clause-initial verbs (e.g. Know, Empty, Fasten, Wash, Move): instructional texts give instructions on accomplishing certain tasks, and imperative clauses convey such information in a straightforward way.

News. Like academic articles, news articles are typically well organized and structured. Even though RST-SC is built on WSJ news articles, it is worth investigating the distribution of signals in the news articles in the GUM corpus, as they are not from the same source. As briefly mentioned at the beginning of this section, the news articles selected in this project are not as highly signaled as those in RST-SC. In addition to the different source data, another reason is that the relation inventory used in GUM is smaller than the one used in RST-SC; as a result, certain information is lost. For instance, the relation Attribution is signaled in 3,061 of 3,070 occurrences (99.71%) in RST-SC, corresponding to the type syntactic and its subtype reported speech, which does not occur in this dataset. However, we

do have some indicative signals such as market and the major source.

Interviews. Interviews are the most difficult genre to annotate in this project, for two main reasons. Firstly, interviews are (partly) spoken language; as a result, they are not as organized as news or academic articles and are harder to follow. Secondly, the layout of an interview is fundamentally different from that of the previous three written genres. Therefore, it is important to capture the discrepancy between spoken and written language. For instance, the relation Solutionhood seems specific to interviews, and most of its signal instances remain unanchored (i.e. 11 instances), likely because the question mark (?) is ignored in the current annotation scheme. As can be seen from Table 4.6, there are many wh-words such as What and Why. These can be used to identify interviews in that they formulate the question-answer scheme. Moreover, interviewers and interviewees are important constituents of an interview, which explains the high frequencies of the two interviewees Sarvis and Noam Chomsky and the interviewer Wikinews. Another distinctive feature shown by the signals in this dataset is the use of spoken expressions such as Well and So, which rarely appear in written texts.

4.4 Interim Conclusion

In this chapter, we examined the distributions of signals across relations (Section 4.2) and genres (Section 4.3) respectively. Generally speaking, discourse markers are not only ambiguous but also inadequate as discourse signals; most signals are open-class lexical items. More specifically, both perspectives have shown that some signals are highly indicative while others are generic or ambiguous. Thus, in order to obtain more valid discourse signals and parse discourse relations effectively, we need to

develop models that take signals’ surrounding contexts into account to disambiguate these signals. Indicative signals can be broadly categorized into three groups: register-related, communicative-need-related, and semantics-related. The first two address genre identification, whereas the last addresses relation classification. Words like Albeit are more likely to appear in academic papers than in other genres due to the register they are associated with; words like Steps, Tips, and Warnings are more likely to appear in instructional texts due to the communicative effect they intend to achieve. Semantics-related signals play a crucial role in classifying relations, as the semantic associations between words or phrases are less ambiguous cues, thereby supplementing the inadequacy of discourse markers.

Chapter 5

Conclusion

The current pilot study attempts to anchor discourse signals across genres by adapting the hierarchical taxonomy of signals used in RST-SC, which covers only Wall Street Journal articles. In this study, 12 documents with 11,145 tokens across four different genres selected from the GUM corpus were annotated for discourse signals. The taxonomy of signals used in this project is based on the one in RST-SC, with additional types and subtypes proposed in order to better represent different genres. The results show that different genres have their own indicative signals in addition to generic ones. Moreover, although some relations such as Condition and Antithesis are mainly signaled by DMs, most relations are indicated by lexical items that are highly indicative and representative. The current study is limited to the RST annotation layer in the GUM corpus; however, it is worth investigating the linguistic representation of these signals through other annotation layers in the GUM corpus such as coreference and bridging, which could be very useful resources for constructing theoretical models of discourse. In addition, the current project provides a qualitative analysis of the validity of discourse signals by examining the annotated signal tokens across relations and genres respectively, which provides insights into the disambiguation of generic signals and paves the way for designing a more informative mechanism to quantitatively measure the validity of discourse signals. Another limitation of this project is the selection of genres. Since this is the first attempt to identify and anchor signals outside RST-DT, the selection of genres is

constrained. It would be very interesting to see how signals are distributed in genres like Reddit forum discussions and fiction, which are assumed to be less coherent and organized texts. In addition, future work could also delve into the distribution of the locations of these signals, i.e. whether a given signal occurs in the satellite span or the nucleus span, which is of great importance for constructing a more informative and comprehensive neural network model.

Bibliography

[1] Nicholas Asher and Alex Lascarides. Logics of Conversation. Cambridge University Press, 2003.

[2] Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. In Current and New Directions in Discourse and Dialogue, Text, Speech and Language Technology 22, pages 85–112. Kluwer, Dordrecht, 2003.

[3] Debopam Das and Maite Taboada. RST Signalling Corpus Annotation Manual, 2014.

[4] Debopam Das and Maite Taboada. Signalling of Coherence Relations in Discourse, Beyond Discourse Markers. Discourse Processes, pages 1–29, 2017.

[5] Debopam Das and Maite Taboada. RST Signalling Corpus: A Corpus of Signals of Coherence Relations. Language Resources and Evaluation, 52(1):149–184, 2018.

[6] John R. Firth. A Synopsis of Linguistic Theory 1930-1955. Studies in Linguistic Analysis, pages 1–32, 1957.

[7] Luke Gessler, Yang Liu, and Amir Zeldes. A Discourse Signal Annotation System for RST Trees. In Proceedings of 7th Workshop on Discourse Relation Parsing and Treebanking (DISRPT) at NAACL-HLT, Minneapolis, MN, 2019. (To Appear).

[8] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017.

[9] Barbara J Grosz and Candace L Sidner. Attention, Intentions, and the Structure of Discourse. Computational Linguistics, 12(3):175–204, 1986.

[10] Eduard H Hovy and Elisabeth Maier. Parsimonious or Profligate: How Many and Which Discourse Structure Relations. Discourse Processes, 1997.

[11] Andrew Kehler. Coherence, Reference, and the Theory of Grammar. CSLI Publications, Stanford, CA, 2002.

[12] Walter Kintsch. The Role of Knowledge in Discourse Comprehension: A Construction-Integration Model. Psychological Review, 95(2):163, 1988.

[13] Alistair Knott and Robert Dale. Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes, 18(1):35–62, 1994.

[14] Alistair Knott and Ted Sanders. The Classification of Coherence Relations and their Linguistic Markers: An Exploration of Two Languages. Journal of Pragmatics, 30(2):135–175, 1998.

[15] Yang Liu and Amir Zeldes. Discourse relations and signaling information: Anchoring discourse signals in RST-DT. Proceedings of the Society for Computation in Linguistics, 2(1):314–317, 2019.

[16] Xuezhe Ma and Eduard Hovy. End-to-End Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1064–1074, 2016.

[17] William C Mann and Sandra A Thompson. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281, 1988.

[18] Daniel Marcu. Building up Rhetorical Structure Trees. In Proceedings of the National Conference on Artificial Intelligence, pages 1069–1074, 1996.

[19] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Special Issue on Using Large Corpora, Computational Linguistics, 19(2):313–330, 1993.

[20] James R Martin. English Text: System and Structure. John Benjamins Publishing, 1992.

[21] Michael O’Donnell. The UAM CorpusTool: Software for Corpus Annotation and Exploration. In Proceedings of the XXVI congreso de AESLA, pages 3–5, Almeria, Spain, 2008.

[22] Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), pages 2961–2968, Marrakesh, Morocco, 2008.

[23] Rashmi Prasad, Aravind Joshi, and Bonnie Webber. Realization of Discourse Relations by Other Means: Alternative Lexicalizations. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1023–1031. Association for Computational Linguistics, 2010.

[24] Rashmi Prasad, Bonnie Webber, and Aravind Joshi. Reflections on the Penn Discourse Treebank, Comparable Corpora, and Complementary Annotation. Computational Linguistics, 40(4):921–950, 2014.

[25] Ted JM Sanders, Wilbert PM Spooren, and Leo GM Noordman. Toward a Taxonomy of Coherence Relations. Discourse Processes, 15(1):1–35, 1992.

[26] Manfred Stede. Discourse Processing. Synthesis Lectures on Human Language Technologies, 4(3):1–165, 2011.

[27] Maite Taboada and Debopam Das. Annotation upon Annotation: Adding Signalling Information to a Corpus of Discourse Relations. Dialogue & Discourse, 4(2):249–281, 2013.

[28] Maite Taboada and Julia Lavid. Rhetorical and Thematic Patterns in Scheduling Dialogues: A Generic Characterization. Functions of Language, 10(2):147–178, 2003.

[29] Maite Taboada and William C Mann. Applications of Rhetorical Structure Theory. Discourse Studies, 8(4):567–588, 2006.

[30] Maite Taboada and William C Mann. Rhetorical Structure Theory: Looking Back and Moving Ahead. Discourse Studies, 8(3):423–459, 2006.

[31] Amir Zeldes. rstWeb - A Browser-based Annotation Interface for Rhetorical Structure Theory and Discourse Relations. In Proceedings of NAACL-HLT 2016 System Demonstrations, pages 1–5, San Diego, CA, 2016.

[32] Amir Zeldes. The GUM Corpus: Creating Multilayer Resources in the Classroom. Language Resources and Evaluation, 51(3):581–612, 2017.

[33] Amir Zeldes. Multilayer Corpus Studies. Routledge Advances in Corpus Linguistics 22. Routledge, London, 2018.

[34] Amir Zeldes. A Neural Approach to Discourse Relation Signaling. Georgetown University Round Table (GURT) 2018: Approaches to Discourse, 2018.

[35] Shuo Zhang and Amir Zeldes. GitDox: A Linked Version Controlled Online XML Editor for Manuscript Transcription. In Proceedings of FLAIRS 2017, Special Track on Natural Language Processing of Ancient and other Low Resource Languages, pages 619–623, 2017.
