Realization of Discourse Relations by Other Means: Alternative Lexicalizations
Rashmi Prasad and Aravind Joshi Bonnie Webber University of Pennsylvania University of Edinburgh rjprasad,[email protected] [email protected]
Abstract This paper focusses on the problem of how to characterize and identify explicit signals of dis- Studies of discourse relations have not, in course relations, exemplified in Ex. (1). To re- the past, attempted to characterize what fer to all such signals, we use the term “discourse serves as evidence for them, beyond lists relation markers” (DRMs). Past research (e.g., of frozen expressions, or markers, drawn (Halliday and Hasan, 1976; Martin, 1992; Knott, from a few well-defined syntactic classes. 1996), among others) has assumed that DRMs In this paper, we describe how the lexical- are frozen or fixed expressions from a few well- ized discourse relation annotations of the defined syntactic classes, such as conjunctions, Penn Discourse Treebank (PDTB) led to adverbs, and prepositional phrases. Thus the lit- the discovery of a wide range of additional erature presents lists of DRMs, which researchers expressions, annotated as AltLex (alterna- try to make as complete as possible for their cho- tive lexicalizations) in the PDTB 2.0. Fur- sen language. In annotating lexicalized discourse ther analysis of AltLex annotation sug- relations of the Penn Discourse Treebank (Prasad gests that the set of markers is open- et al., 2008), this same assumption drove the ini- ended, and drawn from a wider variety tial phase of annotation. A list of “explicit con- of syntactic types than currently assumed. nectives” was collected from various sources and As a first attempt towards automatically provided to annotators, who then searched for identifying discourse relation markers, we these expressions in the text and annotated them, propose the use of syntactic paraphrase along with their arguments and senses. The same methods. assumption underlies methods for automatically 1 Introduction identifying DRMs (Pitler and Nenkova, 2009). Since expressions functioning as DRMs can also Discourse relations that hold between the content have non-DRM functions, the task is framed as of clauses and of sentences – including relations one of classifying given individual tokens as DRM of cause, contrast, elaboration, and temporal or- or not DRM. dering – are important for natural language pro- In this paper, we argue that placing such syn- cessing tasks that require sensitivity to more than tactic and lexical restrictions on DRMs limits just a single sentence, such as summarization, in- a proper understanding of discourse relations, formation extraction, and generation. In written which can be realized in other ways as well. For text, discourse relations have usually been con- example, one should recognize that the instantia- sidered to be signaled either explicitly, as lexical- tion (or exemplification) relation between the two ized with some word or phrase, or implicitly due sentences in Ex. (3) is explicitly signalled in the to adjacency. Thus, while the causal relation be- second sentence by the phrase Probably the most tween the situations described in the two clauses egregious example is, which is sufficient to ex- in Ex. (1) is signalled explicitly by the connective press the instantiation relation. As a result, the same relation is conveyed implic- (3) Typically, these laws seek to prevent executive itly in Ex. (2). branch officials from inquiring into whether cer- tain federal programs make any economic sense or (1) John was tired. As a result he left early. proposing more market-oriented alternatives to reg- (2) John was tired. He left early. ulations. Probably the most egregious example is a proviso in the appropriations bill for the executive markers of discourse relations. Following stan- office that prevents the president’s Office of Man- dard practice, a list of such markers – called “ex- agement and Budget from subjecting agricultural marketing orders to any cost-benefit scrutiny. plicit connectives” in the PDTB – was collected from various sources (Halliday and Hasan, 1976; Cases such as Ex. (3) show that identifying Martin, 1992; Knott, 1996; Forbes-Riley et al., DRMs cannot simply be a matter of preparing a 2006).2 These were provided to annotators, who list of fixed expressions and searching for them in then searched for these expressions in the corpus the text. We describe in Section 2 how we identi- and marked their arguments, senses, and attribu- fied other ways of expressing discourse relations tion.3 In the pilot phase of the annotation, we in the PDTB. In the current version of the cor- also went through several iterations of updating pus (PDTB 2.0.), they are labelled as AltLex (al- the list, as and when annotators reported seeing ternative lexicalizations), and are “discovered” as connectives that were not in the current list. Im- a result of our lexically driven annotation of dis- portantly, however, connectives were constrained course relations, including explicit as well as im- to come from a few well-defined syntactic classes: plicit relations. Further analysis of AltLex anno- • tations (Section 3) leads to the thesis that DRMs Subordinating conjunctions: e.g., because, are a lexically open-ended class of elements which although, when, while, since, if, as. may or may not belong to well-defined syntactic • Coordinating conjunctions: e.g., and, but, classes. The open-ended nature of DRMs is a so, either..or, neither..nor. challenge for their automated identification, and in Section 4, we point to some lessons we have • Prepositional phrases: e.g., as a result, on already learned from this annotation. Finally, we the one hand..on the other hand, insofar as, suggest that methods used for automatically gen- in comparison. erating candidate paraphrases may help to expand • adverbs: e.g., then, however, instead, yet, the set of recognized DRMs for English and for likewise, subsequently other languages as well (Section 5). Ex. (4) illustrates the annotation of an explicit 2 AltLex in the PDTB connective. (In all PDTB examples in the paper, Arg2 is indicated in boldface, Arg1 is in italics, The Penn Discourse Treebank (Prasad et al., the DRM is underlined, and the sense is provided 2008) constitutes the largest available resource of in parentheses at the end of the example.) lexically grounded annotations of discourse rela- tions, including both explicit and implicit rela- (4) U.S. Trust, a 136-year-old institution that is one of 1 the earliest high-net worth banks in the U.S., has tions. Discourse relations are assumed to have faced intensifying competition from other firms that two and only two arguments, called Arg1 and have established, and heavily promoted, private- Arg2. By convention, Arg2 is the argument syn- banking businesses of their own. As a result, U.S. Trust’s earnings have been hurt. (Contin- tactically associated with the relation, while Arg1 gency:Cause:Result) is the other argument. Each discourse relation is also annotated with one of the several senses in the After all explicit connectives in the list were PDTB hierarchical sense classification, as well as annotated, the next step was to identify implicit the attribution of the relation and its arguments. discourse relations. We assumed that such rela- In this section, we describe how the annotation tions are triggered by adjacency, and (because of methodology of the PDTB led to the identification resource limitations) considered only those that of the AltLex relations. held between sentences within the same para- Since one of the major goals of the annota- graph. Annotators were thus instructed to supply tion was to lexically ground each relation, a first a connective – called “implicit connective” – for step in the annotation was to identify the explicit 2All explicit connectives annotated in the PDTB are listed in the PDTB manual (PDTB-Group, 2008). 1http://www.seas.upenn.edu/˜pdtb 3These guidelines are recorded in the PDTB manual. each pair of adjacent sentences, as long as the re- Under these conditions, annotators were in- lation was not already expressed with one of the structed to look for and mark as Altlex, whatever explicit connectives provided to them. This proce- alternative expression appeared to denote the re- dure led to the annotation of implicit connectives lation. Thus, for example, Ex. (6) was annotated such as because in Ex. (5), where a causal relation as AltLex because although a causal relation is in- is inferred but no explicit connective is present in ferred between the sentences, inserting a connec- the text to express the relation. tive like because makes expression of the relation (5) To compare temperatures over the past 10,000 redundant. Here the phrase One reason is is taken years, researchers analyzed the changes in concen- to denote the relation and is marked as AltLex. trations of two forms of oxygen. (Implicit=because) These measurements can indicate temperature (6) Now, GM appears to be stepping up the pace of its changes, . . . (Contingency:Cause:reason) factory consolidation to get in shape for the 1990s. One reason is mounting competition from new Annotators soon noticed that in many cases, Japanese car plants in the U.S. that are pour- they were not able to supply an implicit connec- ing out more than one million vehicles a year at costs lower than GM can match. (Contin- tive. Reasons supplied included (a) “there is a re- gency:Cause:reason) lation between these sentences but I cannot think of a connective to insert between them”, (b) “there The result of this procedure led to the annota- is a relation between the sentences for which I tion of 624 tokens of AltLex in the PDTB. We can think of a connective, but it doesn’t sound turn to our analysis of these expressions in the good”, and (c) “there is no relation between the next section. sentences”. For all such cases, annotators were instructed to supply “NONE” as the implicit con- 3 What is found in AltLex? nective. Later, we sub-divided these “NONE” im- plicits into “EntRel”, for the (a) type above (an Several questions arise when considering the Alt- entity-based coherence relation, since the second Lex annotations. What kind of expressions are sentence seemed to continue the description of they? What can we learn from their syntax? some entity mentioned in the first); “NoRel” (no Do they project discourse relations of a different relation) for the (c) type; and “AltLex”, for the (b) sort than connectives? How can they be identi- type, which we turn to next. fied, both during manual annotation and automat- Closer investigation of the (b) cases revealed ically? To address these questions, we examined that the awkwardness perceived by annotators the AltLex annotation for annotated senses, and when inserting an implicit connective was due to for common lexico-syntactic patterns extracted redundancy in the expression of the relation: Al- using alignment with the Penn Treebank (Marcus though no explicit connective was present to re- et al., 1993).4 late the two sentences, some other expression ap- peared to be doing the job. This is indeed what 3.1 Lexico-syntactic Characterization we found. Subsequently, instances of AltLex were We found that we could partition AltLex annota- annotated if: tion into three groups by (a) whether or not they 1. A discourse relation can be inferred between belonged to one of the syntactic classes admit- adjacent sentences. ted as explicit connectives in the PDTB, and (b) whether the expression was frozen (ie, blocking 2. There is no explicit connective present to re- free substitution, modification or deletion of any late them. of its parts) or open-ended. The three groups are shown in Table 1 and discussed below. 3. The annotator is not able to insert an im- plicit connective to express the inferred rela- 4The source texts of the PDTB come from the Penn tion (having used “NONE” instead), because Treebank (PTB) portion of the Wall Street Journal corpus. The PDTB corpus provides PTB tree alignments of all its inserting it leads to an awkward redundancy text span annotations, including connectives, AltLex’s, argu- in expressing the relation. ments of relations, and attribution spans. AltLex Group No (%) Examples Syntactically 92 (14.7%) quite the contrary (ADVP), for one thing (PP), as well (ADVP), admitted, lexi- too (ADVP), soon (ADVP-TMP), eventually (ADVP-TMP), cally frozen thereafter (RB), even (ADVP), especially (ADVP), actually (ADVP), still (ADVP), only (ADVP), in response (PP) Syntactically 54 (8.7%) What’s more (SBAR-ADV), Never mind that (ADVP- free, lexically TMP;VB;DT), To begin with (VP), So (ADVP-PRD-TPC), frozen Another (DT), further (JJ), As in (IN;IN), So what if (ADVP;IN), Best of all (NP) Syntactically 478 (76.6%) That compares with (NP-SBJ;VBD;IN), After these payments and lexically (PP-TMP), That would follow (NP-SBJ;MD;VB), The plunge free followed (NP-SBJ;VBD), Until then (PP-TMP), The increase was due mainly to (NP-SBJ;VBD;JJ;RB;TO), That is why (NP- SBJ;VBZ;WHADVP), Once triggered (SBAR-TMP) TOTAL 624 –
Table 1: Breakdown of AltLex by Syntactic and Lexical Flexibility. Examples in the third column are accompanied (in parentheses) with their PTB POS tags and constituent phrase labels obtained from the PDTB-PTB alignment.
Syntactically admitted and lexically frozen: brands, many of which are virtually carbon The first row shows that 14.7% of the strings an- copies of one other. (Expansion:Conjunction) notated as AltLex belong to syntactic classes ad- Many of these AltLex annotations do not con- mitted as connectives and are similarly frozen. stitute a single constituent in the PTB, as with (Syntactic class was obtained from the PDTB- Never mind that. These cases suggest that ei- PTB alignment.) So, despite the effort in prepar- ther the restrictions on connectives as frozen ex- ing a list of connectives (cf. Section 1), additional pressions should be relaxed to admit all syntactic ones were still found in the corpus through AltLex classes, or the syntactic analyses of these multi- annotation. This suggests that any pre-defined list word expressions is irrelevant to their function. of connectives should only be used to guide anno- Both syntactically and lexically free: This tators in a strategy for “discovering” connectives. third group (Table 1, row 3) constitutes the major- Syntactically free and lexically frozen: AltLex ity of AltLex annotations – 76.6% (478/624). Ad- expressions that were frozen but belonged to syn- ditional examples are shown in Table 2. Common tactic classes other than those admitted for the syntactic patterns here include subjects followed PDTB explicit connectives accounted for 8.7% by verbs (Table 2a-c), verb phrases with comple- (54/624) of the total (Table 1, row 2). For exam- ments (d), adverbial clauses (e), and main clauses ple, the AltLex What’s more (Ex. 7) is parsed as with a subordinating conjunction (f). a clause (SBAR) functioning as an adverb (ADV). All these AltLex annotations are freely modifi- It is also frozen, in not undergoing any change (eg, able, with their fixed and modifiable parts shown What’s less, What’s bigger, etc.5 in the regular expressions defined for them in Ta- (7) Marketers themselves are partly to blame: They’ve ble 2. Each has a fixed “core” phrase shown as increased spending for coupons and other short- lexical tokens in the regular expression, e.g, con- term promotions at the expense of image-building sequence of, attributed to, plus obligatory and op- advertising. What’s more, a flood of new prod- ucts has given consumers a dizzying choice of tional elements shown as syntactic labels. Op- < > 5 tional elements are shown in parentheses. NX Apparently similar headless relative clauses such as < > What’s more exciting differ from What’s more in not func- indicates any noun phrase, PPX , any prepo- tioning as adverbials, just as NPs. sitional phrase,
Table 2: Complex AltLex strings and their patterns
Table 3: Annotated and Unannotated instances of AltLex glish biomedical articles with discourse relations, through AltLex annotation). Preliminary analysis Yu et al (2008) report finding many DRMs that of the results reveals many DRMs that don’t ap- don’t appear in the WSJ (e.g., as a consequence). pear anywhere in the WSJ Corpus (eg, as a con- If one is to fully exploit DRMs in classifying sequence, as an example, by the same token), as discourse relations, one must be able to identify well as additional DRMs that appear in the cor- them all, or at least many more of them than we pus but were not annotated as AltLex (e.g., above have to date. One method that seems promising all, after all, despite that). Many of these latter is Callison-Burch’s paraphrase generation through instances appear in the initial sentence of a para- back-translation on pairs of word-aligned corpora graph, but the annotation of implicit connectives (Callison-Birch, 2007). This method exploits the — which is what led to AltLex annotation in the frequency with which a word or phrase is back first place (Section 2) — was not carried out on translated (from texts in language A to texts in these sentences. language B, and then back from texts in language There are two further things to note before clos- B to texts in language A) across a range of pivot ing this discussion. First, there is an additional languages, into other words or phrases. source of noise in using back-translation para- While there are many factors that introduce phrase to expand the set of identified DRMs. This low-frequency noise into the process, including arises from the fact that discourse relations can lexical ambiguity and errors in word alignment, be conveyed either explicitly or implicitly, and Callison-Burch’s method benefits from being able a translated text may not have made the same to use the many existing word-aligned translation choices vis-a-vis explicitation as its source, caus- pairs developed for creating translation models for ing additional word alignment errors (some of SMT. Recently, Callison-Burch showed that para- which are interesting, but most of which are not). phrase errors could be reduced by syntactically Secondly, this same method should prove useful constraining the phrases identified through back- for languages other English, although there will be translation to ones with the same syntactic cat- an additional problem to overcome for languages egory as assigned to the source (Callison-Birch, (such as Turkish) in which DRMs are conveyed 2008), using a large set of syntactic categories through morphology as well as through distinct similar to those used in CCG (Steedman, 2000). words and phrases. For DRMs, the idea is to identify through back- 6 Related work translation, instances of DRMs that were neither included in our original set of explicit connec- We are not the first to recognize that discourse re- tive nor subsequently found through AltLex an- lations can realized by more than just one or two notation. To allow us to carry out a quick pi- syntactic classes. Halliday and Hasan (1976) doc- lot study, Callison-Burch provided us with back- ument prepositional phrases like After that being translations of 147 DRMs (primarily explicit con- used to express conjunctive relations. More im- nectives annotated in the PDTB 2.0, but also in- portantly, they note that any definite description cluding a few from other syntactic classes found can be substituted for the demonstrative pronoun. Similarly, Taboada (2006), in looking at how of- class expressions is problematic. We drew on our ten RST-based rhetorical relations are realized by experience in creating the lexically grounded an- discourse markers, starts by considering only ad- notations of the Penn Discourse Treebank, and verbials, prepositional phrases, and conjunctions, showed that markers of discourse relations should but then notes the occurrence of a single instance instead be treated as open-class items, with uncon- of a nominal fragment The result in her corpus. strained syntactic possibilities. Manual annota- Challenging the RST assumption that the basic tion and automatic identification practices should unit of a discourse is a clause, with discourse rela- develop methods in line with this finding if they tions holding between adjacent clausal units, Kib- aim to exhaustively identify all discourse relation ble (1999) provides evidence that informational markers. discourse relations (as opposed to intentional dis- course relations) can hold intra-clausally as well, Acknowledgments with the relation “verbalized” and its arguments We want to thank Chris Callison-Burch, who realized as nominalizations, as in Early treatment graciously provided us with EuroParl back- with Brand X can prevent a cold sore developing. translation paraphrases for the list of connectives Since his focus is intra-clausal, he does not ob- we sent him. This work was partially supported serve that verbalized discourse relations can hold by NSF grant IIS-07-05671. across sentences as well, where a verb and one of its arguments function similarly to a discourse adverbial, and in the end, he does not provide a References proposal for how to systematically identify these alternative realizations. Le Huong et al. (2003), Callison-Birch, Chris. 2007. Paraphrasing and Trans- lation. Ph.D. thesis, School of Informatics, Univer- in developing an algorithm for recognizing dis- sity of Edinburgh. course relations, consider non-verbal realizations (called NP cues) in addition to verbal realizations Callison-Birch, Chris. 2008. Syntactic constraints on (called VP cues). However, they provide only one paraphrases extracted from parallel corpora. In Pro- ceedings of Conference on Empirical Methods in example of such a cue (“the result”). Like Kib- Natural Language Processing (EMNLP). ble (1999), Danlos (2006) and Power (2007) also focus only on identifying verbalizations of dis- Danlos, Laurence. 2006. Discourse verbs. In Pro- course relations, although they do consider cases ceedings of the 2nd Workshop on Constraints in Dis- course, pages 59–65, Maynooth, Ireland. where such relations hold across sentences. What has not been investigated in prior work Forbes-Riley, Katherine, Bonnie Webber, and Aravind is the basis for the alternation between connec- Joshi. 2006. Computing discourse semantics: The tives and AltLex’s, although there are several ac- predicate-argument semantics of discourse connec- counts of why a language may provide more than tives in D-LTAG. Journal of Semantics, 23:55–106. one connective that conveys the same relation. Halliday, M. A. K. and Ruqaiya Hasan. 1976. Cohe- For example, the alternation in Dutch between sion in English. London: Longman. dus (“so”), daardoor (“as a result”), and daarom (“that’s why”) is explained by Pander Maat and Huong, LeThanh, Geetha Abeysinghe, and Christian Huyck. 2003. Using cohesive devices to recog- Sanders (2000) as having its basis in “subjectiv- nize rhetorical relations in text. In Proceedings of ity”. 4th Computational Linguistics UK Research Collo- quium (CLUK 4), University of Edinburgh, UK. 7 Conclusion and Future Work Kibble, Rodger. 1999. Nominalisation and rhetorical Categorizing and identifying the range of ways in structure. In Proceedings of ESSLLI Formal Gram- which discourse relations are realized is impor- mar conference, Utrecht. tant for both discourse understanding and gener- Knott, Alistair. 1996. A Data-Driven Methodology ation. In this paper, we showed that existing prac- for Motivating a Set of Coherence Relations. Ph.D. tices of cataloguing these ways as lists of closed thesis, University of Edinburgh, Edinburgh. Lin, Ziheng, Min-Yen Kan, and Hwee Tou Ng. 2009. Sporleder, Caroline and Alex Lascarides. 2008. Using Recognizing implicit discourse relations in the penn automatically labelled examples to classify rhetori- discourse treebank. In Proceedings of the Confer- cal relations: an assessment. Natural Language En- ence on Empirical Methods in Natural Language gineering, 14(3):369–416. Processing, Singapore. Steedman, Mark. 2000. The Syntactic Process. MIT Maat, Henk Pander and Ted Sanders. 2000. Do- Press, Cambridge MA. mains of use or subjectivity? the distribution of three dutch causal connectives explained. TOPICS Taboada, Maite. 2006. Discourse markers as signals IN ENGLISH LINGUISTICS, pages 57–82. (or not) of rhetorical relations. Journal of Pragmat- ics, 38(4):567–592. Marcu, Daniel and Abdessamad Echihabi. 2002. An unsupervised approach to recognizing discourse re- Yu, Hong, Nadya Frid, Susan McRoy, P Simpson, lations. In Proceedings of the Association for Com- Rashmi Prasad, Alan Lee, and Aravind Joshi. 2008. putational Linguistics. Exploring discourse connectivity in biomedical text for text mining. In Proceedings of the 16th Annual Marcus, Mitchell P., Beatrice Santorini, and Mary Ann International Conference on Intelligent Systems for Marcinkiewicz. 1993. Building a large annotated Molecular Biology BioLINK SIG Meeting, Toronto, corpus of english: The Penn Treebank. Computa- Canada. tional Linguistics, 19(2):313–330.
Martin, James R. 1992. English text: System and structure. Benjamins, Amsterdam.
Oza, Umangi, Rashmi Prasad, Sudheer Kolachina, Dipti Mishra Sharma, and Aravind Joshi. 2009. The hindi discourse relation bank. In Proceedings of the ACL 2009 Linguistic Annotation Workshop III (LAW-III), Singapore.
Pitler, Emily and Ani Nenkova. 2009. Using syntax to disambiguate explicit discourse connectives in text. In Proceedings of the Joint Conference of the 47th Meeting of the Association for Computational Lin- guistics and the 4th International Joint Conference on Natural Language Processing, Singapore.
Pitler, Emily, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Con- ference of the 47th Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing.
Power, Richard. 2007. Abstract verbs. In ENLG ’07: Proceedings of the Eleventh European Workshop on Natural Language Generation, pages 93–96, Mor- ristown, NJ, USA. Association for Computational Linguistics.
Prasad, Rashmi, Nikhil Dinesh, Alan Lee, Eleni Milt- sakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse TreeBank 2.0. In Proceedings of 6th International Conference on Language Resources and Evaluation (LREC 2008).
PDTB-Group. 2008. The Penn Discourse TreeBank 2.0 Annotation Manual. Technical Report IRCS-08- 01, Institute for Research in Cognitive Science, Uni- versity of Pennsylvania.