DiMLex: A lexicon of discourse markers for text generation and understanding
Manfred Stede and Carla Umbach
Technische Universität Berlin
Projektgruppe KIT
Sekr. FR 6-10
Franklinstr. 28/29
D-10587 Berlin, Germany
email: {stede|umbach}@cs.tu-berlin.de

Abstract

Discourse markers ('cue words') are lexical items that signal the kind of coherence relation holding between adjacent text spans; for example, because, since, and for this reason are different markers for causal relations. Discourse markers are a syntactically quite heterogeneous group of words, many of which are traditionally treated as function words belonging to the realm of grammar rather than to the lexicon. But for a single discourse relation there is often a set of similar markers, allowing for a range of paraphrases for expressing the relation. To capture the similarities and differences between these, and to represent them adequately, we are developing DiMLex, a lexicon of discourse markers. After describing our methodology and the kind of information to be represented in DiMLex, we briefly discuss its potential applications in both text generation and understanding.

1 Introduction

Assuming that text can be formally described (and represented) by means of discourse relations holding between adjacent portions of text (e.g., [Mann, Thompson 1988]), we use the term discourse marker for those lexical items that (in addition to non-lexical means such as punctuation, aspectual and focus shifts, etc.) can signal the presence of a relation at the linguistic surface. Typically, a discourse relation is associated with a wide range of such markers; consider, for instance, the following variety of CONCESSIONS, which all express the same underlying propositional content. The words treated here as discourse markers are underlined.

We were in SoHo; {nevertheless | nonetheless | however | still | yet}, we found a cheap bar.
We were in SoHo, but we found a cheap bar anyway.
Despite the fact that we were in SoHo, we found a cheap bar.
Notwithstanding the fact that we were in SoHo, we found a cheap bar.
Although we were in SoHo, we found a cheap bar.

If one accepts these sentences as paraphrases, then the various discourse markers all need to be associated with the information that they signal a concessive relationship between the two propositions involved. Next, the fine-grained differences between similar markers need to be represented; one such difference is the degree of specificity: for example, but can mark a general CONTRAST or a more specific CONCESSION. We believe that a dedicated discourse marker lexicon holding this kind of information can serve as a valuable resource for natural language processing. Our efforts in constructing that lexicon are described in Section 2.

From the perspective of text generation, not all paraphrases listed above are equally felicitous in specific contexts. In order to choose the most appropriate variant, a generator needs knowledge about the fine-grained differences between similar markers for the same relation. Furthermore, it needs to account for the interactions between marker choice and other generation decisions and hence needs knowledge about the syntagmatic constraints associated with different markers. We will discuss this perspective in Section 3.

From the perspective of text understanding, a sophisticated system should be able to derive the discourse relations holding between adjacent text spans, and also to notice the additional semantic and pragmatic implications stemming from the usage of a particular discourse marker. We will briefly characterize such applications in Section 4.
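As an illustration of the paraphrase set above, the following is a minimal sketch of how markers sharing one relation could be stored and realized. It is not the DiMLex representation: the class `Marker`, the list `MARKERS`, the template strings, and the functions `realize` and `paraphrases` are all hypothetical names introduced here, and the templates cover only four of the SoHo examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Marker:
    """Hypothetical record: one marker, the relation it signals, a surface template."""
    name: str
    relation: str
    template: str  # {a} is the conceded proposition, {b} the main one


# Assumed toy inventory, restricted to examples quoted in the introduction.
MARKERS = [
    Marker("nevertheless", "CONCESSION", "{a}; nevertheless, {b}."),
    Marker("but ... anyway", "CONCESSION", "{a}, but {b} anyway."),
    Marker("despite the fact that", "CONCESSION", "despite the fact that {a}, {b}."),
    Marker("although", "CONCESSION", "although {a}, {b}."),
]


def realize(marker: Marker, a: str, b: str) -> str:
    """Fill the template and upper-case only the first character."""
    sentence = marker.template.format(a=a, b=b)
    return sentence[0].upper() + sentence[1:]


def paraphrases(relation: str, a: str, b: str) -> List[str]:
    """Return one sentence per marker that signals the given relation."""
    return [realize(m, a, b) for m in MARKERS if m.relation == relation]


if __name__ == "__main__":
    for s in paraphrases("CONCESSION", "we were in SoHo", "we found a cheap bar"):
        print(s)
```

The sketch treats all variants as interchangeable, which is exactly the simplification the text argues against: choosing among them would require the fine-grained features that the lexicon is meant to record.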
2 Building a Discourse Marker Lexicon

2.1 The idea

The traditional distinction between content words and function words (or open-class and closed-class items) relies on the stipulation that the former have their "own" meaning independent of the context in which they are used, whereas the latter assume meaning only in context. Then, content words are assigned to the realm of the lexicon, whereas function words are treated as a part of grammar.

For dealing with discourse markers, we do not regard this distinction as particularly helpful, though. As we have illustrated above and will elaborate below, these words can carry a wide variety of semantic and pragmatic overtones, which render the choice of a marker meaning-driven, as opposed to a mere consequence of structural decisions. Furthermore, a number of lexical relations that are customarily used to assign structure to the universe of "open class" lexical items, most prominently synonymy, plesionymy ("near-synonymy"), antonymy, hyponymy and polysemy, can be applied to discourse markers as well:

• Synonymy: It can be argued that true synonyms do not exist at all. However, the German words obzwar and obschon (both more formal variants of obwohl = although) certainly come very close to being synonymous.
• Plesionymy: although and though, according to Martin [1992], differ in formality; although and even though differ in terms of emphasis.
• Antonymy: if/unless, according to Barker [1994], have opposite polarity, as in He will not attend unless he finishes his paper vs. He will attend if he finishes his paper.
• Hyponymy: Some markers are more specific than others; recall the example of but given above. Knott and Mellish [1996] deal with the issue of "taxonomizing" discourse markers.
• Polysemy: Other than being more or less specific, some markers can signal quite different relations; e.g., while can be used for TEMPORAL CO-OCCURRENCE, and also for CONTRAST.

Accordingly, we propose that the proper place for describing discourse markers is a dedicated lexicon that provides a classification of their syntactic, semantic and pragmatic features and characterizes the relationships between similar markers. To this end, our group is developing a Discourse Marker LEXicon (DiMLex), which aims at assembling the various kinds of information associated with markers and describing them on a uniform level of representation. Our initial focus is on German, but English will also be a target language.

2.2 Methodology

Methodological considerations pertain to two tasks: determining the set of words that we regard as discourse markers and that are thus to be included in the lexicon, and determining the lexical entries for these words.

Finding the "right" set of discourse markers is not an easy task, since the common lexicographic practice of taking part of speech as the primary criterion for inclusion or exclusion does not apply. Knott and Mellish [1996] provide an apt summary of the situation. Their 'test for relational phrases' is a good start, but it is geared towards the English language (we are investigating German as well); furthermore, it catches only items relating clauses: in Despite the heavy rain, we went for a walk it would not detect a cue phrase.

To arrive at a more comprehensive set, we began by consulting standard grammars such as Quirk et al. [1972] and Helbig and Buscha [1991], which provide descriptions of function words grouped according to semantic class, but these are far from "complete". A very good source for German is [Brausse et al. in prep.], which investigates a huge set of connectives from a grammatical viewpoint.

As for determining lexical descriptions, the research literature offers a large number of helpful, even though quite heterogeneous, sources. There are several detailed studies of individual groups of markers, such as [Vander Linden, Martin 1995] for PURPOSE markers. Besides, the linguistics literature offers fine-grained analyses of individual markers, which are far too numerous to list. We are drawing upon all these sources, trying to place them in a single unified framework. The overall goal can be characterized as the aim to synthesize two strands of research that so far are rather disconnected:

• "Top-down": Text linguistics considers markers as a means to signal coherence, and provides us with insights on the semantic and pragmatic properties of marker classes.
• "Bottom-up": Grammars as well as the linguistic research literature provide syntactic, semantic and stylistic properties of individual markers, comparative studies of related markers, etc.

2.3 The lexicon

Although our classification of lexical features is still under development, we give here a tentative list of such features in order to illustrate the range of phenomena under consideration. The list is loosely ordered from syntactic to semantic and pragmatic features; for now, we do not explicitly assign such categories.

The part of speech of a marker (conjunctive, subordinating conjunction, coordinating conjunction, preposition) determines the possibilities of positioning the marker within the constituent: conjunctives (especially the German

Moving towards pragmatics, the intention behind using a marker can vary. A well-known example is the contrast between German aber and sondern (in English, they both correspond to but), where the former merely states a contrast, whereas the latter corrects an assumption on the hearer's side (e.g., [Helbig, Buscha 1991]). Another dimension concerns the presuppositions associated with markers; a well-known case is the contrast between because and since, where only the latter marks the subsequent proposition as given. The German CAUSE markers weil and denn differ in terms of the illocutions they connect: the former applies to propositions, the latter to epistemic judgements [Brausse et al., in prep.]. Certain very similar markers differ only stylistically. One German example was given above, and another one is the English notwithstanding, which is more formal than despite but moreover is more flexible in positioning, as it can be postponed.

The final but crucial feature to be mentioned here is the discourse relation expressed by a marker. RST [Mann, Thompson 1988] offers an inspiring theory of such relations, but we do not fully subscribe to this account.