
Acquiring the Meaning of Discourse Markers

Ben Hutchinson
School of Informatics, University of Edinburgh
[email protected]

Abstract

This paper applies machine learning techniques to acquiring aspects of the meaning of discourse markers. Three subtasks of acquiring the meaning of a discourse marker are considered: learning its polarity, veridicality, and type (i.e. causal, temporal or additive). Accuracy of over 90% is achieved for all three tasks, well above the baselines.

1 Introduction

This paper is concerned with automatically acquiring the meaning of discourse markers. By considering the distributions of individual tokens of discourse markers, we classify discourse markers along three dimensions upon which there is substantial agreement in the literature: polarity, veridicality and type. This approach of classifying linguistic types by the distribution of linguistic tokens makes this research similar in spirit to that of Baldwin and Bond (2003) and Stevenson and Merlo (1999).

Discourse markers signal relations between discourse units. As such, discourse markers play an important role in the parsing of natural language discourse (Forbes et al., 2001; Marcu, 2000), and their correspondence with discourse relations can be exploited for the unsupervised learning of discourse relations (Marcu and Echihabi, 2002). In addition, generating natural language discourse requires the appropriate selection and placement of discourse markers (Moser and Moore, 1995; Grote and Stede, 1998). It follows that a detailed account of the semantics and pragmatics of discourse markers would be a useful resource for natural language processing.

Rather than looking at the finer subtleties in meaning of particular discourse markers (e.g. Bestgen et al. (2003)), this paper aims at a broad scale classification of a subclass of discourse markers: structural connectives. This breadth of coverage is of particular importance for discourse parsing, where a wide range of linguistic realisations must be catered for. This work can be seen as orthogonal to that of Di Eugenio et al. (1997), which addresses the problem of learning if and where discourse markers should be generated.

Unfortunately, the manual classification of large numbers of discourse markers has proven to be a difficult task, and no complete classification yet exists. For example, Knott (1996) presents a list of around 350 discourse markers, but his taxonomic classification, perhaps the largest classification in the literature, accounts for only around 150 of these. A general method of automatically classifying discourse markers would therefore be of great utility, both for English and for languages with fewer manually created resources. This paper constitutes a step in that direction. It attempts to classify discourse markers whose classes are already known, and this allows the classifier to be evaluated empirically.

The proposed task of learning automatically the meaning of discourse markers raises several questions which we hope to answer:

Q1. Difficulty: How hard is it to acquire the meaning of discourse markers? Are some aspects of meaning harder to acquire than others?

Q2. Choice of features: What features are useful for acquiring the meaning of discourse markers? Does the optimal choice of features depend on the aspect of meaning being learnt?

Q3. Classifiers: Which machine learning algorithms work best for this task? Can the right choice of empirical features make the classification problems linearly separable?

Q4. Evidence: Can corpus evidence be found for the existing classifications of discourse markers? Is there empirical evidence for a separate class of TEMPORAL markers?

We proceed by first introducing the classes of discourse markers that we use in our experiments. Section 3 discusses the database of discourse markers used as our corpus. In Section 4 we describe our experiments, including choice of features. The results are presented in Section 5. Finally, we conclude and discuss future work in Section 6.

2 Discourse markers

Discourse markers are lexical items (possibly multiword) that signal relations between propositions, events or speech acts. Examples of discourse markers are given in Tables 1, 2 and 3. In this paper we will focus on a subclass of discourse markers known as structural connectives. These markers, even though they may be multiword expressions, function syntactically as if they were coordinating or subordinating conjunctions (Webber et al., 2003).

The literature contains many different classifications of discourse markers, drawing upon a wide range of evidence including textual cohesion (Halliday and Hasan, 1976), hypotactic conjunctions (Martin, 1992), cognitive plausibility (Sanders et al., 1992), substitutability (Knott, 1996), and psycholinguistic experiments (Louwerse, 2001). Nevertheless there is also considerable agreement. Three dimensions of classification that recur, albeit under a variety of names, are polarity, veridicality and type. We now discuss each of these in turn.

2.1 Polarity

Many discourse markers signal a concession, a contrast or the denial of an expectation. These markers have been described as having the feature polarity=NEG-POL. An example is given in (1).

(1) Suzy's part-time, but she does more work than the rest of us put together. (Taken from Knott (1996, p. 185))

This sentence is true if and only if Suzy both is part-time and does more work than the rest of them put together. In addition, it has the effect of signalling that the fact Suzy does more work is surprising: it denies an expectation. A similar effect can be obtained by using the connective and and adding more context, as in (2).

(2) Suzy's efficiency is astounding. She's part-time, and she does more work than the rest of us put together.

The difference is that although it is possible for and to co-occur with a negative polarity discourse relation, it need not. Discourse markers like and are said to have the feature polarity=POS-POL.[1] A NEG-POL discourse marker, on the other hand, always co-occurs with a negative polarity discourse relation.

[1] An alternative view is that discourse markers like and are underspecified with respect to polarity (Knott, 1996). In this account, discourse markers have positive polarity only if they can never be paraphrased using a discourse marker with negative polarity. Interpreted in these terms, our experiment aims to distinguish negative polarity discourse markers from all others.

The gold standard classes of POS-POL and NEG-POL discourse markers used in the learning experiments are shown in Table 1. The gold standards for all three experiments were compiled by consulting a range of previous classifications (Knott, 1996; Knott and Dale, 1994; Louwerse, 2001).[2]

[2] An effort was made to exclude discourse markers whose classification could be contentious, as well as ones which showed ambiguity across classes. Some level of judgement was therefore exercised by the author.

POS-POL: after, and, as, as soon as, because, before, considering that, ever since, for, given that, if, in case, in order that, in that, insofar as, now, now that, on the grounds that, once, seeing as, since, so, so that, the instant, the moment, then, to the extent that, when, whenever

NEG-POL: although, but, even if, even though, even when, only if, only when, or, or else, though, unless, until, whereas, yet

Table 1: Discourse markers used in the polarity experiment

2.2 Veridicality

A discourse relation is veridical if it implies the truth of both its arguments (Asher and Lascarides, 2003), otherwise it is not. For example, in (3) it is not necessarily true either that David can stay up or that he promises, or will promise, to be quiet. For this reason we will say if has the feature veridicality=NON-VERIDICAL.

(3) David can stay up if he promises to be quiet.

The disjunctive discourse marker or is also NON-VERIDICAL, because it does not imply that both of its arguments are true. On the other hand, and does imply this, and so has the feature veridicality=VERIDICAL.

The VERIDICAL and NON-VERIDICAL discourse markers used in the learning experiments are shown in Table 2. Note that polarity and veridicality are independent; for example, even if is both NEG-POL and NON-VERIDICAL.

VERIDICAL: after, although, and, as, as soon as, because, but, considering that, even though, even when, ever since, for, given that, in order that, in that, insofar as, now, now that, on the grounds that, once, only when, seeing as, since, so, so that, the instant, the moment, then, though, to the extent that, until, when, whenever, whereas, while, yet

NON-VERIDICAL: assuming that, even if, if, if ever, if only, in case, on condition that, on the assumption that, or, or else, supposing that, unless

Table 2: Discourse markers used in the veridicality experiment

2.3 Type

Discourse markers like because signal a CAUSAL relation, for example in (4). The ADDITIVE, TEMPORAL and CAUSAL discourse markers used in the learning experiments are shown in Table 3. These features are independent of the previous ones; for example, even though is CAUSAL, VERIDICAL and NEG-POL.

ADDITIVE: and, but, or, or else, whereas

TEMPORAL: after, as soon as, before, ever since, now, now that, once, until, when, whenever

CAUSAL: although, because, even though, for, given that, if, if ever, in case, on condition that, on the assumption that, on the grounds that, only if, provided that, providing that, so, so that, supposing that, though, unless

Table 3: Discourse markers used in the type experiment
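The three gold-standard dimensions can be pictured as independent lookup tables. The sketch below is illustrative only (not from the paper), encoding a handful of entries from Tables 1-3; the marker lists here are deliberately partial.

```python
# Illustrative sketch only: partial encodings of the gold standards in
# Tables 1-3.  Each dimension is an independent mapping from a
# discourse marker to its class on that dimension.
POLARITY = {
    "and": "POS-POL", "because": "POS-POL", "after": "POS-POL",
    "but": "NEG-POL", "even if": "NEG-POL", "even though": "NEG-POL",
}
VERIDICALITY = {
    "and": "VERIDICAL", "because": "VERIDICAL", "after": "VERIDICAL",
    "but": "VERIDICAL", "even though": "VERIDICAL",
    "even if": "NON-VERIDICAL",
}
TYPE = {
    "and": "ADDITIVE", "but": "ADDITIVE",
    "after": "TEMPORAL",
    "because": "CAUSAL", "even though": "CAUSAL",
}

def profile(marker):
    """Return the (polarity, veridicality, type) triple for a marker."""
    return (POLARITY.get(marker), VERIDICALITY.get(marker), TYPE.get(marker))
```

For instance, profile("even though") yields ("NEG-POL", "VERIDICAL", "CAUSAL"), matching the independence example in the text: each dimension can vary freely of the others.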

(4) The tension in the boardroom rose sharply because the chairman arrived.

As a result, because has the feature type=CAUSAL. Other discourse markers that express a temporal relation, such as after, have the feature type=TEMPORAL. Just as a POS-POL discourse marker can occur with a negative polarity discourse relation, the context can also supply a causal relation even when a TEMPORAL discourse marker is used, as in (5).

(5) The tension in the boardroom rose sharply after the chairman arrived.

If the relation a discourse marker signals is neither CAUSAL nor TEMPORAL, it has the feature type=ADDITIVE.

The need for a distinct class of TEMPORAL discourse relations is disputed in the literature. On the one hand, it has been suggested that TEMPORAL relations are a subclass of ADDITIVE ones, on the grounds that the temporal reference inherent in the marking of tense and aspect "more or less" fixes the temporal ordering of events (Sanders et al., 1992). This contrasts with arguments that resolving discourse relations and temporal order occur as distinct but inter-related processes (Lascarides and Asher, 1993). On the other hand, several of the discourse markers we count as TEMPORAL, such as as soon as, might be described as CAUSAL (Oberlander and Knott, 1995). One of the results of the experiments described below is that corpus evidence suggests ADDITIVE, TEMPORAL and CAUSAL discourse markers have distinct distributions.

3 Corpus

The data for the experiments comes from a database of sentences collected automatically from the British National Corpus and the world wide web (Hutchinson, 2004). The database contains example sentences for each of 140 structural connectives.

Many discourse markers have surface forms with other usages, e.g. before in the phrase before noon. The following procedure was therefore used to select sentences for inclusion in the database. First, sentences containing a string matching the surface form of a structural connective were extracted. These sentences were then parsed using a statistical parser (Charniak, 2000). Potential structural connectives were then classified on the basis of their syntactic context, in particular their proximity to S nodes. Figure 1 shows example syntactic contexts which were used to identify discourse markers.

(S ...) (CC and) (S ...)
(SBAR (IN after) (S ...))
(PP (IN after) (S ...))
(PP (VBN given) (SBAR (IN that) (S ...)))
(NP (DT the) (NN moment) (SBAR ...))
(ADVP (RB as) (RB long) (SBAR (IN as) (S ...)))
(PP (IN in) (SBAR (IN that) (S ...)))

Figure 1: Identifying structural connectives

It is because structural connectives are easy to identify in this manner that the experiments use only this subclass of discourse markers. Due to both parser errors, and the fact that the syntactic heuristics are not foolproof, the database contains noise. Manual analysis of a sample of 500 sentences revealed that about 12% of sentences do not contain the discourse marker they are supposed to.

Of the discourse markers used in the experiments, frequencies in the database ranged from 270 for the instant to 331,701 for and. The mean number of instances was 32,770, while the median was 4,948.

New label: Penn Treebank labels
vb: vb, vbd, vbg, vbn, vbp, vbz
nn: nn, nns, nnp
jj: jj, jjr, jjs
rb: rb, rbr, rbs
aux: aux, auxg, md
prp: prp, prp$
in: in

Table 4: Clustering of POS labels

4 Experiments

This section presents three machine learning experiments into automatically classifying discourse markers according to their polarity, veridicality and type. We begin in Section 4.1 by describing the features we extract for each discourse marker token. Then in Section 4.2 we describe the different classifiers we use.
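The selection procedure above can be sketched as a pattern match over the parser's bracketed output. The snippet below is a hypothetical simplification, not the tooling used in the paper: it covers only two of the Figure 1 contexts and matches only single-word connectives (multiword markers such as "as soon as" would need further patterns).

```python
import re

# Hypothetical sketch of the sentence-selection heuristic: given a
# Penn Treebank style bracketing (e.g. from Charniak's parser), test
# whether a candidate connective occurs in a licensing context.
PATTERNS = [
    # (SBAR (IN after) (S ...)) -- subordinating conjunction
    re.compile(r"\(SBAR \(IN (?P<dm>\w+)\) \(S[ )]"),
    # (S ...) (CC and) (S ...) -- coordinating conjunction
    re.compile(r"\) \(CC (?P<dm>\w+)\) \(S[ )]"),
]

def find_connectives(parse):
    """Return candidate structural connectives licensed by the parse."""
    return [m.group("dm") for pat in PATTERNS for m in pat.finditer(parse)]
```

Applied to a parse of "She left and he stayed", this picks out "and" via the (S ...) (CC and) (S ...) context, while "before noon"-style uses fail the S-adjacency requirement and are discarded.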
The results are presented in Section 4.3.

4.1 Features used

We only used structural connectives in the experiments. This meant that the clauses linked syntactically were also related at the discourse level (Webber et al., 2003). Two types of features were extracted from the conjoined clauses. Firstly, we used lexical co-occurrences with words of various parts of speech. Secondly, we used a range of linguistically motivated syntactic, semantic, and discourse features.

4.1.1 Lexical co-occurrences

Lexical co-occurrences have previously been shown to be useful for discourse level learning tasks (Lapata and Lascarides, 2004; Marcu and Echihabi, 2002). For each discourse marker, the words occurring in their superordinate (main) and subordinate clauses were recorded,[3] along with their parts of speech. We manually clustered the Penn Treebank parts of speech together to obtain coarser grained syntactic categories, as shown in Table 4.

[3] For coordinating conjunctions, the left clause was taken to be the superordinate/main clause, the right the subordinate clause.

We then lemmatised each word and excluded all lemmas with a frequency of less than 1000 per million in the BNC. Finally, words were given a prefix of either SUB or SUPER according to whether they occurred in the sub- or superordinate clause linked by the marker. This distinguished, for example, between occurrences of then in the antecedent (subordinate) and consequent (main) clauses linked by if.

We also recorded the presence of other discourse markers in the two clauses, as these had previously been found to be useful on a related classification task (Hutchinson, 2003). The discourse markers used for this are based on the list of 350 markers given by Knott (1996), and include multiword expressions. Due to the sparser nature of discourse markers, no frequency cutoffs were used for these features.

4.1.2 Linguistically motivated features

These included a range of one and two dimensional features representing more abstract linguistic information, and were extracted through automatic analysis of the parse trees.

One dimensional features

Two one dimensional features recorded the location of discourse markers. POSITION indicated whether a discourse marker occurred between the clauses it linked, or before both of them. It thus relates to information structuring. EMBEDDING indicated the level of embedding, in number of clauses, of the discourse marker beneath the sentence's highest level clause. We were interested to see if some types of discourse relations are more often deeply embedded.

The remaining features recorded the presence of linguistic features that are localised to a particular clause. Like the lexical co-occurrence features, these were indexed by the clause they occurred in: either SUPER or SUB.

We expected negation to correlate with negative polarity discourse markers, and approximated negation using four features. NEG-SUBJ and NEG-VERBAL indicated the presence of subject negation (e.g. nothing) or verbal negation (e.g. n't). We also recorded the occurrence of a set of negative polarity items (NPIs), such as any and ever. The features NPI-AND-NEG and NPI-WO-NEG indicated whether an NPI occurred in a clause with or without verbal or subject negation.

Eventualities can be placed or ordered in time using not just discourse markers but also temporal expressions. The feature TEMPEX recorded the number of temporal expressions in each clause, as returned by a temporal expression tagger (Mani and Wilson, 2000).

If the main verb was an inflection of to be or to do we recorded this using the features BE and DO. Our motivation was to capture any correlation of these verbs with states and events respectively.

If the final verb was a modal auxiliary, this ellipsis was evidence of strong cohesion in the text (Halliday and Hasan, 1976). We recorded this with the feature VP-ELLIPSIS. Pronouns also indicate cohesion, and have been shown to correlate with subjectivity (Bestgen et al., 2003). A class of features PRONOUNS represented pronouns, with subscripts denoting either 1st person, 2nd person, or 3rd person animate, inanimate or plural.

The syntactic structure of each clause was captured using two features, one finer grained and one coarser grained. STRUCTURAL-SKELETON identified the major constituents under the S or VP nodes, e.g. a simple double object construction gives "NP VB NP NP". ARGS identified whether the clause contained an (overt) object, an (overt) subject, both, or neither.

The overall size of a clause was represented using four features. WORDS, NPS and PPS recorded the numbers of words, NPs and PPs in a clause (not counting embedded clauses). The feature CLAUSES counted the number of clauses embedded beneath a clause.

Two dimensional features

These features all recorded combinations of linguistic features across the two clauses linked by the discourse marker. For example, the MOOD feature would take the value <DECL,IMP> for the sentence John is coming, but don't tell anyone! These features were all determined automatically by analysing the auxiliary verbs and the main verbs' POS tags. The features and the possible values for each clause were as follows: MODALITY: one of FUTURE, ABILITY or NULL; MOOD: one of DECL, IMP or INTERR; PERFECT: either YES or NO; PROGRESSIVE: either YES or NO; TENSE: either PAST or PRESENT.

4.2 Classifier architectures

Two different classifiers, based on local and global methods of comparison, were used in the experiments. The first, 1 Nearest Neighbour (1NN), is an instance based classifier which assigns each marker to the same class as that of the marker nearest to it. For this, three different distance metrics were explored. The first metric was the Euclidean distance function d, shown in (6), applied to probability distributions.

    d(p, q) = sqrt( SUM_x ( p(x) - q(x) )^2 )                                     (6)

The second, sd, is a smoothed variant of the information theoretic Kullback-Leibler divergence (Lee, 2001, with alpha = 0.99). Its definition is given in (7).

    sd(p, q) = SUM_x p(x) log( p(x) / ( alpha * q(x) + (1 - alpha) * p(x) ) )     (7)

The third metric, jacc_t, is a t-test weighted adaption of the Jaccard coefficient (Curran and Moens, 2002). In its basic form, the Jaccard coefficient is essentially a measure of how much two distributions overlap. The t-test variant weights co-occurrences by the strength of their collocation, using the following function:

    t(w, c) = ( P(c, w) - P(c) * P(w) ) / sqrt( P(c) * P(w) )

This is then used to define the weighted version of the Jaccard coefficient, as shown in (8). The words associated with distributions p and q are indicated by w_p and w_q, respectively.

    jacc_t(p, q) = SUM_c min( t(w_p, c), t(w_q, c) ) / SUM_c max( t(w_p, c), t(w_q, c) )   (8)

sd and jacc_t had previously been found to be the best metrics for other tasks involving lexical similarity; d is included to indicate what can be achieved using a somewhat naive metric.

The second classifier used, Naive Bayes, takes the overall distribution of each class into account. It essentially defines a decision boundary in the form of a curved hyperplane. The Weka implementation (Witten and Frank, 2000) was used for the experiments, with 10-fold cross-validation.

4.3 Results

We began by comparing the performance of the 1NN classifier using the various lexical co-occurrence features against the gold standards. The results using all lexical co-occurrences are shown in Table 5.
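A minimal sketch of the 1NN classifier and the three metrics of Section 4.2 might look as follows. This is illustrative code, not the paper's implementation: distributions are plain dicts, the t-test weights for the Jaccard metric are assumed precomputed and non-negative, and the Jaccard overlap is turned into a distance by taking 1 minus the similarity.

```python
import math

# Illustrative sketch of the metrics (6)-(8) and the 1NN classifier.
# Co-occurrence distributions are dicts mapping a context item to its
# probability; ALPHA is the skew-divergence smoothing constant.
ALPHA = 0.99

def euclidean(p, q):
    """Metric (6): Euclidean distance between two distributions."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def skew_divergence(p, q):
    """Metric (7): KL(p || ALPHA*q + (1-ALPHA)*p); smoothing keeps the log finite."""
    total = 0.0
    for k, pk in p.items():
        if pk > 0.0:
            total += pk * math.log(pk / (ALPHA * q.get(k, 0.0) + (1.0 - ALPHA) * pk))
    return total

def weighted_jaccard(tp, tq):
    """Metric (8) as a distance: 1 minus the t-test weighted overlap.
    tp and tq are assumed precomputed, non-negative t-test weights."""
    keys = set(tp) | set(tq)
    overlap = sum(min(tp.get(k, 0.0), tq.get(k, 0.0)) for k in keys)
    union = sum(max(tp.get(k, 0.0), tq.get(k, 0.0)) for k in keys)
    return 1.0 - overlap / union if union else 1.0

def nearest_neighbour(target, labelled, metric):
    """1NN: assign `target` the class of its nearest labelled neighbour."""
    return min(labelled, key=lambda pair: metric(target, pair[0]))[1]
```

Note that the skew divergence is asymmetric, so the argument order matters; here the unlabelled marker's distribution is passed as the first argument.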

Task          Baseline   All POS (d / sd / jacc_t)   Best single POS (d / sd / jacc_t)    Best subset
polarity        67.4     74.4 / 72.1 / 74.4          76.7 (rb) / 83.7 (rb) / 76.7 (rb)    83.7 [a]
veridicality    73.5     81.6 / 85.7 / 75.5          83.7 (nn) / 91.8 (vb) / 87.8 (vb)    91.8 [b]
type            58.1     74.2 / 64.5 / 81.8          74.2 (in) / 74.2 (rb) / 77.4 (jj)    87.8 [c]

[a] Using sd and either rb or DMs + rb.  [b] Using sd and vb, and sd and vb + in.  [c] Using sd and vb + aux + in.

Table 5: Results using the 1NN classifier on lexical co-occurrences

The baseline was obtained by assigning discourse markers to the largest class, i.e. the class with the most types. The best results obtained using just a single POS class are also shown. The results across the different metrics suggest that adverbs and verbs are the best single predictors of polarity and veridicality, respectively.

We also compared using the following combinations of different parts of speech: vb + aux, vb + in, vb + rb, nn + prp, vb + nn + prp, vb + aux + rb, vb + aux + in, vb + aux + nn + prp, nn + prp + in, DMs + rb, DMs + vb and DMs + rb + vb. The best results obtained using all combinations tried are shown in the last column of Table 5. For DMs + rb, DMs + vb and DMs + rb + vb we also tried weighting the co-occurrences so that the sums of the co-occurrences with each of verbs, adverbs and discourse markers were equal. However, this did not lead to any better results.

We next applied the 1NN classifier to co-occurrences with discourse markers. The results are shown in Table 7, and indicate that for each task 1NN with the weighted Jaccard coefficient performs at least as well as the other three classifiers.

Task          d      sd     jacc_t   Naive Bayes
polarity      74.4   81.4   81.4     81.4
veridicality  83.7   79.6   83.7     73.5
type          74.2   80.1   80.1     58.1

Table 7: Results using co-occurrences with DMs

One property that distinguishes jacc_t from the other metrics is that it weights features by the strength of their collocation. We were therefore interested to see which co-occurrences were most informative. Using Weka's feature selection utility, we ranked discourse marker co-occurrences by their information gain when predicting polarity, veridicality and type. The most informative co-occurrences are listed in Table 6. For example, if also occurs in the subordinate clause then the discourse marker is more likely to be ADDITIVE.

POS-POL: though, but, although, assuming that
NEG-POL: otherwise, still, in truth, after that, in this way, granted that, in contrast, by then, in the event
VERIDICAL: obviously, now, even, indeed, once more, considering that, even after, at first sight
NON-VERIDICAL: or, no doubt, in turn, then, by all means, before then
ADDITIVE: also, in addition, still, only, at the same time, clearly, naturally, now, of course
TEMPORAL: back, once more, like, and, which was why, ...
CAUSAL: again, altogether, back, finally, also, thereby, at once, while, clearly, ...

Table 6: Most informative discourse marker co-occurrences in the superordinate and subordinate clauses

The 1NN and Naive Bayes classifiers were then applied to co-occurrences with just the DMs that were most informative for each task. The results, shown in Table 8, indicate that the performance of 1NN drops when we restrict ourselves to this subset.[4] However, Naive Bayes outperforms all previous 1NN classifiers.

[4] The jacc_t metric is omitted because it essentially already has its own method of factoring in informativity.

Task          Baseline   d      sd     Naive Bayes
polarity      67.4       72.1   69.8   90.7
veridicality  73.5       85.7   77.6   91.8
type          58.1       67.7   58.1   93.5

Table 8: Results using the most informative DMs
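Ranking binary co-occurrence features by information gain, as done above with Weka's feature selection utility, can be sketched as follows. This is an illustrative re-implementation, not the Weka code.

```python
import math
from collections import Counter

# Illustrative sketch of ranking a boolean feature -- here "does marker
# X co-occur in the clause?" -- by its information gain with respect to
# the class labels of the training instances.

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature), for a boolean feature."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```

A feature that splits the classes perfectly receives a gain equal to the class entropy, while an uninformative one scores roughly zero; sorting features by this score yields rankings like those in Tables 6 and 9.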

POS-POL: No significantly informative predictors correlated positively
NEG-POL: NEG-VERBAL (s), NEG-SUBJ (s), ARGS=NONE, MODALITY=<ABILITY,ABILITY>
VERIDICAL: VERB=BE (s), WORDS (s), WORDS (b), MODALITY=<NULL,NULL>
NON-VERIDICAL: TEMPEX, PRONOUN_2nd (s), PRONOUN_2nd (b)
ADDITIVE: WORDS (s), WORDS (b), CLAUSES, MODALITY=<ABILITY,FUTURE>, MODALITY=<ABILITY,ABILITY>, NPS, MODALITY=<FUTURE,FUTURE>, MOOD=<DECLARATIVE,DECLARATIVE>
TEMPORAL: EMBEDDING=7, PRONOUN (s), MOOD=<INTERROGATIVE,DECLARATIVE>
CAUSAL: NEG-SUBJ (b), NEG-VERBAL (b), NPI-WO-NEG (b), NPI-AND-NEG (b), MODALITY=<NULL,FUTURE>

Table 9: The most informative linguistically motivated predictors for each class. The indices (s) and (b) indicate that a one dimensional feature belongs to the superordinate or subordinate clause, respectively.

Weka's feature selection utility was also applied to all the linguistically motivated features described in Section 4.1.2. The most informative features are shown in Table 9. Naive Bayes was then applied using both all the linguistically motivated features, and just the most informative ones. The results are shown in Table 10.

Task          Baseline   All features   Most informative
polarity      67.4       74.4           72.1
veridicality  73.5       77.6           79.6
type          58.1       64.5           77.4

Table 10: Naive Bayes and linguistic features

5 Discussion

The results demonstrate that discourse markers can be classified along three different dimensions with an accuracy of over 90%. The best classifiers used a global algorithm (Naive Bayes), with co-occurrences with a subset of discourse markers as features. The success of Naive Bayes shows that with the right choice of features the classification task is highly separable. The high degree of accuracy attained on the type task suggests that there is empirical evidence for a distinct class of TEMPORAL markers.

The results also provide empirical evidence for the correlation between certain linguistic features and types of discourse relation. Here we restrict ourselves to making just five observations. Firstly, verbs and adverbs are the most informative parts of speech when classifying discourse markers. This is presumably because of their close relation to the main predicate of the clause. Secondly, Table 6 shows that the discourse marker DM in the structure X, but/though/although Y DM Z is more likely to be signalling a positive polarity discourse relation between Y and Z than a negative polarity one. This suggests that a negative polarity discourse relation is less likely to be embedded directly beneath another negative polarity discourse relation. Thirdly, negation correlates with the main clause of NEG-POL discourse markers, and it also correlates with the subordinate clause of CAUSAL ones. Fourthly, NON-VERIDICAL correlates with second person pronouns, suggesting that a writer/speaker is less likely to make assertions about the reader/listener than about other entities. Lastly, the best results with knowledge poor features, i.e. lexical co-occurrences, were better than those with linguistically sophisticated ones. It may be that the sophisticated features are predictive of only certain subclasses of the classes we used, e.g. hypotheticals, or signallers of contrast.

6 Conclusions and future work

We have proposed corpus-based techniques for classifying discourse markers along three dimensions: polarity, veridicality and type. For these tasks we were able to classify with accuracy rates of 90.7%, 91.8% and 93.5% respectively. These equate to error reduction rates of 71.5%, 69.1% and 84.5% from the baseline error rates. In addition, we determined which features were most informative for the different classification tasks.

In future work we aim to extend our work in two directions. Firstly, we will consider finer-grained classification tasks, such as learning whether a causal discourse marker introduces a cause or a consequence, e.g. distinguishing because from so. Secondly, we would like to see how far our results can be extended to include adverbial discourse markers, such as instead or for example, by using just features of the clauses they occur in.

Acknowledgements

I would like to thank Mirella Lapata, Alex Lascarides, Bonnie Webber, and the three anonymous reviewers for their comments on drafts of this paper. This research was supported by EPSRC Grant GR/R40036/01 and a University of Sydney Travelling Scholarship.

References

Nicholas Asher and Alex Lascarides. 2003. Logics of Conversation. Cambridge University Press.

Timothy Baldwin and Francis Bond. 2003. Learning the countability of English nouns from corpus data. In Proceedings of ACL 2003, pages 463-470.

Yves Bestgen, Liesbeth Degand, and Wilbert Spooren. 2003. On the use of automatic techniques to determine the semantics of connectives in large newspaper corpora: An exploratory study. In Proceedings of the MAD'03 workshop on Multidisciplinary Approaches to Discourse, October.

Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2000), Seattle, Washington, USA.

James R. Curran and M. Moens. 2002. Improvements in automatic thesaurus extraction. In Proceedings of the Workshop on Unsupervised Lexical Acquisition, pages 59-67, Philadelphia, PA, USA.

Barbara Di Eugenio, Johanna D. Moore, and Massimo Paolucci. 1997. Learning features that predict cue usage. In Proceedings of the 35th Conference of the Association for Computational Linguistics (ACL97), Madrid, Spain, July.

Katherine Forbes, Eleni Miltsakaki, Rashmi Prasad, Anoop Sarkar, Aravind Joshi, and Bonnie Webber. 2001. D-LTAG system: discourse parsing with a lexicalised tree adjoining grammar. In Proceedings of the ESSLLI 2001 Workshop on Information Structure, Discourse Structure, and Discourse Semantics, Helsinki, Finland.

Brigitte Grote and Manfred Stede. 1998. Discourse marker choice in sentence planning. In Eduard Hovy, editor, Proceedings of the Ninth International Workshop on Natural Language Generation, pages 128-137. Association for Computational Linguistics, New Brunswick, New Jersey.

M. Halliday and R. Hasan. 1976. Cohesion in English. Longman.

Ben Hutchinson. 2003. Automatic classification of discourse markers by their co-occurrences. In Proceedings of the ESSLLI 2003 workshop on Discourse Particles: Meaning and Implementation, Vienna, Austria.

Ben Hutchinson. 2004. Mining the web for discourse markers. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.

Alistair Knott and Robert Dale. 1994. Using linguistic phenomena to motivate a set of coherence relations. Discourse Processes, 18(1):35-62.

Alistair Knott. 1996. A data-driven methodology for motivating a set of coherence relations. Ph.D. thesis, University of Edinburgh.

Mirella Lapata and Alex Lascarides. 2004. Inferring sentence-internal temporal relations. In Proceedings of the Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics Annual Meeting, Boston, MA.

Alex Lascarides and Nicholas Asher. 1993. Temporal interpretation, discourse relations and common sense entailment. Linguistics and Philosophy, 16(5):437-493.

Lillian Lee. 2001. On the effectiveness of the skew divergence for statistical language analysis. Artificial Intelligence and Statistics, pages 65-72.

Max M. Louwerse. 2001. An analytic and cognitive parameterization of coherence relations. Cognitive Linguistics, 12(3):291-315.

Inderjeet Mani and George Wilson. 2000. Robust temporal processing of news. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), pages 69-76, New Brunswick, New Jersey.

Daniel Marcu and Abdessamad Echihabi. 2002. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), Philadelphia, PA.

Daniel Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. The MIT Press.

Jim Martin. 1992. English Text: System and Structure. Benjamin, Amsterdam.

M. Moser and J. Moore. 1995. Using discourse analysis and automatic text generation to study discourse cue usage. In Proceedings of the AAAI 1995 Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pages 92-98.

Jon Oberlander and Alistair Knott. 1995. Issues in cue phrase implicature. In Proceedings of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation.

Ted J. M. Sanders, W. P. M. Spooren, and L. G. M. Noordman. 1992. Towards a taxonomy of coherence relations. Discourse Processes, 15:1-35.

Suzanne Stevenson and Paola Merlo. 1999. Automatic verb classification using distributions of grammatical features. In Proceedings of the 9th Conference of the European Chapter of the ACL, pages 45-52, Bergen, Norway.

Bonnie Webber, Matthew Stone, Aravind Joshi, and Alistair Knott. 2003. Anaphora and discourse structure. Computational Linguistics, 29(4):545-588.

Ian H. Witten and Eibe Frank. 2000. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco.