Outline: Overview · Discourse segmentation · Discourse coherence theories · Penn Discourse Treebank 2.0 · Unsupervised coherence · Conclusion

Discourse structure and coherence

Christopher Potts

CS 224U: Natural Language Understanding, March 1


Discourse segmentation and discourse coherence

1 Discourse segmentation: chunking texts into coherent units. (Also: chunking separate documents)

2 (Local) discourse coherence: characterizing the meaning relationships between clauses in text.


Discourse segmentation examples

• The inverted pyramid design
• PubMed highly structured abstract
• PubMed less structured abstract
• 5-star Amazon review
• 3-star Amazon review

Discourse segmentation applications (complete in class)



Coherence examples

1 Sam brushed his teeth. He got into bed. He felt a certain ennui.

2 Sue was feeling ill. She decided to stay home from work.

3 Sue likes bananas. Jill does not.

4 The senator introduced a new initiative. He hoped to please undecided voters.

5 Linguists like quantifiers. In his lectures, Richard talked only about every and most.

6 In his lectures, Richard talked only about every and most. Linguists like quantifiers.

7 A: Sue isn’t here. B: She is feeling ill.

8 A: Where is Bill? B: In Bytes Café.

9 A: Pass the cake mix. B: Here you go. (Stone 2002)


Coherence examples, with the relations made explicit

1 Sam brushed his teeth. [then] He got into bed. [then] He felt a certain ennui.

2 Sue was feeling ill. [so] She decided to stay home from work.

3 Sue likes bananas. [but] Jill does not.

4 The senator introduced a new initiative. [because] He hoped to please undecided voters.

5 Linguists like quantifiers. [for example] In his lectures, Richard talked only about every and most.

6 In his lectures, Richard talked only about every and most. [in general] Linguists like quantifiers.

7 A: Sue isn’t here. B: [because] She is feeling ill.

8 A: Where is Bill? B: [answer] In Bytes Café.

9 A: Pass the cake mix. B: [fulfillment] Here you go. (Stone 2002)


Coherence in linguistics

Extremely important sub-area:

• Driving force behind coreference resolution (Kehler et al. 2007).
• Driving force behind the licensing conditions on ellipsis (Kehler 2000, 2002).
• Alternative strand of explanation for the inferences that are often treated as conversational implicatures in Gricean pragmatics (Hobbs 1979).
• Motivation for viewing meaning as a dynamic, discourse-level phenomenon (Asher and Lascarides 2003).

For an overview of topics, results, and theories, see Kehler 2004.


Coherence applications in NLP (complete in class)


Plan and goals

Plan
• Unsupervised and supervised discourse segmentation
• Discourse coherence theories
• Introduction to the Penn Discourse Treebank 2.0
• Unsupervised discovery of coherence relations

Goals
• Discourse segmentation: practical, easy-to-implement algorithms that can improve lots of tasks.
• Discourse coherence: a deep, important, challenging task that has to be solved if we are to achieve robust NLU.


Discourse segmentation

Hearst TextTiling

From Hearst (1997), “TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages,” Computational Linguistics 23(1):

“TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization.”

Hearst’s running example is Stargazers, a 21-paragraph science news article whose main topic is the existence of life on earth and other planets. Its subtopic structure (numbers indicate paragraphs):

1–3    Intro: the search for life in space
4–5    The moon’s chemical composition
6–8    How early earth–moon proximity shaped the moon
9–12   How the moon helped life evolve on earth
13     Improbability of the earth–moon system
14–16  Binary/trinary star systems make life unlikely
17–18  The low probability of nonbinary/trinary systems
19–20  Properties of earth’s sun that facilitate life
21     Summary

[Figures 5 and 6 of Hearst 1997, not reproduced: the subtopic-boundary judgments of seven readers on the Stargazer text, and the block-similarity scores of the TextTiling algorithm on the same text (k = 10, loose boundary cutoff), whose chosen boundaries align well with the readers’ boundary gaps.]

The TextTiling algorithm (Hearst 1994, 1997)

Tokenize the text into sentences (or pseudosentences) s1, s2, s3, …. For each candidate boundary, sum the word-count vectors of the block of sentences to its left and the block to its right, and score the boundary via the cosine similarity between the two block vectors. This yields a score vector S = (b1,2, b2,3, b3,4, …).

1 Smooth S using average smoothing over window size a to get Ŝ.
2 Set the number of boundaries B as (µ(Ŝ) − σ(Ŝ)) / 2.
3 Score each boundary bi by its depth, (bi−1 − bi) + (bi+1 − bi).
4 Choose the top B boundaries by these scores.

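To make the procedure concrete, here is a minimal Python sketch of the same idea (my own simplification, not Hearst's released implementation): whitespace tokenization, block cosine scores at each inter-sentence gap, average smoothing, depth scores, and selection of the top-B gaps. B is left as a parameter here; the slide derives it from the mean and standard deviation of the smoothed scores.

from collections import Counter
import math

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors (dicts)."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def texttiling(sentences, n_boundaries, block_size=3, smooth_window=1):
    """Return gap indices i such that a topic boundary is hypothesized
    between sentences[i-1] and sentences[i]."""
    counts = [Counter(s.lower().split()) for s in sentences]
    gaps = list(range(1, len(sentences)))
    # Score each gap by the cosine similarity of the blocks on either side.
    scores = []
    for g in gaps:
        left = sum(counts[max(0, g - block_size):g], Counter())
        right = sum(counts[g:g + block_size], Counter())
        scores.append(cosine(left, right))
    # Average smoothing over a small window around each gap.
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - smooth_window), min(len(scores), i + smooth_window + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    # Depth score: how far each gap's similarity dips below its neighbors'.
    depths = []
    for i, s in enumerate(smoothed):
        left = smoothed[i - 1] if i > 0 else s
        right = smoothed[i + 1] if i + 1 < len(smoothed) else s
        depths.append((left - s) + (right - s))
    # Keep the gaps with the largest depth scores.
    best = sorted(zip(depths, gaps), reverse=True)[:n_boundaries]
    return sorted(g for _, g in best)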

Dotplotting (Reynar 1994, 1998)

Example: two concatenated texts, 11 word positions in all (the true document break is after position 5):

bulldogs(1) bulldogs(2) fight(3) also(4) fight(5)  buffalo(6) that(7) buffalo(8) buffalo(9) also(10) buffalo(11)

Where word w appears in positions x and y anywhere in the concatenated text, add points (x, x), (y, y), (x, y), and (y, x):

[Dotplot for the example, not reproduced: both axes give position in the concatenated docs (1–11), and a dot at (x, y) marks a word shared by positions x and y. Repeated vocabulary within each document yields dense square regions along the diagonal; the document boundary shows up as a low-density gap between them.]


Definition (minimize the density of the regions around the boundaries)
• n = the length of the concatenated texts
• m = the vocabulary size
• Boundaries initialized as [0]
• P = Boundaries plus the candidate boundary positions
• Vx,y = vector of length m containing the number of times each vocab item occurs between positions x and y

For a desired number of boundaries B, use dynamic programming to find the B indices that minimize

    Σ_{j=2}^{|P|}  (V_{P(j−1),P(j)} · V_{P(j),n}) / ((P(j) − P(j−1)) (n − P(j)))

Examples (Vocab = (also, buffalo, bulldogs, fight, that)):

P = [0, 5]:  ([1, 0, 2, 2, 0] · [1, 4, 0, 0, 1]) / ((5 − 0)(11 − 5)) = 1/30 ≈ 0.03
P = [0, 6]:  ([1, 1, 2, 2, 0] · [1, 3, 0, 0, 1]) / ((6 − 0)(11 − 6)) = 4/30 ≈ 0.13

(A code sketch of this objective appears after the next slide.)

Divisive clustering (Choi 2000)

1 Compare all sentences pairwise for cosine similarity, to create a matrix of similarity values.
2 For each value s, find the n × n submatrix Ns with s at its center and replace s with its local rank, |{s′ ∈ Ns : s′ < s}| / n².
3 Apply something akin to Reynar’s algorithm to find the cluster boundaries (which are clearer as a result of the local rank smoothing).

[Figures from Choi 2000, not reproduced: an example similarity matrix, a worked example of image ranking, and the matrix after ranking.]

Choi (2000) reports substantial accuracy gains over both TextTiling and dotplotting.

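Returning to Reynar's density objective above, here is a small Python sketch (my own function and variable names) that reproduces the slide's worked example; it implements only the scoring of a candidate boundary set, not the dynamic-programming search.

from collections import Counter

def segment_density(words, boundaries):
    """Reynar-style objective from the slide: for each boundary, dot the
    word-count vector of the region before it with the word-count vector of
    the region from the boundary to the end of the text, normalize by the
    product of the two region lengths, and sum over boundaries (lower is
    better). Boundaries are positions strictly inside the text."""
    n = len(words)
    P = [0] + sorted(boundaries)
    total = 0.0
    for j in range(1, len(P)):
        before = Counter(words[P[j - 1]:P[j]])
        after = Counter(words[P[j]:n])
        dot = sum(before[w] * after[w] for w in before.keys() & after.keys())
        total += dot / ((P[j] - P[j - 1]) * (n - P[j]))
    return total

text = ("bulldogs bulldogs fight also fight "
        "buffalo that buffalo buffalo also buffalo").split()
print(round(segment_density(text, [5]), 2))   # 0.03 -- the true break scores lower
print(round(segment_density(text, [6]), 2))   # 0.13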

Supervised

1 Label segment boundaries in training and test set.

2 Extract features in training: generally a superset of the features used by unsupervised approaches.

3 Fit a classifier model (NaiveBayes, MaxEnt, SVM, . . . ).

4 In testing, extract the same features and apply the classifier to predict boundaries.

(Manning 1998; Beeferman et al. 1999; Sharp and Chibelushi 2008)

(Slide from Dan Jurafsky.)


Evaluation: WindowDiff (Pevzner and Hearst 2002)

Definition (WindowDiff)
• b(i, j) = the number of boundaries between text positions i and j
• N = the number of sentences

WindowDiff(ref, hyp) = (1 / (N − k)) Σ_{i=1}^{N−k} ( |b(ref_i, ref_{i+k}) − b(hyp_i, hyp_{i+k})| > 0 )

Return values: 0 = all labels correct; 1 = no labels correct

(Jurafsky and Martin 2009: §21)
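A small self-contained sketch in Python (my own boundary encoding, not code from the slides or from Jurafsky and Martin): ref and hyp are 0/1 vectors over sentence gaps, and k is typically set to half the average reference segment length.

def window_diff(ref, hyp, k):
    """WindowDiff (Pevzner and Hearst 2002): the fraction of sliding windows
    of width k in which the reference and hypothesis disagree on the number
    of boundaries they contain."""
    assert len(ref) == len(hyp)
    n = len(ref)
    disagreements = 0
    for i in range(n - k):
        if sum(ref[i:i + k]) != sum(hyp[i:i + k]):
            disagreements += 1
    return disagreements / (n - k)

# Identical segmentations score 0; maximally wrong ones approach 1.
ref = [0, 1, 0, 0, 1, 0, 0, 0]
print(window_diff(ref, ref, k=2))                       # 0.0
print(window_diff(ref, [0, 0, 1, 0, 0, 0, 1, 0], k=2))  # ~0.83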

Discourse coherence theories

• Halliday and Hasan (1976): Additive, Temporal, Causal, Adversative
• Longacre (1983): Conjoining, Temporal, Implication, Alternation
• Martin (1992): Addition, Temporal, Consequential, Comparison
• Kehler (2002): Result, Explanation, Violated Expectation, Denial of Preventer, Parallel, Contrast (i), Contrast (ii), Exemplification, Generalization, Exception (i), Exception (ii), Elaboration, Occasion (i), Occasion (ii)
• Hobbs (1985): Occasion, Cause, Explanation, Evaluation, Background, Exemplification, Elaboration, Parallel, Contrast, Violated Expectation
• Wolf and Gibson (2005): Condition, Violated Expectation, Similarity, Contrast, Elaboration, Example, Generalization, Attribution, Temporal Sequence, Same


Rhetorical Structure Theory (RST)

Relations hold between adjacent spans of text: the nucleus and the satellite. Each relation has five fields: constraints on nucleus, constraints on satellite, constraints on nucleus–satellite combination, effect, and locus of effect.

(Mann and Thompson 1988)


Coherence structures

From Wolf and Gibson (2005)

1 a. Mr. Baker’s assistant for inter-American affairs, b. Bernard Aronson

2 while maintaining

3 that the Sandinistas had also broken the cease-fire,

4 acknowledged:

5 “It’s never very clear who starts what.”

[Figure 4 of Wolf and Gibson 2005: coherence graph for example (23). expv = violated expectation; elab = elaboration; attr = attribution.]

Figure 5 Coherence graph for example (23) with discourse segment 1 split into two segments. expv = violated expectation; elab = elaboration; attr = attribution.

17 / 48

Figure 6 Tree-based RST annotation for example (23) from Carlson, Marcu, and Okurowski (2002). Broken lines represent the start of asymmetric coherence relations; continuous lines represent the end of asymmetric coherence relations; symmetric coherence relations have two continuous lines (cf. section 2.3). attr = attribution; elab = elaboration.



Features for coherence recognition (complete in class)

• Addition
• Temporal
• Contrast
• Causation


The Penn Discourse Treebank 2.0 (Webber et al. 2003)

• Large-scale effort to identify the coherence relations that hold between pieces of information in discourse.
• Available from the Linguistic Data Consortium.
• Annotators identified the spans of text that signal coherence relations. Where the relation was implicit, they picked their own lexical items to fill the role.

Example

[Arg1 that hung over parts of the factory ] even though

[Arg2 exhaust fans ventilated the area ].


A complex example

[Arg1 Factory orders and construction outlays were largely flat in December ] while purchasing agents said

[Arg2 manufacturing shrank further in October ].


The overall structure of examples

Don’t try to take it all in at once. It’s too big! Figure out what question you want to address and then focus on the parts of the corpus that matter for it. A brief run-down:

• Relation-types: Explicit, Implicit, AltLex, EntRel, NoRel
• Connective semantics: hierarchical; lots of levels of granularity to work with, from four abstract classes down to clusters of phrases and lexical items
• Attribution: tracking who is committed to what
• Structure: every piece of text is associated with a set of subtrees from the WSJ portion of the Penn Treebank 3.
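As a mental model of what one annotation token carries, here is a hypothetical record type; the field names are mine, not those of the official distribution or of any particular loading tool.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PDTBRelation:
    """Hypothetical shape of one PDTB 2.0 annotation token."""
    relation_type: str                   # 'Explicit' | 'Implicit' | 'AltLex' | 'EntRel' | 'NoRel'
    connective: Optional[str]            # overt or annotator-supplied connective, if any
    semclass: Optional[Tuple[str, ...]]  # e.g., ('Comparison', 'Contrast', 'Juxtaposition')
    arg1_text: str
    arg2_text: str
    arg1_attribution: Optional[str]      # who is committed to Arg1
    arg2_attribution: Optional[str]
    wsj_section: str                     # links back to the Penn Treebank 3 WSJ trees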


Connectives

PDTB relation   Examples
Explicit          18,459
Implicit          16,053
AltLex               624
EntRel             5,210
NoRel                254
Total             40,600


Explicit connectives

[Arg1 that hung over parts of the factory ] even though

[Arg2 exhaust fans ventilated the area ].


Implicit connectives

[Arg1 Some have raised their cash positions to record levels ]. Implicit = BECAUSE

[Arg2 High cash positions help buffer a fund when the market falls ].


AltLex connectives

[Arg1 Ms. Bartlett’s previous work, which earned her an international reputation in the non-horticultural art world, often took gardens as its nominal subject ].

[Arg2 Mayhap this metaphorical connection made the BPC Fine Arts Committee think she had a literal green thumb ].


Connectives and their semantics

Figure 1: Hierarchy of sense tags

(from Prasad et al. 2008)

The relationship between relation-types and connectives

            Comparison   Contingency   Expansion   Temporal
AltLex              46           275         217         86
Explicit          5471          3250        6298       3440
Implicit          2441          4185        8601        826


The distribution of semantic classes


Connectives by relation type

(a) Explicit. (b) Implicit.

(c) AltLex.

Figure: Wordle representations of the connectives, by relation type.


EntRel and NoRel

[Arg1 Hale Milgrim, 41 years old, senior vice president, marketing at Elecktra Entertainment Inc., was named president of Capitol Records Inc., a unit of this entertainment concern ].

[Arg2 Mr. Milgrim succeeds David Berman, who resigned last month ].


Arguments


Attributions

[Arg1 Factory orders and construction outlays were largely flat in December ] while (Comparison:Contrast:Juxtaposition) purchasing agents said

[Arg2 manufacturing shrank further in October ].


Attributions

Attribution strings:

researchers said
A Lorillard spokeswoman said
said Darrell Phillips, vice president of human resources for Hollingsworth & Vose
Longer maturities are thought
Shorter maturities are considered
considered by some
said Brenda Malizia Negus, editor of Money Fund Report
the Treasury said
Newsweek said
said Mr. Spoon
According to Audit Bureau of Circulations
saying that . . .


Some informal experimental results: experimental set-up

• Training set of 2,400 examples: 600 randomly chosen examples from each of the four primary PDTB semantic classes: Comparison, Contingency, Expansion, Temporal.
• Test set of 800 examples: 200 randomly chosen examples from each of the four primary semantic classes.
• The students in my LSA class ‘Computational Pragmatics’ formed two teams, and I was a team of one; each team specified features, which I implemented using NLTK’s MaxEnt interface.


Some informal experimental results: Team Potts

Accuracy: 0.41 Feature count: 632,559 Train set accuracy: 1.0

1 Verb pairs: features for verb pairs (V1, V2) where V1 was drawn from Arg1 and V2 from Arg2.

2 Inquirer pairs: features for the cross product of the Harvard Inquirer semantic classes for Arg1 and Arg2 (after Pitler et al. 2009).
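A hedged sketch of how these two feature families might be realized as NLTK-style dictionary features; the POS-based verb detection and the inquirer_classes mapping (word → set of Harvard Inquirer tags) are stand-ins of mine, not the code behind the numbers above.

def verb_pair_features(arg1_tagged, arg2_tagged):
    """Boolean features for verb pairs (V1, V2) across the two Args.
    Inputs are lists of (word, POS) pairs; Penn Treebank tags starting
    with 'VB' count as verbs here."""
    verbs1 = [w.lower() for w, pos in arg1_tagged if pos.startswith("VB")]
    verbs2 = [w.lower() for w, pos in arg2_tagged if pos.startswith("VB")]
    return {"verbpair={}_{}".format(v1, v2): True for v1 in verbs1 for v2 in verbs2}

def inquirer_pair_features(arg1_words, arg2_words, inquirer_classes):
    """Features for the cross product of Harvard Inquirer classes seen in
    Arg1 and Arg2; inquirer_classes maps a lowercased word to a set of
    Inquirer category names."""
    c1 = set().union(*(inquirer_classes.get(w.lower(), set()) for w in arg1_words))
    c2 = set().union(*(inquirer_classes.get(w.lower(), set()) for w in arg2_words))
    return {"inqpair={}_{}".format(a, b): True for a in c1 for b in c2}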


Some informal experimental results: Team Banana Wugs

Accuracy: 0.34 Feature count: 116 Train set accuracy: 0.37

1 Negation: features capturing (sentential and constituent) negation balances and imbalances across the Args.

2 Sentiment: A separate sentiment score for each Arg.

3 Overlap: the cardinality of the intersection of the Arg1 and Arg2 word sets divided by the cardinality of their union.

4 Structural complexity: features capturing, for each Arg, whether it has an embedded clause, the number of embedded clauses, and the height of its largest tree.

5 Complexity ratios: a feature for log of the ratio of the lengths (in words) of the two Args, a feature for the ratio of the clause-counts for the two Args, and a feature for the ratio of the max heights for the two Args.

6 Pronominal subjects: a pair-feature capturing whether the subject of each Arg is pronominal (pro) or non-pronominal (non-pro). The features are pairs from {pro, non-pro} × {pro, non-pro}.

7 It seems: a feature that returns False if the second Arg does not begin with “it seems”.

8 Tense agreement: a feature for the degree to which the verbal nodes in the two Args have the same tense.

9 Modals: a pair-feature capturing whether each Arg contains a modal (modal) or not (non-modal). The features are pairs from {modal, non-modal} × {modal, non-modal}.
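For instance, the overlap and complexity-ratio ideas above reduce to a few lines (a sketch with my own naming; real Arg tokenization and parse-based counts would replace the plain word lists):

import math

def overlap_and_ratio_features(arg1_words, arg2_words):
    """Word-overlap ratio (intersection over union of the Args' vocabularies)
    and the log of the Arg length ratio."""
    v1 = {w.lower() for w in arg1_words}
    v2 = {w.lower() for w in arg2_words}
    overlap = len(v1 & v2) / len(v1 | v2) if (v1 | v2) else 0.0
    log_len_ratio = math.log(len(arg1_words) / len(arg2_words))
    return {"overlap": overlap, "log_len_ratio": log_len_ratio}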


Some informal experimental results: Team Banana Slugs

Accuracy: 0.38 Feature count: 1,824 Train set accuracy: 0.73

1 Negation: for each Arg, a feature for whether it was negated and the number of negations it contains. Also, a feature capturing negation balance/imbalance across the Args.

2 Main verbs: for each Arg, a feature for its main verb. Also, a feature returning True if the two Args’ main verbs match, else False.

3 Length ratio: a feature for the ratio of the lengths (in words) of Arg1 and Arg2.

4 WordNet antonyms: the number of words in Arg2 that are antonyms of a word in Arg1.

5 Genre: a feature for the genre of the file containing the example.

6 Modals: for each Arg, the number of modals in it.

7 WordNet hypernym counts: for Arg1, a feature for the number of words in Arg2 that are hypernyms of a word in Arg1, and ditto for Arg2.

8 N-gram features: for each Arg, a feature for each unigram it contains. (The team suggested going to 2- or 3-grams, but I called a halt at 1 because the data-set is not that big.)
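The WordNet antonym feature could be sketched with NLTK's WordNet interface as follows (assuming the WordNet data is installed; no POS filtering or lemmatization of Arg2, so this is rougher than a real implementation would be):

from nltk.corpus import wordnet as wn

def antonym_count(arg1_words, arg2_words):
    """Number of words in Arg2 that are WordNet antonyms of some word in Arg1."""
    antonyms = set()
    for w in arg1_words:
        for synset in wn.synsets(w.lower()):
            for lemma in synset.lemmas():
                for ant in lemma.antonyms():
                    antonyms.add(ant.name().lower())
    return sum(1 for w in arg2_words if w.lower() in antonyms)

print(antonym_count(["sales", "rise"], ["profits", "fall"]))  # 1: 'fall' is an antonym of 'rise'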


Some informal experimental results: Who won?

Team Potts           Accuracy: 0.41   Feature count: 632,559   Train set accuracy: 1.0

Team Banana Wugs     Accuracy: 0.34   Feature count: 116       Train set accuracy: 0.37

Team Banana Slugs    Accuracy: 0.38   Feature count: 1,824     Train set accuracy: 0.73


Unsupervised discovery of coherence relations (Marcu and Echihabi 2002)

Marcu and Echihabi (2002) focus on four coherence relations that can be informally mapped to coherence relations from other theories:

A possible PDTB mapping was also given (in red on the original slide); one might want to use the supercategories.

Relation definitions, following Marcu and Echihabi’s Table 1 (unions of definitions proposed by other researchers; M&T = Mann and Thompson 1988, Ho = Hobbs 1990, A&L = Lascarides and Asher 1993, K&S = Knott and Sanders 1998):

CONTRAST: ANTITHESIS (M&T), CONCESSION (M&T), OTHERWISE (M&T), CONTRAST (M&T), VIOLATED EXPECTATION (Ho), (CAUSAL or ADDITIVE)-(SEMANTIC or PRAGMATIC)-NEGATIVE (K&S)

CAUSE-EXPLANATION-EVIDENCE: EVIDENCE (M&T), VOLITIONAL-CAUSE (M&T), NONVOLITIONAL-CAUSE (M&T), VOLITIONAL-RESULT (M&T), NONVOLITIONAL-RESULT (M&T), EXPLANATION (Ho), RESULT (A&L), EXPLANATION (A&L), CAUSAL-(SEMANTIC or PRAGMATIC)-POSITIVE (K&S)

ELABORATION: ELABORATION (M&T), EXPANSION (Ho), EXEMPLIFICATION (Ho), ELABORATION (A&L)

CONDITION: CONDITION (M&T)

Automatically collected labels

Data
• RAW: 41 million sentences (≈1 billion words) from a variety of LDC corpora
• BLIPP: 1.8 million Charniak-parsed sentences

Extraction patterns (after Marcu and Echihabi’s Table 2; BOS/EOS mark sentence boundaries, “…” is arbitrary material, and the brackets delimit the two spans):

CONTRAST – 3,881,588 examples
  [BOS … EOS] [BOS But … EOS]
  [BOS … ] [but … EOS]
  [BOS … ] [although … EOS]
  [BOS Although … ,] [ … EOS]

CAUSE-EXPLANATION-EVIDENCE – 889,946 examples
  [BOS … ] [because … EOS]
  [BOS Because … ,] [ … EOS]
  [BOS … EOS] [BOS Thus, … EOS]

CONDITION – 1,203,813 examples
  [BOS If … ,] [ … EOS]
  [BOS If … ] [then … EOS]
  [BOS … ] [if … EOS]

ELABORATION – 1,836,227 examples
  [BOS … EOS] [BOS … for example … EOS]
  [BOS … ] [which … ,]

NO-RELATION-SAME-TEXT – 1,000,000 examples
  Randomly extract two sentences that are more than 3 sentences apart in a given text.

NO-RELATION-DIFFERENT-TEXTS – 1,000,000 examples
  Randomly extract two sentences from two different documents.

Labeling method
1 Extract all sentences matching one of the patterns.
2 Label the example with the relation name of the pattern.
3 Treat everything before the connective as Arg1 and everything after it as Arg2.

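In code, the heart of the labeling step is just cue-phrase matching. Below is a rough sketch (my own regex renderings of a few of the patterns; the real extraction ran over sentence-segmented, and for BLIPP parsed, corpus text):

import re

PATTERNS = [
    ("CONTRAST", re.compile(r"^Although (?P<arg1>.+?), (?P<arg2>.+)$")),
    ("CONTRAST", re.compile(r"^(?P<arg1>.+?),? but (?P<arg2>.+)$")),
    ("CAUSE-EXPLANATION-EVIDENCE",
     re.compile(r"^(?P<arg1>.+?),? because (?P<arg2>.+)$")),
    ("CONDITION", re.compile(r"^If (?P<arg1>.+?), (?:then )?(?P<arg2>.+)$")),
]

def auto_label(sentence):
    """Return (relation, Arg1, Arg2) for the first matching cue-phrase
    pattern, with the connective itself dropped; None if nothing matches."""
    for relation, pattern in PATTERNS:
        m = pattern.match(sentence.strip())
        if m:
            return relation, m.group("arg1"), m.group("arg2")
    return None

print(auto_label("Although exhaust fans ventilated the area, smoke hung over the factory."))
# ('CONTRAST', 'exhaust fans ventilated the area', 'smoke hung over the factory.')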

Naive Bayes model

1 count(wi , wj , r)=the number of times that word wi occurs in Arg1 and wj occurs in Arg2 with coherence relation r.

2 W = the full vocabulary

3 R = the set of coherence relations

4 N = Σ_{(wi, wj) ∈ W×W, r ∈ R} count(wi, wj, r)

5 P(r) = Σ_{(wi, wj) ∈ W×W} count(wi, wj, r) / N

6 Estimate P((wi, wj) | r) with

      (count(wi, wj, r) + 1) / (Σ_{(wx, wy) ∈ W×W} count(wx, wy, r) + N)

7 Maximum likelihood estimate for an example with W1 the words in Arg1 and W2 the words in Arg2:

      argmax_r  P(r) · Π_{(wi, wj) ∈ W1×W2} P((wi, wj) | r)

(Connectives are excluded from these calculations, since they were used to obtain the labels.)

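A compact sketch of this classifier (my own class and variable names), following the slide's add-one smoothing over word pairs; connectives are assumed to have been stripped from the Args already:

from collections import Counter, defaultdict
import math

class WordPairNaiveBayes:
    """Word-pair Naive Bayes over (Arg1 word, Arg2 word) pairs."""

    def fit(self, examples):
        """examples: iterable of (arg1_words, arg2_words, relation) triples."""
        self.pair_counts = defaultdict(Counter)   # r -> counts of (w1, w2) pairs
        self.relation_totals = Counter()          # r -> total pair count for r
        for arg1, arg2, rel in examples:
            for w1 in arg1:
                for w2 in arg2:
                    self.pair_counts[rel][(w1, w2)] += 1
                    self.relation_totals[rel] += 1
        self.N = sum(self.relation_totals.values())   # grand total, N on the slide
        self.log_prior = {r: math.log(c / self.N)
                          for r, c in self.relation_totals.items()}
        return self

    def pair_logprob(self, pair, rel):
        # Add-one smoothing, as in step 6 on the slide.
        return math.log((self.pair_counts[rel][pair] + 1) /
                        (self.relation_totals[rel] + self.N))

    def predict(self, arg1, arg2):
        scores = {}
        for rel in self.relation_totals:
            score = self.log_prior[rel]
            for w1 in arg1:
                for w2 in arg2:
                    score += self.pair_logprob((w1, w2), rel)
            scores[rel] = score
        return max(scores, key=scores.get)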

Results for pairwise classifiers

Table 3: Performances (accuracy, %) of classifiers trained on the Raw corpus. The baseline in all cases is 50%.

                CEV   COND   ELAB   NO-REL-SAME   NO-REL-DIFF
CONTRAST         87     74     82            64            64
CEV                     76     93            75            74
COND                           89            69            71
ELAB                                          76            75
NO-REL-SAME                                                 64

Table 4: Performances (accuracy, %) of classifiers trained on the BLIPP corpus. The baseline in all cases is 50%.

                CEV   COND   ELAB   NO-REL-SAME   NO-REL-DIFF
CONTRAST         62     58     78            64            72
CEV                     69     82            64            68
COND                           78            63            65
ELAB                                          78            78
NO-REL-SAME                                                 66

Systems trained on the smaller, higher-precision BLIPP corpus have lower overall accuracy, but they perform better with less data than those trained on the RAW corpus.

[Figure 1 of Marcu and Echihabi 2002, not reproduced: learning curves for the ELABORATION vs. CAUSE-EXPLANATION-EVIDENCE classifiers, trained on the Raw and BLIPP corpora.]

Results for the RST corpus of Carlson et al. 2001

For this experiment, the classifiers were trained on the RAW corpus, with the connectives included as features. Only RST examples involving (approximations of) the four relations used above were in the test set, and the classifiers were evaluated on manually labeled RST relations that hold between adjacent elementary discourse units (performance against majority-class baselines).

                 CONTR   CEV   COND   ELAB
# test cases       238   307    125   1761

[The full accuracy/baseline table is not reproduced here.]

Identifying implicit relations: the RAW-trained classifier is able to accurately guess a large number of implicit examples, essentially because it saw similar examples with an overt connective (which served as the label).

In sum: an example of the “unreasonable effectiveness of data” (Banko and Brill 2001; Halevy et al. 2009).
We did this because when trying to deter- ORATION classifier on these examples, we can la- mine whether a CONTRAST relation holds between bel correctly 60 of the 61 cue-phrase marked re- two spans of texts separated by the cue phrase “but”, lations and, in addition, we can also label 123 of for example, we want to take advantage of the cue the 177 relations that are not marked explicitly with phrase occurrence as well. We employed our clas- cue phrases. This means that our classifier con- sifiers on the manually labeled examples extracted tributes to an increase in accuracy from from Carlson et al.’s corpus (2001). Table 5 displays to !!! Similarly, out the performance of our two way classifiers for rela- of the 307 CAUSE-EXPLANATION-EVIDENCE rela- tions defined over elementary discourse units. The tions that hold between two discourse units in Carl- table displays in the second row, for each discourse son et al.’s corpus, only 79 are explicitly marked. relation, the number of examples extracted from the A program trained only on Carlson et al.’s cor- RST corpus. For each binary classifier, the table lists pus, would, therefore, identify at most 79 of the in bold the accuracy of our classifier and in non-bold 307 relations correctly. When we run our CAUSE- font the majority baseline associated with it. EXPLANATION-EVIDENCE vs. ELABORATION clas- The results in Table 5 show that the classifiers sifier on these examples, we labeled correctly 73 learned from automatically generated training data of the 79 cue-phrase-marked relations and 102 of Overview Discourse segmentation Discourse coherence theories Penn Discourse Treebank 2.0 Unsupervised coherence Conclusion
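As a quick sanity check on the figures above, the arithmetic below recovers the majority baselines behind Table 5 and the accuracies implied by the CONTRAST counts quoted from Carlson et al.’s corpus. The counts come from the excerpt; the function name and the rounded percentages are illustrative, not quoted from the paper.

```python
def majority_baseline(n_a, n_b):
    """Majority-class baseline for a two-way classifier evaluated on
    n_a + n_b test cases (the non-bold numbers in Table 5)."""
    return max(n_a, n_b) / (n_a + n_b)

# CONTR (238 test cases) vs. ELAB (1761): ~0.88, matching the table.
print(round(majority_baseline(238, 1761), 2))         # 0.88

# CONTRAST relations in Carlson et al. (2001): 238 total, 61 cue-marked,
# with 60 of the marked and 123 of the unmarked labeled correctly.
total, marked = 238, 61
correct_marked, correct_unmarked = 60, 123

# Ceiling for a system that can only exploit explicit cue phrases:
print(round(marked / total, 2))                        # 0.26

# Accuracy implied for the Raw-trained CONTRAST vs. ELABORATION
# classifier on the same 238 examples:
print(round((correct_marked + correct_unmarked) / total, 2))  # 0.77
```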

Data and tools

• Penn Discourse Treebank 2.0
  • LDC: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T05

  • Project page: http://www.seas.upenn.edu/~pdtb/

  • Python tools/code: http://compprag.christopherpotts.net/pdtb.html

• Rhetorical Structure Theory
  • LDC: http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2002T07

  • Project page: http://www.sfu.ca/rst/

43 / 48 Overview Discourse segmentation Discourse coherence theories Penn Discourse Treebank 2.0 Unsupervised coherence Conclusion

Prospects

Text segmentation: seems to have fallen out of fashion, but obviously important to many kinds of information extraction; probably awaiting a breakthrough idea.

Discourse coherence: on the rise in linguistics but perhaps not in NLP. Essential to all aspects of NLU, though, so a breakthrough would probably have widespread influence.

44 / 48 Overview Discourse segmentation Discourse coherence theories Penn Discourse Treebank 2.0 Unsupervised coherence Conclusion

References I

Asher, Nicholas and Alex Lascarides. 1993. Temporal interpretation, discourse relations, and common sense entailment. Linguistics and Philosophy 16(5):437–493.
Asher, Nicholas and Alex Lascarides. 2003. Logics of Conversation. Cambridge: Cambridge University Press.
Banko, Michele and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 26–33. Toulouse, France: Association for Computational Linguistics. doi:10.3115/1073012.1073017. URL http://www.aclweb.org/anthology/P01-1005.
Beeferman, Doug; Adam Berger; and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning 34:177–210. doi:10.1023/A:1007506220214. URL http://dl.acm.org/citation.cfm?id=309497.309507.
Carlson, Lynn; Daniel Marcu; and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of Rhetorical Structure Theory. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics.
Choi, Freddy Y. Y. 2000. Advances in domain independent linear text segmentation. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 26–33. Seattle, WA: Association for Computational Linguistics.
Halevy, Alon; Peter Norvig; and Fernando Pereira. 2009. The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2):8–12.
Halliday, Michael A. K. and Ruqaiya Hasan. 1976. Cohesion in English. London: Longman.
Hearst, Marti A. 1994. Multi-paragraph segmentation of expository text. In 32nd Annual Meeting of the Association for Computational Linguistics, 9–16. Las Cruces, New Mexico: Association for Computational Linguistics.
Hearst, Marti A. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1):33–64.
Hobbs, Jerry R. 1979. Coherence and coreference. Cognitive Science 3(1):67–90.

45 / 48 Overview Discourse segmentation Discourse coherence theories Penn Discourse Treebank 2.0 Unsupervised coherence Conclusion

References II

Hobbs, Jerry R. 1985. On the Coherence and Structure of Discourse. Stanford, CA: CSLI Publications.
Hobbs, Jerry R. 1990. Literature and Cognition, volume 21 of Lecture Notes. Stanford, CA: CSLI Publications.
Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 2nd edition.
Kehler, Andrew. 2000. Coherence and the resolution of ellipsis. Linguistics and Philosophy 23(6):533–575.
Kehler, Andrew. 2002. Coherence, Reference, and the Theory of Grammar. Stanford, CA: CSLI.
Kehler, Andrew. 2004. Discourse coherence. In Laurence R. Horn and Gregory Ward, eds., Handbook of Pragmatics, 241–265. Oxford: Blackwell Publishing Ltd.
Kehler, Andrew; Laura Kertz; Hannah Rohde; and Jeffrey L. Elman. 2007. Coherence and coreference revisited. Journal of Semantics 15(1):1–44.
Knott, Alistair and Ted J. M. Sanders. 1998. The classification of coherence relations and their linguistic markers: An exploration of two languages. Journal of Pragmatics 30(2):135–175.
Longacre, Robert E. 1983. The Grammar of Discourse. New York: Plenum Press.
Mann, William C. and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3):243–281.
Manning, Christopher D. 1998. Rethinking text segmentation models: An information extraction case study. Technical Report SULTRY-98-07-01, University of Sydney.
Marcu, Daniel and Abdessamad Echihabi. 2002. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 368–375. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. doi:10.3115/1073083.1073145. URL http://www.aclweb.org/anthology/P02-1047.
Martin, James R. 1992. English Text: System and Structure. Amsterdam: John Benjamins.
Pevzner, Lev and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics 28(1):19–36.

46 / 48 Overview Discourse segmentation Discourse coherence theories Penn Discourse Treebank 2.0 Unsupervised coherence Conclusion

References III

Pitler, Emily; Annie Louis; and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 683–691. Suntec, Singapore: Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P/P09/P09-1077.
Prasad, Rashmi; Nikhil Dinesh; Alan Lee; Eleni Miltsakaki; Livio Robaldo; Aravind Joshi; and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Nicoletta Calzolari; Khalid Choukri; Bente Maegaard; Joseph Mariani; Jan Odijk; Stelios Piperidis; and Daniel Tapias, eds., Proceedings of the Sixth International Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2008/.
Reynar, Jeffrey C. 1994. An automatic method for finding topic boundaries. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 331–333. Las Cruces, New Mexico: Association for Computational Linguistics. doi:10.3115/981732.981783. URL http://www.aclweb.org/anthology/P94-1050.
Reynar, Jeffrey C. 1998. Topic Segmentation: Algorithms and Applications. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.
Sharp, Bernadette and Caroline Chibelushi. 2008. Text segmentation of spoken meeting transcripts. International Journal of Speech Technology 11:157–165. doi:10.1007/s10772-009-9048-2. URL http://dx.doi.org/10.1007/s10772-009-9048-2.
Stone, Matthew. 2002. Communicative intentions and conversational processes in human–human and human–computer dialogue. In John Trueswell and Michael Tanenhaus, eds., World Situated Language Use: Psycholinguistic, Linguistic, and Computational Perspectives on Bridging the Product and Action Traditions. Cambridge, MA: MIT Press.
Webber, Bonnie; Matthew Stone; Aravind Joshi; and Alistair Knott. 2003. Anaphora and discourse structure. Computational Linguistics 29(4):545–587.

47 / 48 Overview Discourse segmentation Discourse coherence theories Penn Discourse Treebank 2.0 Unsupervised coherence Conclusion

References IV

Wolf, Florian and Edward Gibson. 2005. Representing discourse coherence: A corpus-based study. Computational Linguistics 31(2):249–287.

48 / 48