Automatic sense prediction for implicit discourse relations in text

Emily Pitler, Annie Louis, Ani Nenkova
Computer and Information Science, University of Pennsylvania
Philadelphia, PA 19104, USA
epitler,lannie,[email protected]

Abstract

We present a series of experiments on automatically identifying the sense of implicit discourse relations, i.e. relations that are not marked with a discourse connective such as "but" or "because". We work with a corpus of implicit relations present in newspaper text and report results on a test set that is representative of the naturally occurring distribution of senses. We use several linguistically informed features, including polarity tags, Levin verb classes, length of verb phrases, modality, context, and lexical features. In addition, we revisit past approaches using lexical pairs from unannotated text as features, explain some of their shortcomings and propose modifications. Our best combination of features outperforms the baseline from data intensive approaches by 4% for comparison and 16% for contingency.

1 Introduction

Implicit discourse relations abound in text and readers easily recover the sense of such relations during semantic interpretation. But automatic sense prediction for implicit relations is an outstanding challenge in discourse processing.

Discourse relations, such as causal and contrast relations, are often marked by explicit discourse connectives (also called cue words) such as "because" or "but". It is not uncommon, though, for a discourse relation to hold between two text spans without an explicit discourse connective, as the example below demonstrates:

(1) The 101-year-old magazine has never had to woo advertisers with quite so much fervor before. [because] It largely rested on its hard-to-fault demographics.

In this paper we address the problem of automatic sense prediction for discourse relations in newspaper text. For our experiments, we use the Penn Discourse Treebank, the largest existing corpus of discourse annotations for both implicit and explicit relations. Our work is also informed by the long tradition of data intensive methods that rely on huge amounts of unannotated text rather than on manually tagged corpora (Marcu and Echihabi, 2001; Blair-Goldensohn et al., 2007).

In our analysis, we focus only on implicit discourse relations and clearly separate these from explicits. Explicit relations are easy to identify: the most general senses (comparison, contingency, temporal and expansion) can be disambiguated in explicit relations with 93% accuracy based solely on the discourse connective used to signal the relation (Pitler et al., 2008). So reporting results on explicit and implicit relations separately will allow for clearer tracking of progress.

In this paper we investigate the effectiveness of various features designed to capture lexical and semantic regularities for identifying the sense of implicit relations. Given two text spans, previous work has used the cross-product of the words in the spans as features. We examine the most informative word pair features and find that they are not the semantically related pairs that researchers had hoped for. We then introduce several other methods capturing the meaning of the spans (polarity features, semantic classes, tense, etc.) and evaluate their effectiveness. This is the first study which reports results on classifying naturally occurring implicit relations in text and uses the natural distribution of the various senses.

2 Related Work

Experiments on implicit and explicit relations Previous work has dealt with the prediction of discourse relation sense, but often for explicits and at the sentence level.

Soricut and Marcu (2003) address the task of parsing discourse structures within the same sentence. They use the RST corpus (Carlson et al., 2001), which contains 385 Wall Street Journal articles annotated following Rhetorical Structure Theory (Mann and Thompson, 1988). Many of the useful features, syntax in particular, exploit the fact that both arguments of the connective are found in the same sentence. Such features would not be applicable to the analysis of implicit relations, which occur intersententially.

Wellner et al. (2006) used the GraphBank (Wolf and Gibson, 2005), which contains 105 Associated Press and 30 Wall Street Journal articles annotated with discourse relations. They achieve 81% accuracy in sense disambiguation on this corpus. However, GraphBank annotations do not differentiate between implicits and explicits, so it is difficult to verify success for implicit relations.

Experiments on artificial implicits Marcu and Echihabi (2001) proposed a method for cheap acquisition of training data for discourse relation sense prediction. Their idea is to use unambiguous patterns such as [Arg1, but Arg2.] to create synthetic examples of implicit relations. They delete the connective and use [Arg1, Arg2] as an example of an implicit relation.

The approach is tested using binary classification between relations on balanced data, a setting very different from that of any realistic application. For example, a question-answering application that needs to identify causal relations (as in Girju (2003)) must not only differentiate causal relations from comparison relations, but also from expansions, temporal relations, and possibly no relation at all. In addition, using equal numbers of examples of each type can be misleading because the distribution of relations is known to be skewed, with expansions occurring most frequently. Causal and comparison relations, which are most useful for applications, are less frequent. Because of this, the recall of the classification should be the primary metric of success, while the Marcu and Echihabi (2001) experiments report only accuracy.

Later work (Blair-Goldensohn et al., 2007; Sporleder and Lascarides, 2008) has discovered that the models learned do not perform as well on implicit relations as one might expect from the test accuracies on synthetic data.

3 Penn Discourse Treebank

For our experiments, we use the Penn Discourse Treebank (PDTB; Prasad et al., 2008), the largest available annotated corpus of discourse relations. The PDTB contains discourse annotations over the same 2,312 Wall Street Journal (WSJ) articles as the Penn Treebank.

For each explicit discourse connective (such as "but" or "so"), annotators identified the two text spans between which the relation holds and the sense of the relation.

The PDTB also provides information about local implicit relations. For each pair of adjacent sentences within the same paragraph, annotators selected the explicit discourse connective which best expressed the relation between the sentences and then assigned a sense to the relation. In Example (1) above, the annotators identified "because" as the most appropriate connective between the sentences, and then labeled the implicit discourse relation Contingency.

In the PDTB, explicit and implicit relations are clearly distinguished, allowing us to concentrate solely on the implicit relations.

As mentioned above, each implicit and explicit relation is annotated with a sense. The senses are arranged in a hierarchy, allowing for annotations as specific as Contingency.Cause.reason. In our experiments, we use only the top level of the sense annotations: Comparison, Contingency, Expansion, and Temporal. Using just these four relations allows us to be theory-neutral; while different frameworks (Hobbs, 1979; McKeown, 1985; Mann and Thompson, 1988; Knott and Sanders, 1998; Asher and Lascarides, 2003) include different relations of varying specificities, all of them include these four core relations, sometimes under different names.

Each relation in the PDTB takes two arguments. Example (1) can be seen as the predicate Contingency which takes the two sentences as arguments. For implicits, the span in the first sentence is called Arg1 and the span in the following sentence is called Arg2.
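For concreteness, the sketch below shows one way such an implicit relation instance could be represented in code. The class and field names (ImplicitRelation, arg1, arg2, connective, sense) are our own illustration and do not correspond to any official PDTB API.

```python
from dataclasses import dataclass

# Hypothetical container for one PDTB-style implicit relation instance.
# Field names are illustrative; they do not mirror the official PDTB tools.
@dataclass
class ImplicitRelation:
    arg1: str          # text span from the first sentence
    arg2: str          # text span from the following sentence
    connective: str    # connective inserted by the annotators, e.g. "because"
    sense: str         # top-level sense: Comparison, Contingency, Expansion, Temporal

# Example (1) from the paper, annotated as Contingency.
example = ImplicitRelation(
    arg1="The 101-year-old magazine has never had to woo advertisers "
         "with quite so much fervor before.",
    arg2="It largely rested on its hard-to-fault demographics.",
    connective="because",
    sense="Contingency",
)
```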

4 Word pair features in prior work

Cross product of words Discourse connectives are the most reliable predictors of the semantic sense of the relation (Marcu, 2000; Pitler et al., 2008). However, in the absence of explicit markers, the most easily accessible features are the words in the two text spans of the relation. Intuitively, one would expect that there is some relationship that holds between the words in the two arguments. Consider for example the following sentences:

The recent explosion of country funds mirrors the "closed-end fund mania" of the 1920s, Mr. Foot says, when narrowly focused funds grew wildly popular. They fell into oblivion after the 1929 crash.

The words "popular" and "oblivion" are almost antonyms, and one might hypothesize that their occurrence in the two text spans is what triggers the contrast relation between the sentences. Similarly, a pair of words such as (rain, rot) might be indicative of a causal relation. If this hypothesis is correct, pairs of words (w1, w2) such that w1 appears in the first sentence and w2 appears in the second sentence would be good features for identifying contrast relations.

Indeed, word pairs form the basic feature of most previous work on classifying implicit relations (Marcu and Echihabi, 2001; Blair-Goldensohn et al., 2007; Sporleder and Lascarides, 2008) or the simpler task of predicting which connective should be used to express a relation (Lapata and Lascarides, 2004).

Semantic relations vs. function word pairs If the hypothesis about word pair triggers of discourse relations were true, the analysis of unambiguous relations could be used to discover pairs of words with causal or contrastive relations holding between them. Yet, feature analysis has not been performed in prior studies to establish or refute this possibility.

At the same time, feature selection is always necessary for word pairs, which are numerous and lead to data sparsity problems. Here, we present a meta-analysis of the feature selection work in three prior studies.

One approach to reducing the number of features follows the hypothesis of semantic relations between words. Marcu and Echihabi (2001) considered only nouns, verbs and other cue phrases in word pairs. They found that even with millions of training examples, prediction results using all words were superior to those based on only pairs of non-function words. However, since the learning curve is steeper when function words are removed, they hypothesize that using only non-function words will outperform using all words once enough training data is available.

In a similar vein, Lapata and Lascarides (2004) used pairings of only verbs, nouns and adjectives for predicting which temporal connective is most suitable to express the relation between two given text spans. Verb pairs turned out to be one of the best features, but no useful information was obtained using nouns and adjectives.

Blair-Goldensohn et al. (2007) proposed several refinements of the word pair model. They show that (i) stemming, (ii) using a small fixed vocabulary size consisting of only the most frequent stems (which would tend to be dominated by function words) and (iii) a cutoff on the minimum frequency of a feature all result in improved performance. They also report that filtering stopwords has a negative impact on the results.

Given these findings, we expect that pairs of function words are informative features helpful in predicting discourse relation sense. In the work that we describe next, we use feature selection to investigate the word pairs in detail.

5 Analysis of word pair features

For the analysis of word pair features, we use a large collection of automatically extracted explicit examples from the experiments in Blair-Goldensohn et al. (2007). The data, from now on referred to as TextRels, has explicit contrast and causal relations which were extracted from the English Gigaword Corpus (Graff, 2003), which contains over four million newswire articles.

The explicit cue phrase is removed from each example and the spans are treated as belonging to an implicit relation. Besides cause and contrast, the TextRels data include a no-relation category which consists of sentences from the same text that are separated by at least three other sentences.

To identify features useful for classifying comparison vs. other relations, we chose a random sample of 5000 examples for Contrast and 5000 Other relations (2500 each of Cause and No-relation). For the complete set of 10,000 examples, word pair features were computed. After removing word pairs that appear less than 5 times, the remaining features were ranked by information gain using the MALLET toolkit (mallet.cs.umass.edu).

Table 1 lists the word pairs with highest information gain for the Contrast vs. Other and Cause vs. Other classification tasks. All contain very frequent stop words, and interestingly, for the Contrast vs. Other task, most of the word pairs contain discourse connectives. This is certainly unexpected, given that word pairs were formed by deleting the discourse connectives from the sentences expressing Contrast. Word pairs containing "but" as one of their elements in fact signal the presence of a relation that is not Contrast.

Consider the example shown below:

The government says it has reached most isolated townships by now, but because roads are blocked, getting anything but basic food supplies to people remains difficult.

Following Marcu and Echihabi (2001), the pair [The government says it has reached most isolated townships by now, but] and [roads are blocked, getting anything but basic food supplies to people remains difficult.] is created as an example of the Cause relation. Because of examples like this, "but-but" is a very useful word pair feature indicating Cause, as the "but" would have been removed for the artificial Contrast examples. In fact, the top 17 features for classifying Contrast versus Other all contain the word "but", and are indications that the relation is Other.

These findings indicate an unexpected anomalous effect in the use of synthetic data. Since relations are created by removing connectives, if an unambiguous connective remains, its presence is a reliable indicator that the example should be classified as Other. Such features might work well and lead to high accuracy results in identifying synthetic implicit relations, but are unlikely to be useful in a realistic setting of actual implicits.

Comparison vs. Other            Contingency vs. Other
the-but    s-but     the-in     the-and    in-the    the-of
of-but     for-but   but-but    said-said  to-of     the-a
in-but     was-but   it-but     a-and      a-the     of-the
to-but     that-but  the-it*    to-and     to-to     the-in
and-but    but-the   to-it*     and-and    the-the   in-in
a-but      he-but    said-in    to-the     of-and    a-of
said-but   they-but  of-in      in-and     in-of     s-and

Table 1: Word pairs with highest information gain.

Also note that the only two features predictive of the Comparison class (indicated by * in Table 1), the-it and to-it, contain only function words rather than semantically related non-function words. This ranking explains the observations reported in Blair-Goldensohn et al. (2007), where removing stopwords degraded classifier performance, and why using only nouns, verbs or adjectives (Marcu and Echihabi, 2001; Lapata and Lascarides, 2004) is not the best option. (In addition, an informal inspection of 100 word pairs with high information gain for Contrast vs. Other, choosing the longest word pairs as those are more likely to be content words, found only six semantically opposed pairs.)

6 Features for sense prediction of implicit discourse relations

The contrast between the "popular"/"oblivion" example we started with above can be analyzed in terms of lexical relations (near antonyms), but it could also be explained by the different polarities of the two words: "popular" is generally a positive word, while "oblivion" has negative connotations.

While we agree that the actual words in the arguments are quite useful, we also define several higher-level features corresponding to various semantic properties of the words. The words in the two text spans of a relation are taken from the gold-standard annotations in the PDTB.

Polarity Tags: We define features that represent the sentiment of the words in the two spans. Each word's polarity was assigned according to its entry in the Multi-perspective Question Answering Opinion Corpus (Wilson et al., 2005). In this resource, each sentiment word is annotated as positive, negative, both, or neutral. We use the number of negated and non-negated positive, negative, and neutral sentiment words in the two text spans as features. If a writer refers to something as "nice" in Arg1, that counts towards the positive sentiment count (Arg1Positive); "not nice" would count towards Arg1NegatePositive. A sentiment word is negated if a word with a General Inquirer (Stone et al., 1966) Negate tag precedes it. We also have features for the cross products of these polarities between Arg1 and Arg2.

We expected that these features could help Comparison examples especially. Consider the following example:

Executives at Time Inc. Magazine Co., a subsidiary of Time Warner, have said the joint venture with Mr. Lang wasn't a good one. The venture, formed in 1986, was supposed to be Time's low-cost, safe entry into women's magazines.

The word "good" is annotated with positive polarity; however, it is negated. "Safe" is tagged as having positive polarity, so this opposition could indicate the Comparison relation between the two sentences.
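A minimal sketch of how these counts and cross products could be computed is given below, assuming a lookup table word_polarity (standing in for the MPQA annotations) and a set negate_words (standing in for the General Inquirer Negate class). Both inputs and the helper name are placeholders, not the resources' actual interfaces.

```python
from collections import Counter

def polarity_features(arg1_tokens, arg2_tokens, word_polarity, negate_words):
    """Count (possibly negated) sentiment words per argument, plus the
    cross product of Arg1 and Arg2 polarity categories.
    word_polarity: dict word -> {"positive", "negative", "both", "neutral"}
    negate_words: set of negation words (General Inquirer Negate tag)."""
    feats = Counter()
    per_arg = {}
    for name, tokens in (("Arg1", arg1_tokens), ("Arg2", arg2_tokens)):
        cats = []
        for i, tok in enumerate(tokens):
            pol = word_polarity.get(tok.lower())
            if pol is None:
                continue
            negated = i > 0 and tokens[i - 1].lower() in negate_words
            cat = ("Negate" if negated else "") + pol.capitalize()
            cats.append(cat)
            feats[name + cat] += 1        # e.g. Arg1Positive, Arg1NegatePositive
        per_arg[name] = cats
    for c1 in per_arg["Arg1"]:            # cross product of polarity categories
        for c2 in per_arg["Arg2"]:
            feats["Arg1" + c1 + "_Arg2" + c2] += 1
    return feats
```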

Inquirer Tags: To get at the meanings of the spans, we look up what semantic categories each word falls into according to the General Inquirer lexicon (Stone et al., 1966). The General Inquirer has classes for positive and negative polarity, as well as more fine-grained categories such as words related to virtue or vice. The Inquirer even contains a category called "Comp" that includes words that tend to indicate Comparison, such as "optimal", "other", "supreme", or "ultimate".

Several of the categories are complementary: Understatement versus Overstatement, Rise versus Fall, or Pleasure versus Pain. Pairs where one argument contains words that indicate Rise and the other argument indicates Fall might be good evidence for a Comparison relation.

The benefit of using these tags instead of just the word pairs is that we see more observations for each semantic class than for any particular word, reducing the data sparsity problem. For example, the pair rose:fell often indicates a Comparison relation when speaking about stocks. However, occasionally authors refer to stock prices as "jumping" rather than "rising". Since both jump and rise are members of the Rise class, new jump examples can be classified using past rise examples.

Development testing showed that including features for all words' tags was not useful, so we include the Inquirer tags of only the verbs in the two arguments and their cross product. Just as for the polarity features, we include features for both each tag and its negation.

Money/Percent/Num: If two adjacent sentences both contain numbers, dollar amounts, or percentages, it is likely that a comparison relation might hold between the sentences. We included a feature for the count of numbers, percentages, and dollar amounts in Arg1 and Arg2. We also included the number of times each combination of number/percent/dollar occurs in Arg1 and Arg2. For example, if Arg1 mentions a percentage and Arg2 has two dollar amounts, the feature Arg1Percent-Arg2Money would have a count of 2. This feature is probably genre-dependent: numbers and percentages often appear in financial texts but would be less frequent in other genres.

WSJ-LM: This feature represents the extent to which the words in the text spans are typical of each relation. For each sense, we created unigram and bigram language models over the implicit examples in the training set. We compute each example's probability according to each of these language models. The features are the ranks of the spans' likelihoods according to the various language models. For example, if, among the unigram models, the most likely relation to generate this example was Contingency, then the example would include the feature ContingencyUnigram1. If the third most likely relation according to the bigram models was Expansion, then it would include the feature ExpansionBigram3.

Expl-LM: This feature ranks the text spans according to language models derived from the explicit examples in the TextRels corpus. However, that corpus contains only Cause, Contrast and No-relation, hence we expect the WSJ language models to be more helpful.
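As an illustration, such rank features can be derived from per-sense language models as sketched below. We use simple add-one-smoothed unigram models here purely for exposition, not the models used in our experiments; the class and function names are ours.

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one smoothed unigram language model over one sense's training spans."""
    def __init__(self, texts):
        self.counts = Counter(w.lower() for t in texts for w in t.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1

    def logprob(self, text):
        return sum(math.log((self.counts[w.lower()] + 1) / (self.total + self.vocab))
                   for w in text.split())

def lm_rank_features(span, models, suffix="Unigram"):
    """models: dict sense -> UnigramLM.  Emits features like 'ContingencyUnigram1',
    meaning Contingency's model ranked the span first (most likely)."""
    ranked = sorted(models, key=lambda s: models[s].logprob(span), reverse=True)
    return {f"{sense}{suffix}{rank}": 1 for rank, sense in enumerate(ranked, start=1)}
```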

Verbs: These features include the number of pairs of verbs in Arg1 and Arg2 from the same verb class. Two verbs are from the same verb class if each of their highest Levin verb class (Levin, 1993) levels (in the LCS Database (Dorr, 2001)) are the same. The intuition behind this feature is that the more related the verbs, the more likely the relation is an Expansion.

The verb features also include the average length of verb phrases in each argument, as well as the cross product of this feature for the two arguments. We hypothesized that verb chunks that contain more words, such as "They [are allowed to proceed]", often contain rationales afterwards (signifying Contingency relations), while short verb phrases like "They proceed" might occur more often in Expansion or Temporal relations.

Our final verb features were the part of speech tags (gold-standard from the Penn Treebank) of the main verb. One would expect that Expansion would link sentences with the same tense, whereas Contingency and Temporal relations would contain verbs with different tenses.

First-Last, First3: The first and last words of a relation's arguments have been found to be particularly useful for predicting its sense (Wellner et al., 2006). Wellner et al. (2006) suggest that these words are such predictive features because they are often explicit discourse connectives. In our experiments on implicits, the first and last words are not connectives. However, some implicits have been found to be related by connective-like expressions which often appear at the beginning of the second argument. In the PDTB, these are annotated as alternatively lexicalized relations (AltLexes). To capture such effects, we included the first and last words of Arg1 as features, the first and last words of Arg2, the pair of the first words of Arg1 and Arg2, and the pair of the last words. We also add two additional features which indicate the first three words of each argument.

Modality: Modal words, such as "can", "should", and "may", are often used to express conditional statements (e.g. "If I were a wealthy man, I wouldn't have to work hard."), thus signaling a Contingency relation. We include a feature for the presence or absence of modals in Arg1 and Arg2, features for specific modal words, and their cross products.

Context: Some implicit relations appear immediately before or immediately after certain explicit relations far more often than one would expect due to chance (Pitler et al., 2008). We define a feature indicating whether the immediately preceding (or following) relation was an explicit. If it was, we include the connective trigger of the relation and its sense as features. We use oracle annotations of the connective sense; however, most of the connectives are unambiguous.

One might expect a different distribution of relation types at the beginning of a paragraph versus further in the middle. We capture paragraph-position information using a feature which indicates whether Arg1 begins a paragraph.

Word pairs Four variants of word pair models were used in our experiments. All the models were eventually tested on implicit examples from the PDTB, but the training set-up was varied.

Wordpairs-TextRels In this setting, we trained a model on word pairs derived from unannotated text (the TextRels corpus).

Wordpairs-PDTBImpl Word pairs for training were formed from the cross product of words in the textual spans (Arg1 x Arg2) of the PDTB implicit relations.

Wordpairs-selected Here, only word pairs from Wordpairs-PDTBImpl with non-zero information gain on the TextRels corpus were retained.

Wordpairs-PDTBExpl In this case, the model was formed by using the word pairs from the explicit relations in the sections of the PDTB used for training.

7 Classification Results

For all experiments, we used sections 2-20 of the PDTB for training and sections 21-22 for testing. Sections 0-1 were used as a development set for feature design.

We ran four binary classification tasks to identify each of the main relations from the rest. As each of the relations besides Expansion is infrequent, we train using equal numbers of positive and negative examples of the target relation. The negative examples were chosen at random. We used all of sections 21 and 22 for testing, so the test set is representative of the natural distribution.

The training sets contained: Comparison (1927 positive, 1927 negative), Contingency (3500 each), Expansion (6356 each), and Temporal (730 each). (The PDTB also contains annotations of entity relations, which most frameworks consider a subset of Expansion; we therefore include relations annotated as EntRel as positive examples of Expansion.)

The test set contained: 151 examples of Comparison, 291 examples of Contingency, 986 examples of Expansion, 82 examples of Temporal, and 13 examples of No-relation.

We used Naive Bayes, Maximum Entropy (MaxEnt), and AdaBoost (Freund and Schapire, 1996) classifiers implemented in MALLET.
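The training and evaluation protocol can be sketched as follows. Scikit-learn's Naive Bayes and f-score routines stand in for the MALLET classifiers actually used, and featurize is a placeholder for whichever feature extractor is being evaluated; this is an illustrative setup, not our exact pipeline.

```python
import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import f1_score, accuracy_score

def train_and_eval_one_vs_other(target, train_examples, test_examples, featurize):
    """train_examples / test_examples: lists of (feature_source, sense) pairs.
    Training is balanced (all positives plus an equal number of sampled
    negatives); the test set keeps its natural sense distribution."""
    pos = [ex for ex, s in train_examples if s == target]
    neg = [ex for ex, s in train_examples if s != target]
    neg = random.sample(neg, min(len(pos), len(neg)))   # balance the classes
    X_raw = pos + neg
    y = [1] * len(pos) + [0] * len(neg)

    vec = DictVectorizer()
    X = vec.fit_transform([featurize(ex) for ex in X_raw])
    clf = BernoulliNB().fit(X, y)

    X_test = vec.transform([featurize(ex) for ex, _ in test_examples])
    y_test = [1 if s == target else 0 for _, s in test_examples]
    pred = clf.predict(X_test)
    return f1_score(y_test, pred), accuracy_score(y_test, pred)
```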

7.1 Non-Wordpair Features

The performance using only our semantically informed features is shown in Table 2. Only the Naive Bayes classification results are given, as space is limited and MaxEnt and AdaBoost gave slightly lower accuracies overall.

The table lists the f-score for each of the target relations, with overall accuracy shown in brackets. Given that the experiments are run on the natural distribution of the data, which is skewed towards Expansion relations, the f-score is the more important measure to track.

Features             Comp. vs. Other  Cont. vs. Other  Exp. vs. Other  Temp. vs. Other  Four-way
Money/Percent/Num    19.04 (43.60)    18.78 (56.27)    22.01 (41.37)   10.40 (23.05)    (63.38)
Polarity Tags        16.63 (55.22)    19.82 (76.63)    71.29 (59.23)   11.12 (18.12)    (65.19)
WSJ-LM               18.04 (9.91)     0.00 (80.89)     0.00 (35.26)    10.22 (5.38)     (65.26)
Expl-LM              18.04 (9.91)     0.00 (80.89)     0.00 (35.26)    10.22 (5.38)     (65.26)
Verbs                18.55 (26.19)    36.59 (62.44)    59.36 (52.53)   12.61 (41.63)    (65.33)
First-Last, First3   21.01 (52.59)    36.75 (59.09)    63.22 (56.99)   15.93 (61.20)    (65.40)
Inquirer tags        17.37 (43.8)     15.76 (77.54)    70.21 (58.04)   11.56 (37.69)    (62.21)
Modality             17.70 (17.6)     21.83 (76.95)    15.38 (37.89)   11.17 (27.91)    (65.33)
Context              19.32 (56.66)    29.55 (67.42)    67.77 (57.85)   12.34 (55.22)    (64.01)
Random               9.91             19.11            64.74           5.38

Table 2: f-score (accuracy) using different features; Naive Bayes.

Our random baseline is the f-score one would achieve by randomly assigning classes in proportion to their true distribution in the test set. The best results for all four tasks are considerably higher than random prediction, but still low overall. Our features provide 6% to 18% absolute improvements in f-score over the baseline for each of the four tasks. The largest gain was in the Contingency versus Other prediction task. The least improvement was for distinguishing Expansion versus Other. However, since Expansion forms the largest class of relations, its f-score is still the highest overall. We discuss the results per relation class next.

Comparison We expected that polarity features would be especially helpful for identifying Comparison relations. Surprisingly, polarity was actually one of the worst classes of features for Comparison, achieving an f-score of 16.63 (in contrast to using the first, last and first three words of the sentences as features, which leads to an f-score of 21.01). We examined the prevalence of positive-negative or negative-positive polarity pairs in our training set. 30% of the Comparison examples contain one of these opposite polarity pairs, while 31% of the Other examples contain an opposite polarity pair. To our knowledge, this is the first study to examine the prevalence of polarity words in the arguments of discourse relations in their natural distributions. Contrary to popular belief, Comparisons do not tend to have more opposite polarity pairs.

The two most useful classes of features for recognizing Comparison relations were the first, last and first three words in the sentence and the context features that indicate the presence of a paragraph boundary or of an explicit relation just before or just after the location of the hypothesized implicit relation (19.32 f-score).

Contingency The two best features for the Contingency vs. Other distinction were verb information (36.59 f-score) and the first, last and first three words in the sentence (36.75 f-score). Context again was one of the features that led to improvement. This makes sense, as Pitler et al. (2008) found that implicit contingencies are often found immediately following explicit comparisons.

We were surprised that the polarity features were helpful for Contingency but not Comparison. Again we looked at the prevalence of opposite polarity pairs. While for Comparison versus Other there was not a significant difference, for Contingency there are quite a few more opposite polarity pairs (52%) than for not Contingency (41%).

The language model features were completely useless for distinguishing contingencies from other relations.

Expansion As Expansion is the majority class in the natural distribution, recall is less of a problem than precision. The features that help achieve the best f-score are all features that were found to be useful in identifying other relations. Polarity tags, Inquirer tags and context were the best features for identifying expansions, with f-scores around 70%.

Temporal Implicit temporal relations are relatively rare, making up only about 5% of our test set. Most temporal relations are explicitly marked with a connective like "when" or "after".

Yet again, the first and last words of the sentence turned out to be useful indicators for temporal relations (15.93 f-score). The importance of the first and last words for this distinction is clear: it derives from the fact that temporal implicits often contain words like "yesterday" or "Monday" at the end of the sentence. Context is the next most helpful feature for temporal relations.

7.2 Which word pairs help?

For Comparison and Contingency, we analyze the behavior of word pair features under several different settings. Specifically, we want to address two important related questions raised in recent work by others: (i) is unannotated data from explicits useful for training models that disambiguate implicit discourse relations, and (ii) are explicit and implicit relations intrinsically different from each other?

Wordpairs-TextRels is the worst approach. The best use of word pair features is Wordpairs-selected. This model gives 4% better absolute f-score for Comparison and 14% for Contingency over Wordpairs-TextRels. In this setting the TextRels data was used to choose the word pair features, but the probabilities for each feature were estimated using the training portion of the PDTB implicit examples.
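A sketch of the Wordpairs-selected idea under these assumptions: selected_pairs holds the word pairs chosen on the (explicit) TextRels data, while the classifier itself is estimated from PDTB implicit examples. Scikit-learn's Naive Bayes stands in for MALLET, and the function and variable names are illustrative only.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

def train_wordpairs_selected(selected_pairs, pdtb_train):
    """selected_pairs: set of (w1, w2) pairs with non-zero information gain
    on the explicit TextRels data.
    pdtb_train: list of (arg1, arg2, sense_label) PDTB implicit examples."""
    def features(arg1, arg2):
        pairs = {(w1.lower(), w2.lower())
                 for w1 in arg1.split() for w2 in arg2.split()}
        # keep only pairs chosen on the unannotated explicit data
        return {str(p): 1 for p in pairs & selected_pairs}

    vec = DictVectorizer()
    X = vec.fit_transform([features(a1, a2) for a1, a2, _ in pdtb_train])
    y = [sense for _, _, sense in pdtb_train]
    return vec, BernoulliNB().fit(X, y)
```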

We also confirm that even within the PDTB, information from annotated explicit relations (Wordpairs-PDTBExpl) is not as helpful as information from annotated implicit relations (Wordpairs-PDTBImpl). The absolute difference in f-score between the two models is close to 2% for Comparison, and 6% for Contingency.

Comp. vs. Other
Wordpairs-TextRels                                                    17.13 (46.62)
Wordpairs-PDTBExpl                                                    19.39 (51.41)
Wordpairs-PDTBImpl                                                    20.96 (42.55)
First-last, first3 (best-non-wp)                                      21.01 (52.59)
Best-non-wp + Wordpairs-selected                                      21.88 (56.40)
Wordpairs-selected                                                    21.96 (56.59)
Cont. vs. Other
Wordpairs-TextRels                                                    31.10 (41.83)
Wordpairs-PDTBExpl                                                    37.77 (56.73)
Wordpairs-PDTBImpl                                                    43.79 (61.92)
Polarity, verbs, first-last, first3, modality, context (best-non-wp)  42.14 (66.64)
Wordpairs-selected                                                    45.60 (67.10)
Best-non-wp + Wordpairs-selected                                      47.13 (67.30)
Expn. vs. Other
Best-non-wp + wordpairs                                               62.39 (59.55)
Wordpairs-PDTBImpl                                                    63.84 (60.28)
Polarity, inquirer tags, context (best-non-wp)                        76.42 (63.62)
Temp. vs. Other
First-last, first3 (best-non-wp)                                      15.93 (61.20)
Wordpairs-PDTBImpl                                                    16.21 (61.98)
Best-non-wp + Wordpairs-PDTBImpl                                      16.76 (63.49)

Table 3: f-score (accuracy) of various feature sets; Naive Bayes.

7.3 Best results

Adding other features to word pairs leads to improved performance for Contingency, Expansion and Temporal relations, but not for Comparison.

For contingency detection, the best combination of our features included polarity, verb information, first and last words, modality, and context with Wordpairs-selected. This combination led to a definite improvement, reaching an f-score of 47.13 (a 16% absolute improvement in f-score over Wordpairs-TextRels).

For detecting expansions, the best combination of our features (polarity + Inquirer tags + context) outperformed Wordpairs-PDTBImpl by a wide margin, close to 13% absolute improvement (f-scores of 76.42 and 63.84 respectively).

7.4 Sequence Model of Discourse Relations

Our results from the previous section show that the classification of implicits benefits from information about nearby relations, and so we expected improvements from using a sequence model rather than classifying each relation independently.

We trained a CRF classifier (Lafferty et al., 2001) over the sequence of implicit examples from all documents in sections 02 to 20. The test set is the same as used for the 2-way classifiers. We compare against a 6-way Naive Bayes classifier (over the four main relations, EntRel, and NoRel). Only word pairs were used as features for both. Overall 6-way prediction accuracy is 43.27% for the Naive Bayes model and 44.58% for the CRF model.
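As a rough illustration only, a document-level sequence model over implicit relations could be set up as below with the sklearn-crfsuite package. This is not the CRF implementation used in our experiments, and the feature encoding and hyperparameters are placeholders.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def document_sequences(documents):
    """documents: list of documents, each a list of (arg1, arg2, sense) implicit
    relations in textual order.  Returns CRF-style (X, y) sequences where each
    item is a dict of word-pair indicator features."""
    X, y = [], []
    for doc in documents:
        feats, labels = [], []
        for arg1, arg2, sense in doc:
            pairs = {f"wp={w1.lower()}|{w2.lower()}": 1.0
                     for w1 in arg1.split() for w2 in arg2.split()}
            feats.append(pairs)
            labels.append(sense)
        X.append(feats)
        y.append(labels)
    return X, y

# Illustrative training call; hyperparameters are arbitrary.
# X_train, y_train = document_sequences(train_docs)
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
# crf.fit(X_train, y_train)
```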

8 Conclusion

We have presented the first study that predicts implicit discourse relations in a realistic setting (distinguishing a relation of interest from all others, where the relations occur in their natural distributions). Also unlike prior work, we separate this task from the easier task of explicit discourse prediction. Our experiments demonstrate that features developed to capture word polarity, verb classes and orientation, as well as some lexical features, are strong indicators of the type of discourse relation.

We analyze word pair features used in prior work that were intended to capture such semantic oppositions. We show that these features in fact do not capture semantic relations but rather give information about function word co-occurrences. However, they are still a useful source of information for discourse relation prediction. The most beneficial application of such features is when they are selected from a large unannotated corpus of explicit relations, but then trained on manually annotated implicit relations.

Context, in terms of paragraph boundaries and nearby explicit relations, also proves to be useful for the prediction of implicit discourse relations. It is helpful when added as a feature in a standard, instance-by-instance learning model. A sequence model also leads to over 1% absolute improvement for the task.

9 Acknowledgments

This work was partially supported by NSF grants IIS-0803159, IIS-0705671 and IGERT 0504487. We would like to thank Sasha Blair-Goldensohn for providing us with the TextRels data and for the insightful discussion in the early stages of our work.

References

N. Asher and A. Lascarides. 2003. Logics of Conversation. Cambridge University Press.

S. Blair-Goldensohn, K.R. McKeown, and O.C. Rambow. 2007. Building and Refining Rhetorical-Semantic Relation Models. In Proceedings of NAACL HLT, pages 428–435.

L. Carlson, D. Marcu, and M.E. Okurowski. 2001. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue, pages 1–10.

B.J. Dorr. 2001. LCS Verb Database. Technical Report Online Software Database, University of Maryland, College Park, MD.

Y. Freund and R.E. Schapire. 1996. Experiments with a New Boosting Algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156.

R. Girju. 2003. Automatic detection of causal relations for Question Answering. In Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering - Volume 12, pages 76–83.

D. Graff. 2003. English Gigaword corpus. Corpus number LDC2003T05, Linguistic Data Consortium, Philadelphia.

J. Hobbs. 1979. Coherence and coreference. Cognitive Science, 3:67–90.

A. Knott and T. Sanders. 1998. The classification of coherence relations and their linguistic markers: An exploration of two languages. Journal of Pragmatics, 30(2):135–175.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML 2001), pages 282–289.

M. Lapata and A. Lascarides. 2004. Inferring sentence-internal temporal relations. In HLT-NAACL 2004: Main Proceedings.

B. Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago, IL.

W.C. Mann and S.A. Thompson. 1988. Rhetorical structure theory: Towards a functional theory of text organization. Text, 8.

D. Marcu and A. Echihabi. 2001. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 368–375.

D. Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarization. The MIT Press.

K. McKeown. 1985. Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge University Press, Cambridge, England.

E. Pitler, M. Raghupathy, H. Mehta, A. Nenkova, A. Lee, and A. Joshi. 2008. Easily identifiable discourse relations. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), short paper.

R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of LREC.

R. Soricut and D. Marcu. 2003. Sentence level discourse parsing using syntactic and lexical information. In HLT-NAACL.

C. Sporleder and A. Lascarides. 2008. Using automatically labelled examples to classify rhetorical relations: An assessment. Natural Language Engineering, 14:369–416.

P.J. Stone, J. Kirsh, and Cambridge Computer Associates. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

B. Wellner, J. Pustejovsky, C. Havasi, A. Rumshisky, and R. Sauri. 2006. Classification of discourse coherence relations: An exploratory study using multiple knowledge sources. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue.

T. Wilson, J. Wiebe, and P. Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354.

F. Wolf and E. Gibson. 2005. Representing discourse coherence: A corpus-based study. Computational Linguistics, 31(2):249–288.