Discovering Implicit Discourse Relations Through Brown Cluster Pair Representation and Coreference Patterns

Attapol T. Rutherford
Department of Computer Science
Brandeis University
Waltham, MA 02453, USA
[email protected]

Nianwen Xue
Department of Computer Science
Brandeis University
Waltham, MA 02453, USA
[email protected]

Abstract

Sentences form coherent relations in a discourse without discourse connectives more frequently than with connectives. The senses of these implicit discourse relations that hold between a sentence pair, however, are challenging to infer. Here, we employ Brown cluster pairs to represent discourse relations and incorporate coreference patterns to identify senses of implicit discourse relations in naturally occurring text. Our system improves the baseline performance by as much as 25%. Feature analyses suggest that Brown cluster pairs and coreference patterns can reveal many key linguistic characteristics of each type of discourse relation.

1 Introduction

Sentences must be pieced together logically in a discourse to form coherent text. Many discourse relations in the text are signaled explicitly through a closed set of discourse connectives. Simply disambiguating the meaning of discourse connectives can determine whether adjacent clauses are temporally or causally related (Pitler et al., 2008; Wellner et al., 2009). Discourse relations and their senses, however, can also be inferred by the reader even without discourse connectives. These implicit discourse relations in fact outnumber explicit discourse relations in naturally occurring text. Inferring the types or senses of implicit discourse relations remains a key challenge in automatic discourse analysis.

A discourse parser requires many subcomponents which form a long pipeline. Implicit discourse relation discovery has been shown to be the main performance bottleneck of an end-to-end parser (Lin et al., 2010). It is also central to many applications such as automatic summarization and question-answering systems.

Existing systems, which make heavy use of word pairs, suffer from the data sparsity problem, as a word pair in the training data may not appear in the test data. A better representation of two adjacent sentences beyond word pairs could have a significant impact on predicting the sense of the discourse relation that holds between them. Data-driven, theory-independent word classification such as Brown clustering should be able to provide a more compact word representation (Brown et al., 1992). The Brown clustering algorithm induces a hierarchy of words in a large unannotated corpus based on word co-occurrences within a window. The induced hierarchy might give rise to features that we would otherwise miss. In this paper, we propose to use the Cartesian product of the Brown cluster assignments of the sentence pair as an alternative abstract word representation for building an implicit discourse relation classifier.

Through word-level semantic commonalities revealed by Brown clusters and entity-level relations revealed by coreference resolution, we might be able to paint a more complete picture of the discourse relation in question. Coreference resolution unveils the patterns of entity realization within the discourse, which might provide clues for the types of the discourse relations. The information about certain entities or mentions in one sentence should be carried over to the next sentence to form a coherent relation. It is possible that coreference chains and semantically-related predicates in the local context might show some patterns that characterize types of discourse relations. We hypothesize that coreferential rates and coreference patterns created by Brown clusters should help characterize different types of discourse relations.

Here, we introduce two novel sets of features for implicit discourse relation classification. Further, we investigate the effects of using Brown clusters as an alternative word representation and analyze the impactful features that arise from Brown cluster pairs. We also study coreferential patterns in different types of discourse relations in addition to using them to boost the performance of our classifier. These two sets of features, along with previously used features, outperform the baseline systems by approximately 5% absolute across all categories and reveal many important characteristics of implicit discourse relations.

2 Sense annotation in Penn Discourse Treebank

The Penn Discourse Treebank (PDTB) is the largest corpus richly annotated with explicit and implicit discourse relations and their senses (Prasad et al., 2008). PDTB is drawn from Wall Street Journal articles with overlapping annotations with the Penn Treebank (Marcus et al., 1993). Each discourse relation contains the information about the extent of the arguments, which can be a sentence, a constituent, or an incontiguous span of text. Each discourse relation is also annotated with the sense of the relation that holds between the two arguments. In the case of implicit discourse relations, where the discourse connectives are absent, the most appropriate connective is annotated.

The senses are organized hierarchically. Our focus is on the top level senses because they are the four fundamental discourse relations that various discourse analytic theories seem to converge on (Mann and Thompson, 1988). The top level senses are COMPARISON, CONTINGENCY, EXPANSION, and TEMPORAL.

The explicit and implicit discourse relations differ almost orthogonally in their distributions of senses (Table 1). This difference has a few implications for studying implicit discourse relations and the uses of discourse connectives (Patterson and Kehler, 2013). For example, TEMPORAL relations constitute only 5% of the implicit relations but 33% of the explicit relations because they might not be as natural to create without discourse connectives. On the other hand, EXPANSION relations might be more cleanly achieved without connectives, as indicated by their dominance among the implicit discourse relations. This imbalance in class distribution requires greater care in building statistical classifiers (Wang et al., 2012).

                      Number of instances
                Implicit           Explicit
COMPARISON       2503 (15.11%)      5589 (33.73%)
CONTINGENCY      4255 (25.68%)      3741 (22.58%)
EXPANSION        8861 (53.48%)        72 (0.43%)
TEMPORAL          950 (5.73%)       3684 (33.73%)
Total           16569 (100%)       13086 (100%)

Table 1: The distribution of senses of implicit discourse relations is imbalanced.

3 Experiment setup

We followed the setup of the previous studies for a fair comparison with the two baseline systems by Pitler et al. (2009) and Park and Cardie (2012). The task is formulated as four separate one-against-all binary classification problems: one for each top level sense of implicit discourse relations. In addition, we add one more classification task with which to test the system. We merge ENTREL with EXPANSION relations to follow the setup used by the two baseline systems. An argument pair is annotated with ENTREL in PDTB if an entity-based coherence relation and no other type of relation can be identified between the two arguments in the pair. In this study, we assume that the gold standard argument pairs are provided for each relation. Most argument pairs for implicit discourse relations are a pair of adjacent sentences or adjacent clauses separated by a semicolon and should be easily extracted.

The PDTB corpus is split into a training set, development set, and test set the same way as in the baseline systems. Sections 2 to 20 are used to train classifiers. Sections 0–1 are used for developing feature sets and tuning models. Sections 21–22 are used for testing the systems.

The statistical models in the following experiments are from the MALLET implementation (McCallum, 2002) and libSVM (Chang and Lin, 2011). For all five binary classification tasks, we try Balanced Winnow (Littlestone, 1988), Maximum Entropy, Naive Bayes, and Support Vector Machine. The parameters and the hyperparameters of each classifier are set to their default values. The code for our model along with the data matrices is available at github.com/attapol/brown_coref_implicit.
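To make the task formulation concrete, the sketch below shows how the five one-against-all binary label sets could be derived from top-level sense annotations. It is only an illustration of the setup described above, not the released code; the `relations` list of (arg1, arg2, sense) tuples and the exact sense strings are assumed placeholders for whatever the PDTB preprocessing produces.

```python
# Illustrative sketch of the five one-against-all binary tasks (hypothetical
# helper, not the released code).  `relations` is assumed to be a list of
# (arg1_tokens, arg2_tokens, sense) tuples with PDTB top-level senses;
# EntRel instances are assumed to carry the placeholder sense "EntRel".

TASKS = {
    "COMPARISON":  lambda s: s == "Comparison",
    "CONTINGENCY": lambda s: s == "Contingency",
    "EXPANSION":   lambda s: s == "Expansion",
    "EXP+ENTREL":  lambda s: s in ("Expansion", "EntRel"),
    "TEMPORAL":    lambda s: s == "Temporal",
}

def make_binary_labels(relations, task):
    """Return a 0/1 label for each relation instance under one one-vs-all task."""
    is_positive = TASKS[task]
    return [1 if is_positive(sense) else 0 for _, _, sense in relations]
```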

4 Features

Unlike the baseline systems, all of the features in the experiments use the output from automatic natural language processing tools. We use the Stanford CoreNLP suite to lemmatize and part-of-speech tag each word (Toutanova et al., 2003; Toutanova and Manning, 2000), obtain the phrase structure and dependency parses for each sentence (De Marneffe et al., 2006; Klein and Manning, 2003), identify all named entities (Finkel et al., 2005), and resolve coreference (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013).

4.1 Features used in previous work

The baseline features consist of the following: first, last, and first 3 words, numerical expressions, time expressions, average verb phrase length, modality, General Inquirer tags, polarity, Levin verb classes, and production rules. These features are described in greater detail by Pitler et al. (2009).

4.2 Brown cluster pair features

To generate Brown cluster assignment pair features, we replace each word with its hard Brown cluster assignment. We used the Brown word clusters provided by MetaOptimize (Turian et al., 2010). 3,200 clusters were induced from the RCV1 corpus, which contains about 63 million tokens from Reuters English newswire. Then we take the Cartesian product of the Brown cluster assignments of the words in Arg1 and the ones of the words in Arg2. For example, suppose Arg1 has two words w_{1,1}, w_{1,2} and Arg2 has three words w_{2,1}, w_{2,2}, w_{2,3}, and let B(.) map a word to its Brown cluster assignment. A word w_{i,j} is replaced by its corresponding Brown cluster assignment b_{i,j} = B(w_{i,j}). The resulting word pair features are (b_{1,1}, b_{2,1}), (b_{1,1}, b_{2,2}), (b_{1,1}, b_{2,3}), (b_{1,2}, b_{2,1}), (b_{1,2}, b_{2,2}), and (b_{1,2}, b_{2,3}). Therefore, this feature set can generate O(3200^2) binary features. The feature set size is orders of magnitude smaller than using the actual words, which can generate O(V^2) distinct binary features, where V is the size of the vocabulary.
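The pair construction just described can be sketched in a few lines. This is a minimal sketch under the assumption that `brown_map` is a dictionary from word to hard cluster id (for example, loaded from the MetaOptimize cluster file); it is not the authors' released implementation.

```python
from itertools import product

def brown_cluster_pair_features(arg1_tokens, arg2_tokens, brown_map):
    """Cartesian product of the Brown cluster assignments of Arg1 and Arg2 words.

    `brown_map` maps a word to its hard cluster id; words without a cluster
    are skipped here.  Because the features are binary, duplicates collapse.
    """
    clusters1 = {brown_map[w] for w in arg1_tokens if w in brown_map}
    clusters2 = {brown_map[w] for w in arg2_tokens if w in brown_map}
    # One binary feature per (b1, b2) pair with b1 from Arg1 and b2 from Arg2.
    return {f"BROWN_PAIR:{b1}_{b2}" for b1, b2 in product(clusters1, clusters2)}
```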
4.3 Coreference-based features

We want to take further advantage of the sentence pairs by considering how coreferential entities play out in them. We consider various inter-sentential coreference patterns to include as features and also to better describe each type of discourse relation with respect to its place in the coreference chain. For compactness in explaining the following features, we define similar words to be the words assigned to the same Brown cluster.

Number of coreferential pairs: We count the number of inter-sentential coreferential pairs. We expect that EXPANSION relations should be more likely to have coreferential pairs because the detail or information about an entity mentioned in Arg1 should be expanded in Arg2. Therefore, entity sharing might be difficult to avoid.

Similar nouns and verbs: A binary feature indicating whether similar or coreferential nouns are the arguments of similar predicates. Predicates and arguments are identified by dependency parses. We notice that sometimes the author uses synonyms while trying to expand on the previous predicates or entities. The words that indicate the common topics might be paraphrased, so exact string matching cannot detect whether the two arguments are still on the same topic. This might be useful for identifying CONTINGENCY relations, as they usually discuss two causally-related events that involve two seemingly unrelated agents and/or predicates.

Similar subject or main predicates: A binary feature indicating whether the main verbs of the two arguments have the same subjects or not, and another binary feature indicating whether the main verbs are similar or not. For our purposes, the two subjects are said to be the same if they are coreferential or assigned to the same Brown cluster. We notice that COMPARISON relations usually have different subjects for the same main verbs and that TEMPORAL relations usually have the same subjects but different main verbs.

4.4 Feature selection and training sample reweighting

The nature of the task and the dataset poses at least two problems in creating a classifier. First, the classification task requires a large number of features, some of which are too rare and inconducive to parameter estimation. Second, the label distribution is highly imbalanced (Table 1), and this might degrade the performance of the classifiers (Japkowicz, 2000). Recently, Park and Cardie (2012) and Wang et al. (2012) addressed these problems directly by optimally selecting a subset of features and training samples. Unlike previous work, we do not discard any of the data in the training set to balance the label distribution. Instead, we reweight the training samples in each class during parameter estimation such that the performance on the development set is maximized. In addition, the number of occurrences of each feature must be greater than a cut-off, which is also tuned to yield the highest performance on the development set.
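The frequency cut-off and sample reweighting can be illustrated as follows. This is a sketch of the idea only, not the MALLET configuration actually used; `prune_rare_features`, `sample_weights`, the cut-off, and the positive-class weight are hypothetical names and hyperparameters assumed to be tuned on the development set.

```python
from collections import Counter

def prune_rare_features(feature_lists, cutoff):
    """Drop features whose document frequency does not exceed the cut-off."""
    counts = Counter(f for feats in feature_lists for f in set(feats))
    keep = {f for f, c in counts.items() if c > cutoff}
    return [[f for f in feats if f in keep] for feats in feature_lists]

def sample_weights(labels, positive_weight):
    """Reweight training samples instead of discarding any of them."""
    return [positive_weight if y == 1 else 1.0 for y in labels]
```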

                          Current                   Park and Cardie (2012)   Pitler et al. (2009)
                          P       R       F1        F1                       F1
COMPARISON vs others      27.34   72.41   39.70     31.32                    21.96
CONTINGENCY vs others     44.52   69.96   54.42     49.82                    47.13
EXPANSION vs others       59.59   85.50   70.23     -                        -
EXP+ENTREL vs others      69.26   95.92   80.44     79.22                    76.42
TEMPORAL vs others        18.52   63.64   28.69     26.57                    16.76

Table 2: Our classifier outperforms the previous systems across all four tasks without the use of gold-standard parses and coreference resolution.

5 Results

Our experiments show that the Brown cluster and coreference features, along with the features from the baseline systems, improve the performance for all discourse relations (Table 2). Consistent with the results from previous work, the Naive Bayes classifier outperforms MaxEnt, Balanced Winnow, and Support Vector Machine across all tasks regardless of feature pruning criteria and training sample reweighting. A possible explanation is that the small dataset size in comparison with the large number of features might favor a generative model like Naive Bayes (Jordan and Ng, 2002). So we only report the performance from the Naive Bayes classifiers.

It is noteworthy that the baseline systems use the gold standard parses provided by the Penn Treebank, but ours does not, because we would like to see how our system performs realistically in conjunction with other pre-processing tasks such as lemmatization, parsing, and coreference resolution. Nevertheless, our system still manages to outperform the baseline systems in all relations by a sizable margin.

Our preliminary results on implicit sense classification suggest that the Brown cluster word representation and coreference patterns might be indicative of the senses of the discourse relations, but we would like to know the extent of the impact of these novel feature sets when used in conjunction with other features. To this aim, we conduct an ablation study, where we exclude one of the feature sets at a time and then test the resulting classifier on the test set. We then rank each feature set by the relative percentage change in F1 score when excluded from the classifier. The data split and experimental setup are identical to the ones described in the previous section but only with Naive Bayes classifiers.

COMPARISON
Feature set                               F1      % change
All features                              39.70   -
All excluding Brown cluster pairs         35.71   -10.05%
All excluding Production rules            37.27   -6.80%
All excluding First, last, and First 3    39.18   -1.40%
All excluding Polarity                    39.39   -0.79%

CONTINGENCY
Feature set                               F1      % change
All                                       54.42   -
All excluding Brown cluster pairs         51.50   -5.37%
All excluding First, last, and First 3    53.56   -1.58%
All excluding Polarity                    53.82   -1.10%
All excluding Coreference                 53.92   -0.92%

EXPANSION
Feature set                               F1      % change
All                                       70.23   -
All excluding Brown cluster pairs         67.48   -3.92%
All excluding First, last, and First 3    69.43   -1.14%
All excluding Inquirer tags               69.73   -0.71%
All excluding Polarity                    69.92   -0.44%

TEMPORAL
Feature set                               F1      % change
All                                       28.69   -
All excluding Brown cluster pairs         24.53   -14.50%
All excluding Production rules            26.51   -7.60%
All excluding First, last, and First 3    26.56   -7.42%
All excluding Polarity                    27.42   -4.43%

Table 3: Ablation study: the four most impactful feature classes and their relative percentage changes are shown. Brown cluster pair features are the most impactful across all relation types.

The ablation study results imply that Brown cluster features are the most impactful feature set across all four types of implicit discourse relations. When ablated, Brown cluster features degrade the performance by the largest percentage compared to the other feature sets regardless of the relation type (Table 3). TEMPORAL relations benefit the most from Brown cluster features. Without them, the F1 score drops by 4.12 absolute or 14.50% relative to the system that uses all of the features.

6 Feature analysis

6.1 Brown cluster features

This feature set is inspired by the word pair features, which are known for their effectiveness in predicting senses of discourse relations between the two arguments. Marcu and Echihabi (2002), for instance, artificially generated implicit discourse relations and used word pair features to perform the classification tasks. Those word pair features work well in this case because their artificially generated dataset is an order of magnitude larger than PDTB. Ideally, we would want to use the word pair features instead of word cluster features if we have enough data to fit the parameters. Consequently, other less sparse handcrafted features prove to be more effective than word pair features for the PDTB data (Pitler et al., 2009). We remedy the sparsity problem by clustering the words that are distributionally similar together, which greatly reduces the number of features.

Since the ablation study is not fine-grained enough to spotlight the effectiveness of the individual features, we quantify the predictiveness of each feature by its mutual information. Under the Naive Bayes conditional independence assumption, the mutual information between the features and the labels can be efficiently computed in a pairwise fashion. The mutual information between a binary feature X_i and class label Y is defined as:

    I(X_i, Y) = \sum_{y} \sum_{x=0,1} \hat{p}(x, y) \log \frac{\hat{p}(x, y)}{\hat{p}(x)\,\hat{p}(y)}

where \hat{p}(\cdot) is the probability distribution function whose parameters are maximum likelihood estimates from the training set. We compute mutual information for all four one-vs-all classification tasks. The computation is done as part of the training pipeline in MALLET to ensure consistency in parameter estimation and smoothing techniques. We then rank the cluster pair features by mutual information. The results are compactly summarized in the bipartite graphs shown in Figure 1, where each edge represents a cluster pair. Since mutual information itself does not indicate whether a feature is favored by one or the other label, we also verify the direction of the effects of each of the features included in the following analysis by comparing the class conditional parameters in the Naive Bayes model.
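A minimal sketch of this computation for a single binary feature and a single one-vs-all label is given below. It uses plain maximum likelihood estimates and does not reproduce the smoothing applied inside MALLET's training pipeline.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X;Y) for a binary feature X and a binary task label Y (MLE estimates)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))          # counts of (x, y) pairs
    marg_x, marg_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():       # unobserved cells contribute 0
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((marg_x[x] / n) * (marg_y[y] / n)))
    return mi
```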
The most dominant features for COMPARISON classification are the pairs whose members are from the same Brown clusters. We can distinctly see this pattern from the bipartite graph because the nodes on each side are sorted alphabetically. The graph shows many parallel short edges, which suggests that many informative pairs consist of the same clusters. Some of the clusters that participate in such pairs consist of named entities from various categories such as airlines (King, Bell, Virgin, Continental, ...) and companies (Thomson, Volkswagen, Telstra, Siemens). Some of the pairs form a broad category such as political agents (citizens, pilots, nationals, taxpayers) and industries (power, insurance, mining). These parallel patterns in the graph demonstrate that implicit COMPARISON relations might be mainly characterized by juxtaposing and explicitly contrasting two different entities in two adjacent sentences.

Without the use of a named-entity recognition system, these Brown cluster pair features effectively act as features that detect whether the two arguments in the relation contain named entities or nouns from the same categories or not. These more subtle named-entity-related features are cleanly discovered by replacing words with their data-driven Brown clusters without the need for additional layers of pre-processing.

If the words in one cluster semantically relate to the words in another cluster, the two clusters are more likely to become informative features for CONTINGENCY classification. For instance, technical terms in stock and trading (weighted, Nikkei, composite, diffusion) pair up with economic terms (Trading, Interest, Demand, Production). The cluster with analysts and pundits pairs up with the one that predominantly contains quantifiers (actual, exact, ultimate, aggregate). In addition to this pattern, we observed the same parallel pair pattern we found in COMPARISON classification. These results suggest that in establishing a CONTINGENCY relation implicitly, the author might shape the sentences such that they have semantically related words if they do not mention named entities of the same category.

[Figure 1 appears here: four bipartite graphs of the top Brown cluster pair features, one panel per task (Arg 1 – Arg 2 for COMPARISON, CONTINGENCY, EXPANSION, and TEMPORAL), with nodes labeled by representative members of each cluster.]

Figure 1: The bipartite graphs show the top 40 non-stopword Brown cluster pair features for all four classification tasks. Each node on the left and on the right represents a word cluster from Arg1 and Arg2, respectively. We only show the clusters that appear fewer than six times in the top 3,000 pairs to exclude stopwords. Although the four tasks are interrelated, some of the highest mutual information features vary substantially across tasks.

Through Brown cluster pairs, we obtain features that detect a shift between generality and specificity within the relation. For example, a cluster with industrial categories (Electric, Motor, Life, Chemical, Automotive) couples with specific brands or companies (GM, Ford, Barrick, Anglo). Or such a pair might simply reflect a shift in plurality, e.g. businesses - business and Analysts - analyst. EXPANSION relations capture relations in which one argument provides a specification of the previous one and relations in which one argument provides a generalization of the other. Thus, these shift detection features could help distinguish EXPANSION relations.

We found a few common coreference patterns of names in written English to be useful. The first and last name are used in the first sentence to refer to a person who just enters the discourse. That person is referred to just by his/her title and last name in the following sentence. This pattern is found to be informative for EXPANSION relations. For example, the edges (not shown in the graph due to lack of space) from the first name clusters to the title cluster (Mr, Mother, Judge, Dr).

Time expressions constitute the majority of the nodes in the bipartite graph for TEMPORAL relations. More strikingly, specific dates (e.g. clusters that contain positive integers smaller than 31) are more frequently found in Arg2 than Arg1 in implicit TEMPORAL relations. It is possible that TEMPORAL relations are more naturally expressed without a discourse connective if a time point is clearly specified in Arg2 but not in Arg1.

TEMPORAL relations might also be implicitly inferred through detecting a shift in quantities. We notice that clusters whose words indicate changes, e.g. increase, rose, loss, pair with number clusters. Sentences in which such pairs participate might be part of a narrative or a report where one expects a change over time. These changes conveyed by the sentences constitute a natural sequence of events that are temporally related but might not need explicit temporal expressions.

6.2 Coreference features

Coreference features are very effective given that they constitute a very small set compared to the other feature sets. In particular, excluding them from the model reduces F1 scores for TEMPORAL and CONTINGENCY relations by approximately 1% relative to the system that uses all of the features. We found that the sentence pairs in these two types of relations have distinctive coreference patterns.

We count the number of pairs of arguments that are linked by a coreference chain for each type of relation. The coreference chains used in this study are detected automatically from the training set through the Stanford CoreNLP suite (Raghunathan et al., 2010; Lee et al., 2011; Lee et al., 2013). TEMPORAL relations have a significantly higher coreferential rate than the other three relations (p < 0.05, pair-wise t-test corrected for multiple comparisons). The differences between COMPARISON, CONTINGENCY, and EXPANSION, however, are not statistically significant (Figure 2).

[Figure 2 appears here: a bar chart of the coreferential rate (all coreference and subject coreference) for COMPARISON, CONTINGENCY, EXPANSION, and TEMPORAL relations.]

Figure 2: The coreferential rate for TEMPORAL relations is significantly higher than the other three relations (p < 0.05, corrected for multiple comparison).

The choice to use or not to use a discourse connective is strongly motivated by linguistic features at the discourse level (Patterson and Kehler, 2013). Additionally, it is very uncommon to have temporally-related sentences without using explicit discourse connectives. The difference in coreference patterns might be one of the factors that influence the choice of using a discourse connective to signal a TEMPORAL relation. If sentences are coreferentially linked, then it might be more natural to drop a discourse connective because the temporal ordering can be easily inferred without it. For example,

(1) Her story is partly one of personal downfall. [previously] She was an unstinting teacher who won laurels and inspired students... (WSJ0044)

The coreference chain between the two temporally-related sentences in (1) can easily be detected. Inserting previously as suggested by the annotation from the PDTB corpus does not add to the temporal coherence of the sentences and may be deemed unnecessary. But the presence of the coreferential link alone might bias the inference toward a TEMPORAL relation while CONTINGENCY might also be inferred.

Additionally, we count the number of pairs of arguments whose grammatical subjects are linked by a coreference chain to reveal the syntactic-coreferential patterns in different relation types. Although this specific pattern seems rare, more than eight percent of all relations have coreferential grammatical subjects. We observe the same statistically significant differences between TEMPORAL relations and the other three types of relations. More interestingly, the subject coreferential rate for CONTINGENCY relations is the lowest among the three categories (p < 0.05, pair-wise t-test corrected for multiple comparisons).
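The coreferential rates reported in this section can be sketched as follows; this is an illustrative computation, not the authors' exact code. It assumes each argument is represented by the ids of the mentions it contains and that the automatic coreference output is a list of chains, each a set of mention ids.

```python
from collections import defaultdict

def coreferential_rate(relations, coref_chains):
    """Fraction of argument pairs linked by at least one coreference chain,
    computed separately for each relation sense.

    `relations`: list of (arg1_mention_ids, arg2_mention_ids, sense) tuples.
    `coref_chains`: list of sets of mention ids from the coreference output.
    """
    linked, total = defaultdict(int), defaultdict(int)
    for arg1_ids, arg2_ids, sense in relations:
        total[sense] += 1
        if any(chain & set(arg1_ids) and chain & set(arg2_ids)
               for chain in coref_chains):
            linked[sense] += 1
    return {sense: linked[sense] / total[sense] for sense in total}
```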

It is possible that coreferential subject patterns suggest temporal coherence between the two sentences without using an explicit discourse connective. CONTINGENCY relations, which can only indicate causal relationships when realized implicitly, impose a temporal ordering on the events in the arguments; i.e. if Arg1 is causally related to Arg2, then the event described in Arg1 must temporally precede the one in Arg2. Therefore, CONTINGENCY and TEMPORAL can be highly confusable. To understand why this pattern might help distinguish these two types of relations, consider these examples:

(2) He also asserted that exact questions weren't replicated. [Then] When referred to the questions that match, he said it was coincidental. (WSJ0045)

(3) He also asserted that exact questions weren't replicated. When referred to the questions that match, she said it was coincidental.

When we switch out the coreferential subject for an arbitrary uncoreferential pronoun as we do in (3), we are more inclined to classify the relation as CONTINGENCY.

7 Related work

Word-pair features are known to work very well in predicting senses of discourse relations in an artificially generated corpus (Marcu and Echihabi, 2002). But when used with a realistic corpus, model parameter estimation suffers from the data sparsity problem due to the small dataset size. Biran and McKeown (2013) attempt to solve this problem by aggregating word pairs and estimating weights from an unannotated corpus, but only with limited success.

Recent efforts have focused on introducing meaning abstraction and semantic representation between the words in the sentence pair. Pitler et al. (2009) use external lexicons to replace the one-hot word representation with semantic information such as word polarity and various verb classifications based on specific theories (Stone et al., 1968; Levin, 1993). Park and Cardie (2012) select an optimal subset of these features and establish the strongest baseline to the best of our knowledge.

Brown word clusters are hierarchical clusters induced by frequency of co-occurrences with other words (Brown et al., 1992). The strength of this word class induction method is that the words that are classified to the same clusters usually make an interpretable lexical class by virtue of their distributional properties. This word representation has been used successfully to augment the performance of many NLP systems (Ritter et al., 2011; Turian et al., 2010).

Louis et al. (2010) use multiple aspects of coreference as features to classify implicit discourse relations without much success while suggesting many aspects that are worth exploring. In a corpus study by Louis and Nenkova (2010), coreferential rates alone cannot explain all of the relations, and more complex coreference patterns have to be considered.

8 Conclusions

We present statistical classifiers for identifying senses of implicit discourse relations and introduce novel feature sets that exploit distributional similarity and coreference information. Our classifiers outperform the classifiers from previous work in all types of implicit discourse relations. Altogether these results present a stronger baseline for future research endeavors in implicit discourse relations.

In addition to enhancing the performance of the classifier, Brown word cluster pair features disclose some new aspects of implicit discourse relations. The feature analysis confirms our hypothesis that cluster pair features work well because they encapsulate relevant word classes, which constitute more complex informative features such as named-entity pairs of the same categories, semantically-related pairs, and pairs that indicate a specificity-generality shift. At the discourse level, Brown clustering is superior to a one-hot word representation for identifying inter-sentential patterns and the interactions between words.

Coreference chains that traverse through the discourse shed light on different types of relations. The preliminary analysis shows that TEMPORAL relations have much higher inter-argument coreferential rates than the other three senses of relations. Focusing on only subject-coreferential rates, we observe that CONTINGENCY relations show the lowest coreferential rate. The coreference patterns differ substantially and meaningfully across discourse relations and deserve further exploration.

References

Or Biran and Kathleen McKeown. 2013. Aggregated word pair features for implicit discourse relation disambiguation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 69–73. The Association for Computational Linguistics.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

Nathalie Japkowicz. 2000. Learning from imbalanced data sets: a comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets, volume 68.

Michael Jordan and Andrew Ng. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 423–430, Morristown, NJ, USA. Association for Computational Linguistics.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Association for Computational Linguistics.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics.

Beth Levin. 1993. English verb classes and alternations: A preliminary investigation, volume 348. University of Chicago Press, Chicago.

Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2010. A PDTB-styled end-to-end discourse parser. arXiv.org, November.

Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318.

Annie Louis and Ani Nenkova. 2010. Creating local coherence: An empirical assessment. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 313–316. Association for Computational Linguistics.

Annie Louis, Aravind Joshi, Rashmi Prasad, and Ani Nenkova. 2010. Using entity features to classify implicit discourse relations. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 59–62. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Daniel Marcu and Abdessamad Echihabi. 2002. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 368–375. Association for Computational Linguistics.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.

Joonsuk Park and Claire Cardie. 2012. Improving implicit discourse relation recognition through feature set optimization. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 108–112. Association for Computational Linguistics.

Gary Patterson and Andrew Kehler. 2013. Predicting the presence of discourse connectives. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, and Aravind K. Joshi. 2008. Easily identifiable discourse relations. Technical Reports (CIS), page 884.

Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 683–691. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse Treebank 2.0. In LREC. Citeseer.

Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky, and Christopher Manning. 2010. A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 492–501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics.

Philip Stone, Dexter C. Dunphy, Marshall S. Smith, and D. M. Ogilvie. 1968. The General Inquirer: A computer approach to content analysis. Journal of Regional Science, 8(1).

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 63–70. Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Xun Wang, Sujian Li, Jiwei Li, and Wenjie Li. 2012. Implicit discourse relation recognition by selecting typical training examples. In Proceedings of COLING 2012, pages 2757–2772, Mumbai, India, December. The COLING 2012 Organizing Committee.

Ben Wellner, James Pustejovsky, Catherine Havasi, Anna Rumshisky, and Roser Sauri. 2009. Classification of discourse coherence relations: An exploratory study using multiple knowledge sources. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 117–125. Association for Computational Linguistics.
