Paraphrasing vs Coreferring: Two Sides of the Same Coin

Yehudit Meged1 Avi Caciularu1,2 Vered Shwartz2,3 Ido Dagan1 1Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel 2Allen Institute for Artificial Intelligence 3Paul G. Allen School of Computer Science & Engineering, University of Washington {yehuditmeged,avi.c33}@gmail.com, [email protected], [email protected]

Abstract Tara Reid has checked into∨ Promises Treatment Center. Actress Tara Reid entered∨ well-known Malibu rehab center. We study the potential synergy between two Lindsay Lohan checked into× rehab in Malibu, California.

different NLP tasks, both confronting predi- Director Chris Weitz is expected to direct∨ New Moon. cate lexical variability: identifying predicate Chris Weitz will take on∨ the sequel to “”. paraphrases, and event coreference resolution. Gary Ross is still in negotiations to direct× the sequel. First, we used annotations from an event coreference dataset as distant supervision to Table 1: Examples from ECB+ (a cross-document re-score heuristically-extracted predicate para- coreference dataset) that illustrate the context-sensitive phrases. The new scoring gained more than nature of event coreference. The illustrated predicates 18 points in average precision upon their rank- are co-referable, and hence may be used to refer to the ing by the original scoring method. Then, we same event in certain contexts, but obviously not all used the same re-ranking features as additional their mentions corefer. inputs to a state-of-the-art event coreference resolution model, which yielded modest but consistent improvements to the model’s perfor- phrases. The former identifies and clusters event mance. The results suggest a promising direc- mentions, across multiple documents, that refer tion to leverage data and models for each of to the same event within their respective contexts. the tasks to the benefit of the other. The latter task, on the other hand, collects pairs of event expressions that, at the generic lexical level, 1 Introduction may refer to the same event in certain contexts. Recognizing that mentions of different lexical Table1 illustrates this difference with examples predicates discuss the same event is challenging of co-referable predicate paraphrases, while their (Barhom et al., 2019). Lexical resources such as mentions obviously do not always co-refer. WordNet (Miller, 1995) capture such synonyms Cross-document event coreference resolution (say, tell) and hypernyms (whisper, talk), as well as systems are typically supervised, usually trained on antonyms, which can be used to refer to the same the ECB+ dataset, which contains clusters of news event when the arguments are reversed ([a]0 beat articles on different topics (Cybulska and Vossen, [a]1, [a]1 lose to [a]0). However, WordNet’s cov- 2014). Recent systems rely on neural representa- erage is insufficient, in particular, missing context- tions of the mentions and their contexts (Kenyon- arXiv:2004.14979v2 [cs.CL] 9 Oct 2020 specific paraphrases (e.g. (hide, launder), in the Dean et al., 2018; Barhom et al., 2019), while ear- context of money). Conversely, distributional meth- lier approaches leveraged WordNet and other lex- ods enjoy broader coverage, but their precision for ical resources to obtain a signal of whether a pair this purpose is limited because distributionally sim- of mentions may be coreferring (e.g. Bejan and ilar terms may often be mutually-exclusive (born, Harabagiu, 2010; Yang et al., 2015). die) or may refer to different event types which Approaches for acquiring predicate paraphrase, are only temporally or causally related (sentenced, in the form of a pair of paraphrastic predicates or convicted). predicate templates, were based mostly on unsuper- Two prominent lines of work pertaining to iden- vised signals. These included similarity between ar- tifying predicates whose meaning or referents can gument distributions (Lin and Pantel, 2001; Berant, be matched are cross-document (CD) event coref- 2012), backtranslation across languages (Barzilay erence resolution and recognizing predicate para- and McKeown, 2001; Ganitkevitch et al., 2013; Mallinson et al., 2017), or leveraging redundant The standard datasets used for CD event corefer- news reports on the same event, which are hence ence training and evaluation are ECB+ (Cybulska likely to refer to the same events and entities using and Vossen, 2014), and its predecessors, EECB different words (Shinyama et al., 2002; Shinyama (Lee et al., 2012) and ECB (Bejan and Harabagiu, and Sekine, 2006; Barzilay and Lee, 2003; Zhang 2010). ECB+ contains a set of topics, each contain- and Weld, 2013; Xu et al., 2014; Shwartz et al., ing a set of documents describing the same global 2017). In some cases, the paraphrase collection event. Both event and entity coreferences are anno- phase includes a step of validating a subset of the tated in ECB+, within and across documents. paraphrases and training a model on these gold Models for CD event coreference utilize a range paraphrases to re-rank the entire resource (Lan of features, including lexical overlap among men- et al., 2017). tion pairs and semantic knowledge from WordNet In this paper, we study the potential synergy (Bejan and Harabagiu, 2010, 2014; Yang et al., between predicate paraphrases and event corefer- 2015), distributional (Choubey and Huang, 2017) ence resolution. We show that the data and models and contextual representations (Kenyon-Dean et al., for one task can benefit the other. In one direc- 2018; Barhom et al., 2019). tion (Section3), we use event coreference anno- The current state-of-the-art model from Barhom tations from the ECB+ dataset as distant super- et al.(2019) iteratively and intermittently learns vision to learn an improved scoring of predicate to cluster events and entities. A mention repre- paraphrases in the unsupervised Chirps resource sentation mi consists of several components, rep- (Shwartz et al., 2017). The distantly supervised resenting both the mention span and its surround- scorer significantly improves upon ranking by the ing context. The interdependence between cluster- original Chirps scores, adding 18 points to average ing event vs. entity mentions is encoded into the precision over a test sample. mention representation, such that an event mention In the other direction (Section4), we incorpo- representation contains a component reflecting the rate data from Chirps, represented in the Chirps current entity clustering, and vice versa. Using this re-scorer feature vector, into a state-of-the-art event representation, the model trains a pairwise mention coreference system (Barhom et al., 2019). Chirps scoring function that predicts the probability that has a substantial coverage over the ECB+ corefer- two mentions refer to the same event. ring mention pairs, and consequently, the incorpo- ration yields a modest but consistent improvement 2.2 Paraphrase Identification and acquisition across the various coreference metrics.1 Paraphrases are differing textual realizations of the same meaning (Ganitkevitch et al., 2013), typically 2 Background and Motivation phrases or sentences (Dolan et al., 2005). A promi- nent approach for identifying and collecting para- In this section we provide some background about phrases, backtranslation, assumes that if two (say) the cross-document coreference resolution and English phrases translate to the same term in a for- paraphrase identification (acquisition) tasks, which eign language, across multiple foreign languages, is relevant for our approaches for synergizing these this indicates that these two phrases are paraphrases. two tasks. This approach was first suggested by Barzilay and McKeown(2001), later adapted to acquire the large 2.1 Event Coreference Resolution PPDB resource (Ganitkevitch et al., 2013), and was Event coreference resolution aims to identify and also shown to work well with neural machine trans- cluster event mentions, that, within their respective lation (Mallinson et al., 2017). contexts, refer to the same event. The task has Paraphrase Identification through Event Coref- two variants, one in which coreferring mentions erence. An alternative approach for paraphrase are within the same document (within document) identification, on which we focus in this paper, and another in which corefering mentions may be leverages multiple news documents discussing the in different documents (cross-document, CD), on same event. The underlying assumption is that which we focus in this paper. such redundant texts may refer to the same entities 1Code available at github.com/yehudit96/coreferrability, or events using lexically-divergent mentions. Co- github.com/yehudit96/event entity coref ecb plus referring mentions are identified heuristically and extracted as candidate paraphrases. When long doc- news headlines posted on on the same day. uments are used, the first step in this approach is It extracts binary predicate-argument tuples from to align each pair of documents by sentences. This each tweet and aligns pairs of predicate mentions was done by finding sentences with shared named whose arguments match, by some lexical match- entities (Shinyama et al., 2002) or lexical overlap ing criteria. The matched pairs of arguments are (Barzilay and Lee, 2003; Shinyama and Sekine, termed supporting pairs, e.g. (Chuck Berry, 90) for 2006), and by aligning pairs of predicates or ar- “[Chuck Berry]0 died at [90]1” and “[Chuck Berry]0 guments (Zhang and Weld, 2013; Recasens et al., lived until [90]1”. The predicate paraphrases, i.e. 2013). In more recent work, Xu et al.(2014) and pairs of predicate templates (like “[a0] died at [a1]” Lan et al.(2017) extracted sentential paraphrases and “[a0] lived until [a1]”) are then aggregated and from Twitter by heuristically matching pairs of ranked with the following (unsupervised) heuristic tweets discussing the same topic. scoring function: s = n · (1 + d ). Predicate Paraphrases. In contrast to sentential N paraphrases, it is also beneficial to identify differ- This score is proportional to the number of sup- ing textual templates of the same meaning. In this porting pair instances in which the two templates paper we focus on binary predicate paraphrases were paired (n), as well as the number of different days in which such pairings were found (d), where such as (“[a0] quit from [a1]”, “[a0] resign from N is the number of days the resource is collected. [a1]”). Earlier approaches for acquiring predicate para- The Chirps resource provides the scored predicate phrases considered a pair of predicate templates as paraphrases as well as the supporting pairs for each paraphrases if the distributions of their argument paraphrase. Chirps has acquired more than 5 million distinct instantiations were similar. For instance, in “[a0] paraphrase pairs over the last 3 years. Human eval- quit from [a1]”, [a0] would typically be instantiated uation showed that this scoring is effective and that by people names while [a1] by employer organi- zations or job titles. A paraphrastic template like the percentage of correct paraphrases is higher for highly scored paraphrases. At the same time, due “[a0] resign from [a1]” is hence expected to have similar argument distributions, and can thus be de- to the heuristic collection and scoring of predicate tected by a distributional similarity approach (Lin paraphrases in Chirps, entries in the resource may and Pantel, 2001; Szpektor et al., 2004; Berant, suffer from two types of errors: (1) type 1 error, i.e., 2012). Yet, as mentioned earlier, predicates with the heuristic recognized pairs of non-paraphrastic similar argument distributions are not necessarily predicates as paraphrases. This happens when the paraphrastic, which introduces a substantial level same arguments participate in multiple different of noise when acquiring paraphrase pairs using this events, as in the following paraphrases: “[Police]0 approach. arrest [man]1” and “[Police]0 shoot [man]1”; and (2) type 2 error, when the scoring function assigned In this paper, we follow the potentially more re- a low score to a rare but correct paraphrase pair, as liable paraphrase acquisition approach, which tries in “[a ] outgun [a ]” and “[a ] outperform [a ]”, to heuristically identify concrete co-referring pred- 0 1 0 1 for which only a single supporting pair was found. icate mentions. Identifying such mention pairs, detected as actually being used to refer to the same 3 Chirps*: Leveraging Coreference event, can provide a strong signal for identifying Information for Paraphrasing these predicates as paraphrastic (vs. the quite noisy corpus-level signal of distributional similarity). In Our goal in this section is to improve paraphrase particular, we utilize the Chirps paraphrase acqui- scoring, in the context of Chirps, while leverag- sition method and resource, which follows this ap- ing available information and methods for event proach as described next in some detail. cross-document coreference resolution. To that end, we introduce Chirps*, a new supervised scorer for Chirps: a Coreference-Driven Paraphrase Re- Chirps candidate paraphrases, whose novelties are source. Chirps (Shwartz et al., 2017) is a re- two fold. First, we extract a richer feature represen- source of predicate paraphrases extracted heuris- tation for a candidate paraphrase pair (Section 3.1), tically from Twitter. Chirps aims to recognize which is fed into a supervised classifier for the can- coreferring events by relying on the redundancy of didates. Second, we collect, semi-automatically, distantly supervised training data for paraphrase tweet, respectively. We define a Named Entity classification, which is derived from the ECB+ Coverage score, NEC, as the maximum ratio of cross-document coreference training set, leverag- named entity coverage of one article by the other: ing the close relationship between the two tasks  T T  NEC(NE ,NE ) = max |NE1 NE2| , |NE1 NE2| (Section 3.2). Finally, we provide some implemen- 1 2 |NE1| |NE2| tation details (Section 3.3). We manually annotated a small balanced train- 3.1 Features ing set of 121 tweet pairs and used it to tune a As described above, the original heuristic Chirps score threshold T = 0.26, such that pairs of tweets scorer relied only on a couple of features to score whose NEC is at least T are considered corefer- a candidate paraphrase pair. Our goal is to obtain ring. Finally, we include the following features: a richer signal about the likelihood of a candidate the number of coreferring tweet pairs (whose NEC predicate pair to indeed be paraphrastic. To that score exceeds T ) and the average NEC score of end, we collect a set of features from the avail- these pairs. able data, with a focus on assessing whether the instances from which the candidate pair was ex- Cross-document Coreference Resolution We tracted indeed constitute cross-document corefer- apply the state-of-the-art cross-document coref- ences. erence model from Barhom et al.(2019) to data Each candidate paraphrase pair consists of constructed such that each tweet constitutes a doc- ument and each pair of tweets corresponding to two predicate templates p1 and p2, accompa- tj tj support-pairs(p , p ) nied by the n supporting pair instances for the 1 and 2 in 1 2 forms a topic, pair, each consisting a pair of argument terms, to be analyzed for coreference. As input for the associated with this predicate paraphrase pair: model, in each tweet, we mark the corresponding 1 1 n n predicate span as an event mention and the two support-pairs(p1, p2) = {(t1, t2), ..., (t1 , t2 )}. Each tweet included in Chirps links to a news arti- argument spans as entity mentions. The model cle, whose content we retrieve. When representing outputs whether the two event mentions corefer a pair of predicate templates, we include both local (yielding a single event coreference cluster for the features (based on a single supporting pair) and two mentions) or not (yielding two singleton clus- global features (based on all supporting pairs). ters). Similarly, it clusters the four arguments to entity coreference clusters. Table2 presents our 17 features, yielding a fea- 17 Differently from Chirps, this model makes its ture representation fp1,p2 ∈ R for a paraphrase pair, grouped by different sources of information. event clustering decision based on the predicate, The first group includes features derived from the arguments, and the context of the full tweet, as op- statistics provided by the original Chirps resource. posed to considering the arguments alone. Thus, The other 4 sources of information are described in we expect it not to cluster predicates whose argu- the following paragraphs. ments match lexically, if their contexts or pred- icates don’t match (first example in Table3). Named Entity Coverage While the original In addition, the model’s mentions representa- Chirps method did not utilize the content of the tion might help to identify lexically-divergent yet linked article, we find it useful to retrieve more semantically-similar arguments (second example information about the event. Specifically, it might in Table3). help mitigating errors in Chirps’ argument match- For a given pair of tweets, we extract the follow- ing mechanism, which relies on argument align- ing binary features with respect to the predicate ment considering only the text of the two tweets. mentions: Event Perfect when the predicates are We found that the original mechanism worked par- assigned to the same cluster, and Event No Match ticularly well for named entities while being more when each predicate forms a singleton cluster. For error-prone for common nouns, which might re- argument mentions, we extract the following fea- quire additional context. tures: Entity Perfect if the two a0 arguments belong i i Given (t1, t2) ∈ support-pairs(p1, p2), we use to one cluster and the two a1 arguments belong to SpaCy (Honnibal and Montani, 2017) to extract another cluster; Entity Reverse if at least one of sets of named entities, NE1 and NE2, from the the a0 arguments is clustered as coreferring with first paragraph of the news article linked from each the a1 argument in the other tweet; and Entity No Name Description

The number of different predicate paraphrase pairs with p1 and p as predicates, regardless of argument ordering, e.g. “[a ] # Templates 2 0 release [a1]” / “[a0] reveal [a1]” and “release [a0] [a1]” / “reveal [a0] [a1]” are both counted. The total number of support pairs of p and p across the tem- # Supporting pairs 1 2 plate variants. Chirps- The total number of days d that p and p was matched in based # Days 1 2 Features Chirps across the template variants. The number of support pairs of p and p across the template # Available supporting pairs 1 2 variants that were still available to download. The total number of days d in which the support pairs above # Days of available pairs occurred in the available tweets. Score The maximal Chirps score across the template variants.

Named # NEC above threshold Number of pairs with NEC score of at least T . Entity Coverage Average above threshold Average of NEC scores for pairs with a score of at least T .

# Event Perfect Number of event pairs with perfect match. # Event No Match Cross- Number of event pairs with no match. document # Entity Perfect Number of entity pairs with perfect match. Coreference # Entity Reverse Number of entity pairs with reverse match. Resolution # Entity No Match Number of entity pairs with no match. The number of pairs with NEC score of at least T and perfect # Perfectly Clustered + NEC clustering for event coreference resolution.

Connected # Connected components The number of connected components in G . Compo- p1,p2

nents Average component size The average size of connected components in Gp1,p2 .

The number of pairs in support-pairs(p , p ) that are in a Clique # In Clique 1 2 clique.

Table 2: All 17 features used by our scorer, for a given predicate paraphrase pair p1, p2, as detailed in Section 3.1.

[Police]0 [arrest] [two men]1 in incident at Westboro Beach. supporting pairs were matched relative to the en- [Police]0 [kill] [man]1 in Vegas hospital who grabbed gun. tire collection period. The latter lowers the score [Police]0 [arrest] [man]1 in incident at Westboro Beach. of paraphrase pairs which might have been mistak- [Officers]0 [seize] [guy]1 in incident at Westboro Beach. enly aligned on relatively few days (e.g. due to mis- leading argument alignments in particular events). Table 3: Examples of coreference errors made by Chirps and corrected by Barhom et al.(2019): 1) false The number of days in which the predicates were positive: wrong man / two men alignment (disregard- aligned is taken as a proxy for the number of differ- ing location modifiers). 2) (hypothetical) false negative: ent events in which the predicates co-refer. Here, lexically-divergent yet semantically-similar arguments. we aim to get a more reliable partition of tweets to different events by constructing a graph of tweets as nodes, with supporting tweet pairs as edges, and Match otherwise. Also, we extract Perfectly Clus- looking for connected components. tered with NE Coverage that combines both Named Entity coverage and coreference-resolution, which To that end, we define a bipartite graph count the number of pairs their events are perfectly Gp1,p2 = (V,E) for a candidate paraphrase clustered and with NEC score of at least T. pair, where V = tweets(p1, p2) contains all the tweets in which p1 or p2 appeared, and Connected Components The original Chirps E = support-pairs(p1, p2). We compute C, the score of a predicate paraphrase pair is proportional number of connected components in Gp1,p2 , and to two parameters: (1) the number of supporting define the following group: ConComp = {c ∈ pairs; (2) the ratio of number of days in which C : |c| > 2}, which represents the number of connected components with size greater than Pre Annotation Post Annotation 2. From this group we derive two features Train # positive 266 803 #connected(p1, p2) = |ConComp| which repre- # negative 2040 1056 sents the number of the connected components and Dev # positive 93 222 # negative 539 318 avg connected(p1, p2), which is the average size Test # positive 131 352 of the connected components in the graph. A larger # negative 758 411 number of connected components indicates that the two predicates were aligned across a large number Table 4: Statistics of the paraphrase scorer dataset. The difference in size before and after the annotation is due of likely different events. to omitting examples with less than 3 supporting pairs. Clique We similarly build a global tweet graph 0 0 ch-ECB+ Test Set ch-Random for all the predicate pairs, Gall = (V ,E ), 0 Model AP Accuracy (P / R / F 1) AP where V = ∪(p1,p2) tweets(p1, p2), and E0 = ∪ support-pairs(p , p ). We compute GloVe 50.7 55.7 (58.1 / 14.2 / 22.8) 50.1 (p1,p2) 1 2 Chirps 62.5 60.4 (59.9 / 45.7 / 51.6) 51.4 Q, the set of cliques in Gall of size greater Chirps* 80.0 73.8 (74.1 / 66.5 / 70.1) 59.5 than 2. We assume that a pair of tweets are more likely to be coreferring if they are part Table 5: Ranker (AP) and classification (Accuracy, Pre- of a bigger clique, whereas if they were ex- cision, Recall, F 1) results on ch-ECB+ test set (middle tracted by mistake they wouldn’t share many column), and Ranker results on ch-Random, a subset of 500 randomly selected predicate pairs from the en- neighbors. We extract the following feature of tire Chirps resource (right column). clique coverage for a candidate paraphrase pair: j j CLC(p1, p2) = |{t1, t2 ∈ support-pairs(p1, p2): j j ∃q ∈ Q such that t1 ∈ q ∧ t2 ∈ q}|. coreferring in a given context. To reduce the rate of such false-negative examples, we validated all 3.2 Distantly Supervised Labels the candidate negative examples and a sample of In order to learn to score the paraphrases, we need the positive examples using Amazon Mechanical gold standard labels, i.e., labels indicating whether Turk. Following Shwartz et al.(2017), we anno- a pair of predicate templates collected by Chirps is tated the templates while presenting 3 argument indeed a paraphrase. Instead of collecting manual instantiations from their original tweets. Thus, we annotations for a sample of the Chirps data, we only included in the final data predicate pairs with chose a low-budget distant supervision approach. at least 3 supporting pairs. We required that work- To that end, we leverage the similarity between the ers have 99% approval rate on at least 1,000 prior predicate paraphrase extraction and the event coref- tasks and pass a qualification test. erence resolution tasks, and use the annotations Each example was annotated by 3 workers. We from the ECB+ dataset. aggregated the per-instantiation annotations using Our dataset consists of the predicate paraphrases majority vote and considered a pair as positive if at from Chirps that appear in ECB+ (denoted ch- least one instantiation was judged as positive. The ECB+). As positive examples we consider all pairs data statistics are given in Table4. This validation of predicates p1, p2 from Chirps that appear in the phase balanced the positive-negative proportion of same event cluster in ECB+, e.g., from {talk, say, instances in the data, from approximately 1:7 to tell, accord to, confirm} we extract (talk, say), (talk, approximately 4:5. tell), ..., (accord to, confirm). 3.3 Model Obtaining negative examples is a bit trickier. We consider as negative example candidates pairs of We trained a random forest classifier (Breiman, predicates p1, p2 from Chirps, which are under the 2001) implemented by the scikit-learn framework same topic, but in different event clusters in ECB+, (Pedregosa et al., 2011). To tune the hyper- e.g., given the clusters {specify, reveal, say}, and parameters, we ran a 3 fold cross-validation ran- {get}, we extract (specify, get), (reveal, get), and domized search, yielding the following values: 157 (say, get). estimators, max depth of 8, minimum samples leaf 2 Note that the ECB+ annotations are context- of 1, and min samples split of 10. dependent. Thus a pair of predicates that are in 2We chose random forest over a neural model because of principle coreferable may be annotated as non- the small size of the training set. Rank GloVe Chirps Chirps* Table5 displays the accuracy, precision, recall 1 take over / take up announce / unveil launch / unveil and F1 scores for classification evaluation and the 11 begin / continue introduce / unveil buy / purchase 21 find / go accuse / warn add / introduce Average Precision (AP) for ranking evaluation. Our 31 move / plan launch / unveil announce / launch 41 buy / pay announce / enter accuse / warn scorer dramatically improves upon the baselines in 51 die / kill accuse / threaten cast / set all metrics. 61 announce / confirm arrest / charge say / warn 71 deal / say kill / tell fire / use To show that the improved scoring generalizes 81 condemn / deny attack / strike cause / trigger 91 believe / move claim / kill refresh / upgrade beyond examples that appear in the ECB+ dataset, we selected a random subset of 500 predicate pairs Table 6: Highly ranked paraphrases from the intersec- with at least 6 support pairs from the entire Chirps tion of the candidate coreference pairs in the ECB+ test resource and annotated them in the same method set and Chirps (ch-ECB+ test set). The table presents described in Section 3.2. The ranker evaluated on 10 highly ranked paraphrases, as ranked by each of this subset gained 8 points in AP, relative to the GloVe, Chirps and our method, taken from ranks 1, 11, 21 and so on in each ranking. Pairs labeled as positive original Chirps ranking. All results are statistically in our gold dataset are highlighted in purple. significant using bootstrap and permutation tests with p < 0.001 (Dror et al., 2018). Ablated Feature Set AP Classification Metrics Table6 exemplifies highly ranked predicate pairs Acc. P R F1 by our Chirps* scorer, the original Chirps scorer Chirps 73.94 72.96 74.36 52.25 61.38 and the GloVe scorer, which illustrates the im- Named Entity Coverage 76.43 74.44 73.86 58.56 65.33 proved ranking performance of Chirps* (as mea- Coreference 75.22 72.04 72.32 51.8 60.37 Connected Component 76.59 73.51 74.53 54.05 62.66 sured in table5 by the AP score). Clique 76.78 73.19 72.51 55.85 63.1 Ablation Test To evaluate the importance of All Features 77.13 73.88 73.96 56.3 63.94 each type of feature, we perform an ablation test. Table 7: Ablation test results on the ch-ECB+ dev set Table7 displays the performance of various ablated for the ranking and classification models. models, each of which with one set of features (Section 3.1) removed from the representation. In the classification task, removing the named entity 3.4 Evaluation coverage features somewhat improved the perfor- We used the model for two purposes: (1) classifi- mance, mostly by increasing the recall. However, cation: determining if a pair of predicate templates in terms of the (primary) ranking evaluation, each are paraphrases or not; and (2) ranking the pairs set of features contributed to the performance, with based on the predicted positive class score. We con- the full model performing best. sider the ranking evaluation as more informative, as we expect the ranking to reflect the number of 4 Leveraging a Paraphrasing Resource contexts in which a pair of predicates may be core- to Improve Coreference ferring. That is, predicate pairs that are coreferring In Section3 we showed that leveraging CD event in many contexts will be ranked higher than those coreference annotations and model improves predi- that are coreferring in just a few contexts. cate paraphrase ranking. In this section, we show We compare our model with two baselines: the that this co-dependence can be used in both direc- original Chirps scores, and a baseline that assigns tions, and that using Chirps* as an external resource each pair of predicates the cosine similarity scores can improve the performance of a CD model. between the predicates using GloVe embeddings As a preliminary analysis, we computed Chirps’ (Pennington et al., 2014).3 For the classification coverage of lexically-divergent pairs of co-referring decisions made by the two baseline scores (Chirps event mentions in ECB+. We found approximately score and cosine similarity for Glove vectors), we 30% coverage overall and above 50% coverage learn a threshold that yields the best accuracy score for coreferring verbal mentions. 4 This indicates over the train set, above which a pair of predicates a substantial coverage of the lexically-divergent is classified as positive. positive coreferrability decisions that need to be 3We were motivated to compare with Glove since this made in ECB+. In absolute numbers, Chirps covers resource is utilized as a lexical representation in the state-of- the-art system of (Barhom et al., 2019), which is utilized in the 4Non-verbal mentions in ECB+ include nominalizations next section. Here, multi-word predicates were represented by (investigation), names (Oscars) acronyms (DUI), and more. the average of their Glove word vectors. Chirps, by design, consists of verbal predicates only. MUC B3 CEAF-e CoNLL Model RP F1 RP F1 RP F1 F1 LEMMA BASELINE 76.5 79.9 78.1 71.7 85 77.8 75.5 71.7 73.6 76.5 Barhom et al.(2019) 77.6 84.5 80.9 76.1 85.1 80.3 81 73.8 77.3 79.5 Barhom et al.(2019) + C HIRPS*FEATURES 78.7 84.67 81.61 75.87 85.91 80.5 81.09 74.77 77.8 80.0 Table 8: Event mentions coreference-resolution results on ECB+ test set. about 370 lexically-divergent pairs of coreferring Pairwise Score event mentions appearing in the ECB+ training set, and about 200 in the test set. MLPscorer 4.1 Integration Method The state-of-the-art CD coreference resolution model, by Barhom et al.(2019), trained a pairwise mention scoring function, MLPscorer(mi, mj), Mention Pair Chirps* Features which predicts the probability that two mentions Representation mi, mj refer to the same event. The mention rep- resentation includes a lexical component (GloVe MLPch embeddings) as well as a contextual component Barhom et al. Representation (ELMo embeddings, Peters et al., 2018). The men- Generator Raw Chirps* tion pair representation ~vi,j, which is fed to the Features pairwise scorer, combines the two separate men- tion representations. Chirps* Feature We extended the model by changing the input to Extractor the pairwise event mention scoring function to in- clude information regarding the mention pair from mi, mj Chirps*, as illustrated in Figure1. We defined Mention 0 v~i,j = [~vi,j;~ci,j], where ~ci,j denotes the Chirps* Pair features, computed in the following way: Figure 1: An illustration of the integrated mention pair scorer. The right vector is the original mention pair ( ~ vector from Barhom et al.(2019), and the left one is MLPch(fmi,mj ) if mi, mj ∈ Chirps our Chirps* extension, which is transformed through ~ci,j = MLPch(~0) otherwise MLPch into the same embedding space. The two vec- tors are concatenated to form the mention pair represen- ~ 17 tation, which is fed to the scoring function MLPscorer. fmi,mj ∈ R is the feature vector representing a pair of predicates (mi, mj) for which there is an entry in Chirps, otherwise the input is a zero CEAF-e (Luo, 2005) and CoNLL F1 (the average 3 vector. MLPch is an MLP with a single hidden of MUC, B and CEAF-e scores). layer of size 50 and output layer of size 100, which We compare the integrated model to the original ~ is used to transform the discrete values in fmi,mj model and to the lemma baseline which clusters into the same embedding space of ~vi,j. The rest of together mentions that share the same mention- the model remains the same, including the model head lemma. The results in Table8 show that the architecture, training, and inference.5 Chirps-enhanced model provides an improvement of 3.5 points over the lemma baseline and a small 4.2 Evaluation improvement upon Barhom et al.(2019) in all F1 We evaluate the event coreference performance on score measures. The greatest improvement is in the ECB+ using the official CoNLL scorer (Pradhan link-based MUC measure, which counts the num- et al., 2014). The reported metrics are MUC (Vi- ber of corresponding links between the mentions. lain et al., 1995), B3 (Bagga and Baldwin, 1998), The Chirps component helps link more coreferring mentions (improving recall) and prevents the link- 5We also tried to incorporate only the final Chirps* score into the mention pair representation, but the performance im- ing of some wrong mentions (improving precision). provement was smaller. Although the gap between our model and the original model by Barhom et al.(2019) is statisti- multiple-sequence alignment. In Proceedings of the cally significant (bootstrap and permutation tests 2003 Human Language Technology Conference of with p < 0.001), it is rather small. We can at- the North American Chapter of the Association for Computational Linguistics, pages 16–23. tribute it partly to the coverage of Chirps over ECB+ (around 30%), which entails that the ma- Regina Barzilay and Kathleen R. McKeown. 2001. Ex- jority of event mention pairs still have the same tracting paraphrases from a parallel corpus. In Pro- representation as in the original model. We also ceedings of the 39th Annual Meeting of the Associ- ation for Computational Linguistics, pages 50–57, note that ECB+ suffers from annotation errors, as Toulouse, France. Association for Computational was observed by Barhom et al.(2019) and others. Linguistics. 5 Conclusion and Future Work Cosmin Bejan and Sanda Harabagiu. 2010. Unsuper- vised event coreference resolution with rich linguis- We studied the synergy between the tasks of identi- tic features. In Proceedings of the 48th Annual Meet- fying predicate paraphrases and event coreference ing of the Association for Computational Linguis- tics, pages 1412–1422, Uppsala, Sweden. Associa- resolution, both concerned with matching the mean- tion for Computational Linguistics. ings of lexically-divergent predicates, and showed that they can benefit each other. Using event Cosmin Adrian Bejan and Sanda Harabagiu. 2014. Un- coreference annotations as distant supervision, we supervised event coreference resolution. Computa- tional Linguistics, 40(2):311–347. learned to re-rank predicate paraphrases that were initially ranked heuristically, and managed to in- Jonathan Berant. 2012. Global Learning of Textual En- crease their average precision substantially. In the tailment Graphs. Ph.D. thesis, Tel Aviv University. other direction, we incorporated knowledge from Machine learn- our re-ranked predicate paraphrases resource into Leo Breiman. 2001. Random forests. ing, 45(1):5–32. a model for event coreference resolution, yielding a small improvement upon previous state-of-the- Prafulla Kumar Choubey and Ruihong Huang. 2017. art results. We hope that our study will encourage Event coreference resolution by iteratively unfold- future research to make further progress on both ing inter-dependencies among events. In Proceed- ings of the 2017 Conference on Empirical Methods tasks jointly. in Natural Language Processing, pages 2124–2133, Copenhagen, Denmark. Association for Computa- Acknowledgements tional Linguistics.

This work was supported in part by grants from In- Agata Cybulska and Piek T. J. M. Vossen. 2014. Using tel Labs, Facebook, the Israel Science Foundation a sledgehammer to crack a nut? lexical diversity and grant 1951/17, the Israeli Ministry of Science and event coreference resolution. In LREC. Technology and the German Research Foundation Bill Dolan, Chris Brockett, and Chris Quirk. 2005. through the German-Israeli Project Cooperation Microsoft research paraphrase corpus. Retrieved (DIP, grant DA 1600/1-1). March, 29(2008):63.

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Re- References ichart. 2018. The hitchhiker’s guide to testing statis- tical significance in natural language processing. In Amit Bagga and Breck Baldwin. 1998. Algorithms Proceedings of the 56th Annual Meeting of the As- for scoring coreference chains. In The first interna- sociation for Computational Linguistics (Volume 1: tional conference on language resources and evalua- Long Papers), pages 1383–1392, Melbourne, Aus- tion workshop on linguistics coreference, volume 1, tralia. Association for Computational Linguistics. pages 563–566. Granada. Juri Ganitkevitch, Benjamin Van Durme, and Chris Shany Barhom, Vered Shwartz, Alon Eirew, Michael Callison-Burch. 2013. Ppdb: The paraphrase Bugert, Nils Reimers, and Ido Dagan. 2019. Re- database. In Proceedings of the 2013 Conference of visiting joint modeling of cross-document entity and the North American Chapter of the Association for event coreference resolution. In Proceedings of the Computational Linguistics: Human Language Tech- 57th Annual Meeting of the Association for Com- nologies, pages 758–764. putational Linguistics, pages 4179–4189, Florence, Italy. Association for Computational Linguistics. Matthew Honnibal and Ines Montani. 2017. Spacy. Natural language understanding with bloom embed- Regina Barzilay and Lillian Lee. 2003. Learning dings, convolutional neural networks and incremen- to paraphrase: An unsupervised approach using tal parsing. Kian Kenyon-Dean, Jackie Chi Kit Cheung, and Doina Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Precup. 2018. Resolving event coreference with Gardner, Christopher Clark, Kenton Lee, and Luke supervised representation learning and clustering- Zettlemoyer. 2018. Deep contextualized word repre- oriented regularization. In Proceedings of the sentations. arXiv preprint arXiv:1802.05365. Seventh Joint Conference on Lexical and Com- putational Semantics, pages 1–10, New Orleans, Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Ed- Louisiana. Association for Computational Linguis- uard Hovy, Vincent Ng, and Michael Strube. 2014. tics. Scoring coreference partitions of predicted men- tions: A reference implementation. In Proceed- Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. ings of the 52nd Annual Meeting of the Association A continuously growing dataset of sentential para- for Computational Linguistics (Volume 2: Short Pa- phrases. In Proceedings of the 2017 Conference on pers), pages 30–35, Baltimore, Maryland. Associa- Empirical Methods in Natural Language Processing, tion for Computational Linguistics. pages 1224–1234, Copenhagen, Denmark. Associa- tion for Computational Linguistics. Marta Recasens, Matthew Can, and Daniel Jurafsky. 2013. Same referent, different words: Unsupervised Heeyoung Lee, Marta Recasens, Angel Chang, Mihai mining of opaque coreferent mentions. In Proceed- Surdeanu, and Dan Jurafsky. 2012. Joint entity and ings of the 2013 Conference of the North Ameri- event coreference resolution across documents. In can Chapter of the Association for Computational Proceedings of the 2012 Joint Conference on Empir- Linguistics: Human Language Technologies, pages ical Methods in Natural Language Processing and 897–906, Atlanta, Georgia. Association for Compu- Computational Natural Language Learning, pages tational Linguistics. 489–500, Jeju Island, Korea. Association for Com- putational Linguistics. Yusuke Shinyama and Satoshi Sekine. 2006. Preemp- tive information extraction using unrestricted rela- Dekang Lin and Patrick Pantel. 2001. DIRT - Discov- tion discovery. In Proceedings of the Human Lan- ery of inference rules from text. In Proceedings of guage Technology Conference of the NAACL, Main the seventh ACM SIGKDD international conference Conference, pages 304–311, City, USA. on Knowledge discovery and data mining, pages Association for Computational Linguistics. 323–328. Yusuke Shinyama, Satoshi Sekine, and Kiyoshi Sudo. Xiaoqiang Luo. 2005. On coreference resolution per- 2002. Automatic paraphrase acquisition from news Proceedings of the second interna- formance metrics. In Proceedings of Human Lan- articles. In guage Technology Conference and Conference on tional conference on Human Language Technology Research Empirical Methods in Natural Language Processing, , pages 313–318. Morgan Kaufmann Pub- pages 25–32, Vancouver, British Columbia, Canada. lishers Inc. Association for Computational Linguistics. Vered Shwartz, Gabriel Stanovsky, and Ido Dagan. 2017. Acquiring predicate paraphrases from news Jonathan Mallinson, Rico Sennrich, and Mirella Lap- tweets. In Proceedings of the 6th Joint Conference ata. 2017. Paraphrasing revisited with neural ma- on Lexical and Computational Semantics (*SEM Proceedings of the 15th Con- chine translation. In 2017), pages 155–160, Vancouver, Canada. Associa- ference of the European Chapter of the Association tion for Computational Linguistics. for Computational Linguistics: Volume 1, Long Pa- pers, pages 881–893, Valencia, Spain. Association Idan Szpektor, Hristo Tanev, Ido Dagan, and Bonaven- for Computational Linguistics. tura Coppola. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of the 2004 George A Miller. 1995. Wordnet: a lexical database for Conference on Empirical Methods in Natural Lan- english. Communications of the ACM, 38(11):39– guage Processing, pages 41–48. 41. Marc Vilain, John Burger, John Aberdeen, Dennis Con- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, nolly, and Lynette Hirschman. 1995. A model- B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, theoretic coreference scoring scheme. In Proceed- R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, ings of the 6th conference on Message understand- D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- ing, pages 45–52. Association for Computational esnay. 2011. Scikit-learn: Machine learning in Linguistics. Python. Journal of Machine Learning Research, 12:2825–2830. Wei Xu, Alan Ritter, Chris Callison-Burch, William B Dolan, and Yangfeng Ji. 2014. Extracting lexically Jeffrey Pennington, Richard Socher, and Christopher divergent paraphrases from twitter. Transactions Manning. 2014. Glove: Global vectors for word rep- of the Association for Computational Linguistics, resentation. In Proceedings of the 2014 Conference 2:435–448. on Empirical Methods in Natural Language Process- ing (EMNLP), pages 1532–1543, Doha, Qatar. Asso- Bishan Yang, Claire Cardie, and Peter Frazier. 2015. A ciation for Computational Linguistics. hierarchical distance-dependent bayesian model for event coreference resolution. Transactions of the As- sociation for Computational Linguistics, 3:517–528. Congle Zhang and Daniel S Weld. 2013. Harvest- ing parallel news streams to generate paraphrases of event relations. In Proceedings of the 2013 Con- ference on Empirical Methods in Natural Language Processing, pages 1776–1786.