
Coreference for Discourse Parsing: A Neural Approach

Grigorii Guz and Giuseppe Carenini
Department of Computer Science
University of British Columbia
Vancouver, BC, Canada, V6T 1Z4
{gguz, carenini}@cs.ubc.ca

Abstract

We present preliminary results on investigating the benefits of coreference resolution features for neural RST discourse parsing by considering different levels of coupling of the discourse parser with the coreference resolver. In particular, starting with a strong baseline neural parser unaware of any coreference information, we compare a parser which utilizes only the output of a neural coreference resolver, with a more sophisticated model, where discourse parsing and coreference resolution are jointly learned in a neural multitask fashion. Results indicate that these initial attempts to incorporate coreference information do not boost the performance of discourse parsing in a statistically significant way.

Figure 1: An example (Asher and Lascarides, 2003) of a discourse being ill-formed due to the invalid anaphoric link. The leaf EDUs are as follows: [Max had a great evening last night.]1 [He had a great meal.]2 [He ate salmon.]3 [He devoured lots of cheese.]4 [He then won a dancing competition.]5 [It was a beautiful pink]6

1 Introduction and Task Description

Discourse parsing is a very useful Natural Language Processing (NLP) task involving predicting and analyzing discourse structures, which represent the coherence properties and relations among constituents of multi-sentential documents. In this work, we investigate discourse parsing in the context of Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), which encodes documents into complete constituency discourse trees. An RST tree is defined on the sequence of a document's EDUs (Elementary Discourse Units), which are clause-like sentences or sentence fragments (propositions), acting as the leaves of the tree. Adjacent EDUs and constituents are hierarchically aggregated to form (possibly non-binary) constituents, with internal nodes containing (1) a nuclearity label, defining the importance of that subtree (rooted at the internal node) in the local context, and (2) a relation label, defining the type of semantic connection between the two subtrees (e.g., Elaboration, Background).

Previous research has shown that the use of RST-style discourse parsing as a system component can enhance important tasks, such as sentiment analysis, summarization and text categorization (Bhatia et al., 2015; Nejat et al., 2017; Hogenboom et al., 2015; Gerani et al., 2014; Ji and Smith, 2017). And more recently, it has been found that RST discourse structures can complement learned contextual embeddings (e.g., BERT (Devlin et al., 2018)) in tasks where linguistic information on complete documents is critical, such as argumentation analysis (Chakrabarty et al., 2019).
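To make the tree structure described above concrete, the following is a minimal sketch (with hypothetical names, not tied to any particular parser implementation) of how such a constituency discourse tree could be represented; the example instance encodes the List constituent over EDUs 3 and 4 from Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch of an RST constituency tree node (illustration only):
# leaves hold EDU text, internal nodes hold a nuclearity label and a
# relation label over their children.
@dataclass
class RSTNode:
    children: List["RSTNode"] = field(default_factory=list)  # empty for leaf EDUs
    edu_text: Optional[str] = None      # set only for leaves
    nuclearity: Optional[str] = None    # e.g. "NN", "NS", "SN" (internal nodes)
    relation: Optional[str] = None      # e.g. "Elaboration", "Background", "List"

    @property
    def is_leaf(self) -> bool:
        return not self.children

# The List constituent spanning EDUs 3-4 of Figure 1:
edu3 = RSTNode(edu_text="He ate salmon.")
edu4 = RSTNode(edu_text="He devoured lots of cheese.")
span_3_4 = RSTNode(children=[edu3, edu4], nuclearity="NN", relation="List")
```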
In this work, we present preliminary results of investigating the benefits of coreference resolution features for RST discourse parsing. From the theoretical perspective, it has long been established (Asher and Lascarides, 2003) that discourse structure can impose constraints on mention antecedent distributions, with these constraints being derived from the role of each discourse unit (sentence or EDU) with respect to the global discourse. The Veins theory (Cristea et al., 1998) is the best-known formalization of anaphoric constraints with respect to RST tree structures, involving assigning to each EDU a subset of preceding EDUs defined by the nuclearity attributes of the EDU's parent nodes in the document's discourse tree (see Appendix A for the exact definition). These constraints act as a domain of referential accessibility where the antecedents must reside, for otherwise the discourse would be considered incoherent. As an example of this phenomenon, consider the discourse structure in Figure 1. While in principle a reader could apply commonsense knowledge to resolve the pronoun it in the last sentence to salmon in the third sentence, any proficient English speaker would call such a discourse ill-formed and incoherent, because it breaks the discourse-imposed antecedent scope. In general, anaphora can only be resolved with respect to the most salient units (sentence 1 in Figure 1) of the preceding discourse (Asher and Lascarides, 2003). For our purposes, this means that having access to a document's coreference structure might be beneficial to the task of predicting the discourse structure, since the coreference structure can constrain the discourse parser's solution space. However, as shown in a corpus study by Zeldes (2017), the antecedent boundaries defined by Veins theory are often too restrictive, suggesting that while discourse structures can be useful for predicting coreference structures and vice versa, these mutual constraints must be defined softly, at least in the context of RST theory.

To explore these ideas computationally with respect to modern neural models, we investigate the utility of automatically extracted coreference features and discourse-coreference shared representations in the context and for the benefit of neural RST discourse parsing. Our strong baseline SpanBERT-NoCoref utilizes SpanBERT (Joshi et al., 2020), as in the current SOTA coreference resolver, without utilizing any direct coreference information. Next, our SpanBERT-CorefFeats considers the output of a coreference resolver as per Dai and Huang (2019), letting us test the benefit of predicted, and so possibly noisy, coreference features. Finally, our more sophisticated SpanBERT-MultiTask model learns discourse parsing together with coreference resolution in the neural multitask learning fashion, sharing the SpanBERT contextual word encoder for both models.
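As a rough illustration of these coupling levels, and not the authors' released implementation, the sketch below shows one way the coreference signal could be attached to a shared SpanBERT encoder: a CorefFeats-style model adds a learned offset to the input embeddings of tokens that a pretrained coreference resolver placed in some cluster, while a MultiTask-style model keeps the encoder shared between a discourse-parsing head and a coreference head so that both losses update it. The checkpoint name, module layout, and injection mechanism are assumptions.

```python
import torch.nn as nn
from transformers import AutoModel  # assumes the HuggingFace SpanBERT checkpoint below

class SharedSpanBERTEncoder(nn.Module):
    """Hypothetical sketch of the coupling levels (NoCoref / CorefFeats / MultiTask)."""

    def __init__(self, model_name="SpanBERT/spanbert-base-cased",
                 use_coref_feats=False, multitask=False):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # CorefFeats: learned offset added to tokens inside predicted coref clusters.
        self.coref_feat = nn.Embedding(2, hidden) if use_coref_feats else None
        # MultiTask: an extra head trained on coreference, sharing the same encoder
        # with the discourse parser (both heads consume the returned hidden states).
        self.coref_head = nn.Linear(hidden, hidden) if multitask else None

    def forward(self, input_ids, attention_mask, in_cluster=None):
        # Start from plain word embeddings so coreference features can be injected
        # before contextualization (position/segment embeddings are added inside).
        embeds = self.encoder.embeddings.word_embeddings(input_ids)
        if self.coref_feat is not None and in_cluster is not None:
            embeds = embeds + self.coref_feat(in_cluster)  # in_cluster: 0/1 per token
        out = self.encoder(inputs_embeds=embeds, attention_mask=attention_mask)
        return out.last_hidden_state  # shared token representations for all heads
```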
2 Related Work

Dai and Huang (2019) have already explored the benefit of using coreference information for neural PDTB implicit discourse relation classification, in a way similar to our SpanBERT-CorefFeats model. In our study, we also explore the use of a shared encoder architecture for both tasks to detect possible additional synergy.

Model-wise, the most common approach to infer discourse trees is the linear bottom-up shift-reduce method, adopted from syntactic parsing. Wang et al. (2017) use hand-crafted features in a shift-reduce parser with two separate Support Vector Machines (SVMs), one for structure and nuclearity prediction and one for relation estimation. The neural model by Yu et al. (2018) uses a similar topology, but instead relies entirely on LSTMs for automatic feature extraction and on a single multi-layer perceptron (MLP) for classifying all possible actions. Top-down approaches to discourse parsing are also quite promising, with the recent work of Kobayashi et al. (2020) applying ELMo (Peters et al., 2018) for computing span representations and achieving the new absolute SOTA performance, although they report the scores of an ensemble of five independent runs of their proposed model instead of single-model results. In this work we follow the shift-reduce strategy and apply SpanBERT-Base (Joshi et al., 2020; Wolf et al., 2020), which we introduce below, for encoding the document contents.

The field of coreference resolution has recently been dominated by deep learning models. The current SOTA model by Joshi et al. (2020) is built upon the neural coreference resolver of Lee et al. (2018) by incorporating the SpanBERT language model, which modifies the commonly used BERT (Devlin et al., 2019) architecture with a novel span masking pretraining objective. In our work, we re-implemented their coreference resolver in PyTorch (Paszke et al., 2019). Our code for both models is available at http://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/index.html.

3 Shift-Reduce Architecture

All our proposed parsers share the same basic shift-reduce architecture, consisting of a Queue, which is initially filled with the document's EDUs in order from first to last, and a Stack, which is initially empty, as well as the following actions on them:

The Shift action delays aggregation of subtrees at the beginning of the document by popping the top EDU Q1 off the queue and pushing it onto the stack.

The Reduce-X action aggregates the top subtrees (S1, S2) on the stack into a single subtree (S1-2). Each reduce action further defines a nuclearity assignment X_N ∈ {NN, NS, SN} for the nodes covered by S1-2 and a relation X_R ∈ {Elaboration, Contrast, ...} holding between them.

Figure 2: Overview of our models. For spans S2 and S1, w_{i:i+2} and w_{j:j+1} respectively are the nuclear EDUs. (Left) All components of SpanBERT-NoCoref. (Middle) SpanBERT-CorefFeats modifies the initial SpanBERT embeddings according to predicted coreference clusters (in red). (Right) SpanBERT-MultiTask updates the final span representations with embeddings of mentions of shared entities.

3.1 Action Classifier Parametrization

The candidate spans (such as S1 and S2 in Figure 2) are encoded with SpanBERT into span representations, which are combined with embeddings for organizational features. The first classifier predicts the action and nuclearity assignment among y_{Act,Nuc} ∈ {Shift, Reduce_NN, Reduce_NS, Reduce_SN}, and in case a Reduce action is chosen, the second classifier predicts the discourse relation among 18 coarse-grained RST relation classes y_Rel ∈ {Attribution, Elaboration, ...}.

3.2 Action Classifier Training and Inference

Both classifiers are trained using the cross-entropy loss, computed for each stack-queue parsing step. At test time, we apply the greedy decoding strategy to predict the discourse structure.
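To make the parsing procedure concrete, the following is a minimal sketch of the greedy shift-reduce decoding described above, with hypothetical helper names (parse, action_clf, relation_clf) rather than the released implementation; action_clf is assumed to choose among the allowed Shift/Reduce actions for the current stack-queue state, and relation_clf to assign one of the 18 coarse-grained relation labels.

```python
from collections import deque

# Minimal sketch of greedy shift-reduce decoding over a document's EDUs.
def parse(edus, action_clf, relation_clf):
    queue = deque(edus)   # document EDUs, first to last
    stack = []            # partially built subtrees
    while queue or len(stack) > 1:
        if len(stack) < 2:
            action = "Shift"                              # forced: nothing to reduce
        else:
            allowed = ["Reduce-NN", "Reduce-NS", "Reduce-SN"]
            if queue:
                allowed.append("Shift")                   # shifting needs a non-empty queue
            action = action_clf(stack, queue, allowed)    # greedy: take the best action
        if action == "Shift":
            stack.append(queue.popleft())                 # push Q1 onto the stack
        else:
            relation = relation_clf(stack, queue)         # e.g. "Elaboration", "Contrast"
            s1, s2 = stack.pop(), stack.pop()             # top two subtrees S1, S2
            stack.append({"children": [s2, s1],
                          "nuclearity": action.split("-")[1],  # NN / NS / SN
                          "relation": relation})
    return stack[0]  # root of the predicted discourse tree
```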
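The two classification heads and their per-step training objective (Sections 3.1-3.2) could be sketched as follows; the module names, dimensions, and the way the stack-queue state is vectorized are assumptions, not the paper's exact parametrization.

```python
import torch.nn as nn

# Sketch of the two classification heads and their per-step cross-entropy loss.
ACTIONS = ["Shift", "Reduce-NN", "Reduce-NS", "Reduce-SN"]
N_RELATIONS = 18  # coarse-grained RST relation classes (Attribution, Elaboration, ...)

class ActionClassifiers(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.action_head = nn.Linear(state_dim, len(ACTIONS))    # action + nuclearity
        self.relation_head = nn.Linear(state_dim, N_RELATIONS)   # relation (Reduce steps only)
        self.ce = nn.CrossEntropyLoss()

    def step_loss(self, state_vec, gold_action, gold_relation=None):
        # state_vec: (batch, state_dim) encoding of the current stack-queue configuration,
        # e.g. SpanBERT span representations combined with organizational-feature embeddings.
        loss = self.ce(self.action_head(state_vec), gold_action)
        if gold_relation is not None:   # only Reduce steps carry a relation label
            loss = loss + self.ce(self.relation_head(state_vec), gold_relation)
        return loss
```

At test time, greedy decoding simply takes the argmax of the action head at every step, and of the relation head whenever a Reduce action is chosen.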