Arxiv:1904.10419V2 [Cs.CL] 30 Aug 2019 (E.G
Total Page:16
File Type:pdf, Size:1020Kb
GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection Yue Yu Yilun Zhu Yang Liu Computer Science Linguistics Linguistics Georgetown University Georgetown University Georgetown University Yan Liu Siyao Peng Mackenzie Gong Amir Zeldes Analytics Linguistics CCT Linguistics Georgetown University Georgetown University Georgetown University Georgetown University fyy476,yz565,yl879,yl1023,sp1184,mg1745,[email protected] Abstract However, as recent work (Braud et al., 2017b) has shown, performance on smaller or less homo- In this paper we present GumDrop, George- town University’s entry at the DISRPT 2019 geneous corpora than RST-DT, and especially in Shared Task on automatic discourse unit seg- the absence of gold syntax trees (which are real- mentation and connective detection. Our ap- istically unavailable at test time for practical ap- proach relies on model stacking, creating a plications), hovers around the mid 80s, making it heterogeneous ensemble of classifiers, which problematic for full discourse parsing in practice. feed into a metalearner for each final task. The This is more critical for languages and domains in system encompasses three trainable compo- which relatively small datasets are available, mak- nent stacks: one for sentence splitting, one for ing the application of generic neural models less discourse unit segmentation and one for con- nective detection. The flexibility of each en- promising. semble allows the system to generalize well to The DISRPT 2019 Shared Task aims to identify datasets of different sizes and with varying lev- spans associated with discourse relations in data els of homogeneity. from three formalisms: RST (Mann and Thomp- son, 1988), SDRT (Asher, 1993) and PDTB 1 Introduction (Prasad et al., 2014). The targeted task varies ac- Although discourse unit segmentation and con- toss frameworks: Since RST and SDRT segment nective detection are crucial for higher level shal- texts into spans covering the entire document, the low and deep discourse parsing tasks, recent years corresponding task is to predict the starting point have seen more progress in work on the lat- of new discourse units. In the PDTB framework, ter tasks than on predicting underlying segments, the basic locus identifying explicit discourse rela- such as Elementary Discourse Units (EDUs). As tions is the spans of discourse connectives which the most recent overview on parsing in the frame- need to be identified among other words. In total, work of Rhetorical Structure Theory (RST, Mann 15 corpora (10 from RST data, 3 from PDTB-style and Thompson 1988) points out (Morey et al., data, and 2 from SDRT) in 10 languages (Basque, 2017, 1322) “all the parsers in our sample ex- Chinese, Dutch, English, French, German, Por- cept [two] predict binary trees over manually seg- tuguese, Russian, Spanish, and Turkish) are used mented EDUs”. Recent discourse parsing papers as the input data for the task. The heterogeneity of arXiv:1904.10419v2 [cs.CL] 30 Aug 2019 (e.g. Li et al. 2016, Braud et al. 2017a) have fo- the frameworks, languages and even the size of the cused on complex discourse unit span accuracy training datasets all render the shared task chal- above the level of EDUs, attachment accuracy, lenging: training datasets range from the smallest and relation classification accuracy. This is due in Chinese RST corpus of 8,960 tokens to the largest part to the difficulty in comparing systems when English PDTB dataset of 1,061,222 tokens, and the underlying segmentation is not identical (see all datasets have some different guidelines. In this Marcu et al. 1999), but also because of a rela- paper, we therefore focus on creating an architec- tively stable SOA accuracy of EDU segmentation ture that is not only tailored to resources like RST- as evaluated on the largest RST corpus, the En- DT, and takes into account the crucial importance glish RST Discourse Treebank (RST-DT, Carlson of high accuracy sentence splitting for real-world et al. 2003), which already exceeded 90% accu- data, generalizing well to different guidelines and racy in 2010 (Hernault et al., 2010). datasets. Our system, called GumDrop, relies on model on the right. More recently, Braud et al.(2017b) stacking (Wolpert, 1992), which has been success- used a bi-LSTM-CRF sequence labeling approach fully applied to a number of complex NLP prob- on dependency parses, with words, POS tags, de- lems (e.g. Clark and Manning 2015, Friedrichs pendency relations and the same features for each et al. 2017). The system uses a range of differ- word’s parent and grand-parent tokens, as well as ent rule-based and machine learning approaches the direction of attachment (left or right), achiev- whose predictions are all fed to a ‘metalearner’ or ing F-scores of .89 on segmenting RST-DT with blender classifier, thus benefiting from both neu- parser-predicted syntax, and scores in the 80s, near ral models where appropriate, and strong rule- or above previous SOA results, for a number of based baselines coupled with simpler classifiers other corpora and languages. for smaller datasets. A further motivation for our By contrast, comparatively little work has ap- model stacking approach is curricular: the system proached discourse connective detection as a sep- was developed as a graduate seminar project in arate task, as it is usually employed as an interme- the course LING-765 (Computational Discourse diate step for predicting discourse relations. Pitler Modeling), and separating work into many sub- and Nenkova(2009) used a Max Entropy classi- modules allowed each contributor to work on a fier using a set of syntactic features extracted from separate sub-project, all of which are combined the gold standard Penn Treebank (Marcus et al., in the complete system as an ensemble. The sys- 1993) parses of PDTB (Prasad et al., 2008) arti- tem was built by six graduate students and the in- cles, such as the highest node which dominates structor, with each student focusing on one mod- exactly and only the words in the connective, the ule (notwithstanding occasional collaborations) in category of the immediate parent of that phrase, two phases: work on a high-accuracy ensemble and the syntactic category of the sibling imme- sentence splitter for the automatic parsing scenario diately to the left/right of the same phrase. Pat- (see Section 3.2), followed by the development of terson and Kehler(2013) presented a logistic re- a discourse unit segmenter or connective detection gression model trained on eight relation types ex- module (Sections 3.3 and 3.4). tracted from PDTB, with features in three cate- gories: Relation-level features such as the con- 2 Previous Work nective signaling the relation, attribution status of the relation, and its relevance to financial in- Following early work on rule-based segmenters formation; Argument-level features, capturing the (e.g. Marcu 2000, Thanh et al. 2004), Soricut and size or complexity of each of its two arguments; Marcu(2003) used a simple probabilistic model and Discourse-level features, which incorporate conditioning on lexicalized constituent trees, by the dependencies between the relation in question using the highest node above each word that has and its neighboring relations in the text. a right-hand sibling, as well as its children. Like Polepalli Ramesh et al.(2012) used SVM our approach, this and subsequent work below and CRF for identifying discourse connectives in perform EDU segmentation as a token-wise bi- biomedical texts. The Biomedical Discourse Re- nary classification task (boundary/no-boundary). lation Bank (Prasad et al., 2011) and PDTB were In a more complex model, Sporleder and Lapata used for in-domain classifiers and novel domain (2005) used a two-level stacked boosting classifier adaptation respectively. Features included POS on syntactic chunks, POS tags, token and sentence tags, the dependency label of tokens’ immediate lengths, and token positions within clauses, all of parents in a parse tree, and the POS of the left which are similar to or subsumed by some of our neighbor; domain-specific semantic features in- features below. They additionally used the list of cluded several biomedical gene/species taggers, in English connectives from Knott(1996) to identify addition to NER features predicted by ABNER (A connective tokens. Biomedical Named Entity Recognition). Hernault et al.(2010) used an SVM model with features corresponding to token and POS trigrams 3 GumDrop at and preceding a potential segmentation point, as Our system is organized around three ensembles well as features encoding the lexical head of each which implement model stacking. token’s parent phrase in a phrase structure syn- tax tree and the same features for the sibling node 1. A trainable sentencer ensemble which feeds Ensemble Sentencer Syntax Parsing Freq Meta-learner UDPipe RNN BIIIOOOOB Corpora without Meta-learner OOOBOOO gold syntax UDPipe BIIOOBO… NLTK DNN Wiki Ensemble Conn-Detector LR Ensemble Segmenter 100000010 Meta-learner 000001000 00010000… Subtree BOW RNN Corpora with gold syntax Figure 1: System architecture. The raw text from corpora without gold syntax is first split into sentences by the ensemble sentencer. Sentences are then parsed using UDPipe. Corpora with predicted or gold syntax can then be utilized for discourse unit segmentation and connective detection. an off-the-shelf dependency parser ules using those features for sentence splitting and EDU segmentation/connective detection. Features 2. A discourse unit segmenter ensemble, oper- derived from syntax trees are not available for sen- ating on either gold or predicted sentences tence splitting, though automatic POS tagging us- 3. A connective detector ensemble, also using ing the TreeTagger (Schmid, 1994) was used as a gold or predicted sentences feature for this task, due to its speed and good ac- Each module consists of several distinct sub- curacy in the absence of sentence splits.