Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Implicit Discourse Relation Classification via Multi-Task Neural Networks

Yang Liu1, Sujian Li1,2, Xiaodong Zhang1 and Zhifang Sui1,2
1 Key Laboratory of Computational Linguistics, Peking University, MOE, China
2 Collaborative Innovation Center for Language Ability, Xuzhou, Jiangsu, China
{cs-ly, lisujian, zxdcs, szf}@pku.edu.cn

Abstract

Without discourse connectives, classifying implicit discourse relations is a challenging task and a bottleneck for building a practical discourse parser. Previous research usually makes use of one kind of discourse framework, such as PDTB or RST, to improve classification performance on discourse relations. Actually, under different discourse annotation frameworks there exist multiple corpora which have internal connections. To exploit the combination of different discourse corpora, we design related discourse classification tasks specific to each corpus, and propose a novel Convolutional Neural Network embedded multi-task learning system to synthesize these tasks by learning both unique and shared representations for each task. Experimental results on the PDTB implicit discourse relation classification task demonstrate that our model achieves significant gains over baseline systems.

Introduction

Discourse relations (e.g., contrast and causality) support a set of sentences in forming a coherent text. Automatically identifying these discourse relations can help many downstream NLP tasks such as question answering and automatic summarization.

Under certain circumstances, these relations appear in the form of explicit markers like "but" or "because", which are relatively easy to identify. Prior work (Pitler, Louis, and Nenkova 2009) shows that where explicit markers exist, relation types can be disambiguated with F1 scores higher than 90%. However, without an explicit marker to rely on, classifying the implicit discourse relations is much more difficult. The fact that these implicit relations outnumber the explicit ones in naturally occurring text makes the classification of their types a key challenge in discourse analysis.

The major line of research approaches the implicit relation classification problem by extracting informed features from the corpus and designing machine learning algorithms (Pitler, Louis, and Nenkova 2009; Lin, Kan, and Ng 2009; Louis et al. 2010). An obvious challenge in classifying discourse relations is deciding which features are appropriate for representing the sentence pairs. Intuitively, the word pairs occurring in the sentence pairs are useful, since they can to some extent represent semantic relationships between the two sentences (for example, word pairs appearing around the contrast relation often tend to be antonyms). Earlier studies found that word pairs do help in classifying discourse relations, although, strangely, most of the useful word pairs are composed of stopwords. Rutherford and Xue (2014) point out that this counter-intuitive phenomenon is caused by the sparse nature of these word-pair features. They employ Brown clusters as an alternative, more abstract word representation, and as a result obtain more intuitive cluster pairs and achieve better performance.
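To make the sparsity argument concrete, the following minimal Python sketch (ours, not from the paper) contrasts surface word-pair features with Brown-cluster-pair features; the small brown mapping is a hypothetical stand-in for a clustering learned from a large corpus.

```python
from itertools import product

def word_pair_features(arg1_tokens, arg2_tokens):
    # Cross-product word pairs between the two arguments. Each (w1, w2)
    # pair is one sparse binary feature, so the feature space grows with
    # the square of the vocabulary and frequent stopword pairs dominate.
    return {f"{w1}|{w2}" for w1, w2 in product(arg1_tokens, arg2_tokens)}

def cluster_pair_features(arg1_tokens, arg2_tokens, cluster_of):
    # The same cross-product over Brown-cluster ids instead of surface
    # words, which collapses related words and densifies the features.
    ids1 = [cluster_of.get(w, "UNK") for w in arg1_tokens]
    ids2 = [cluster_of.get(w, "UNK") for w in arg2_tokens]
    return {f"{c1}|{c2}" for c1, c2 in product(ids1, ids2)}

# Hypothetical cluster ids; a real mapping would come from Brown clustering.
brown = {"hot": "0110", "warm": "0110", "cold": "0111", "cool": "0111"}
print(word_pair_features(["hot", "day"], ["cold", "night"]))
print(cluster_pair_features(["hot", "day"], ["cold", "night"], brown))
```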
Another problem in discourse parsing is the coexistence of different discourse annotation frameworks, under which different kinds of corpora and tasks are created. The well-known discourse corpora include the Penn Discourse TreeBank (PDTB) (Prasad et al. 2007) and the Rhetorical Structure Theory Discourse Treebank (RST-DT) (Mann and Thompson 1988). Due to the annotation complexity, neither corpus is very large. Further, corpora under different annotation frameworks are usually used separately in discourse relation classification, which is also a main reason for the sparsity in this task. However, these different annotation frameworks have strong internal connections. For example, both the Elaboration and Joint relations in RST-DT have a similar sense to the Expansion relation in PDTB. Based on this, we consider designing multiple discourse analysis tasks according to these frameworks and synthesizing these tasks with the goal of classifying the implicit discourse relations, finding more precise representations for the sentence pairs. The work that most inspired us is that of Lan, Xu, and Niu (2013), who regard implicit and explicit relation classification in the PDTB framework as two tasks and design a multi-task learning method to obtain higher performance.

In this paper, we propose a more general multi-task learning system for implicit discourse relation classification that synthesizes the discourse analysis tasks of different corpora. To represent the sentence pairs, we construct convolutional neural networks (CNNs) to derive their vector representations in a low-dimensional latent space, replacing the sparse lexical features. To combine the different discourse analysis tasks, we further embed the CNNs into a multi-task neural network and learn both unique and shared representations for the sentence pairs in the different tasks, which can reflect the differences and connections among these tasks. With our multi-task neural network, multiple discourse tasks are trained simultaneously and optimize each other through their connections.

Prerequisite

As stated above, to improve implicit discourse relation classification we make full use of the combination of different discourse corpora. In our work, we choose three kinds of discourse corpora: PDTB, RST-DT, and natural text with discourse connective words. In this section, we briefly introduce these corpora.

PDTB The Penn Discourse Treebank (PDTB) (Prasad et al. 2007), which is known as the largest discourse corpus, is composed of 2159 Wall Street Journal articles. PDTB adopts a predicate-argument structure, where the predicate is the discourse connective (e.g., "while") and the arguments are the two text spans around the connective. In PDTB, a relation is explicit if an explicit discourse connective is present in the text; otherwise, it is implicit. All PDTB relations are hierarchically organized into 4 top-level classes (Expansion, Comparison, Contingency, and Temporal) and can be further divided into 16 types and 23 subtypes. In our work, we mainly experiment on the 4 top-level classes, as in previous work (Lin, Kan, and Ng 2009).

RST-DT RST-DT is based on the Rhetorical Structure Theory (RST) proposed by Mann and Thompson (1988) and is composed of 385 articles. In this corpus, a text is represented as a discourse tree whose leaves are non-overlapping text spans called elementary discourse units (EDUs). Since we mainly focus on discourse relation classification, we make use of the discourse dependency structure (Li et al. 2014) converted from the tree structures and extract the EDU pairs with labeled rhetorical relations between them. In RST-DT, all relations are classified into 18 classes. We choose the 12 most frequent classes and obtain 19,681 relations.

Raw Text with Connective Words There exists a large amount of raw text containing connective words. As we know, these connective words serve as a natural means of connecting text spans. Thus, raw text with connective words is somewhat similar to the explicit discourse relations in PDTB, albeit without expert judgment, and can also be used as a special discourse corpus. In our work, we adopt the New York Times (NYT) Corpus (Sandhaus 2008), with over 1.8 million articles.

Multi-Task Neural Network for Discourse Parsing

Motivation and Overview

Different discourse corpora are closely related, even though they follow different annotation theories. In Table 1, we list some instances which have similar discourse relations in nature but are annotated differently in different corpora. The second row belongs to the Elaboration relation in RST-DT. The third and fourth rows are both Expansion relations in PDTB: one is implicit and the other is explicit with the connective "in particular". The fifth row is from the NYT corpus and directly uses the word "particularly" to denote the discourse relation between two sentences. From these instances, we can see that they all reflect a similar discourse relation, in which the second argument gives more details about the first. It is intuitive that classification performance on such instances can be mutually boosted if we synthesize them appropriately. With this idea, we propose to adopt multi-task learning and design a specific discourse analysis task for each corpus.

According to the principle of multi-task learning, the more related the tasks are, the more powerful the multi-task learning method will be. Based on this, we design four discourse relation classification tasks:

Task 1: Implicit PDTB Discourse Relation Classification
Task 2: Explicit PDTB Discourse Relation Classification
Task 3: RST-DT Discourse Relation Classification
Task 4: Connective Word Classification

The first two tasks classify the relation between two arguments in the PDTB framework. The third task predicts the relation between two EDUs using our processed RST-DT corpus. The last one is designed to predict a correct connective word for a sentence pair using the NYT corpus. We define the task of classifying implicit PDTB relations as our main task and the other tasks as auxiliary tasks; that is, our system focuses on learning from the other tasks to improve the performance of the main task. Note that, for convenience, we refer to the two text spans in all tasks as arguments.

Next, we introduce how we tackle these tasks. In our work, we propose to use convolutional neural networks (CNNs) to represent the argument pairs. We then embed the CNNs into a multi-task neural network (MTNN), which can learn the shared and unique properties of all the tasks.
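As an illustration of the shared/unique split and of alternating training over the four tasks, here is a minimal PyTorch sketch of ours. It is not the paper's exact architecture: the layer sizes, the number of connective labels (set to 100 here), the random batching, and the simple concatenation of shared and private features are all illustrative assumptions; the argument-pair vectors would in practice come from the CNNs described in the next subsection.

```python
import random
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Shared pair encoder plus one private projection and one softmax
    classifier per task (a common shared/private MTL layout; sketch only)."""
    def __init__(self, input_dim, shared_dim, private_dim, task_classes):
        super().__init__()
        self.shared = nn.Linear(input_dim, shared_dim)    # tied across tasks
        self.private = nn.ModuleDict(                     # unique per task
            {t: nn.Linear(input_dim, private_dim) for t in task_classes})
        self.heads = nn.ModuleDict(
            {t: nn.Linear(shared_dim + private_dim, n)
             for t, n in task_classes.items()})

    def forward(self, task, pair_vec):
        h = torch.cat([torch.tanh(self.shared(pair_vec)),
                       torch.tanh(self.private[task](pair_vec))], dim=-1)
        return self.heads[task](h)

# Assumed label counts: 4 top-level PDTB classes, 12 RST-DT classes,
# and a hypothetical inventory of 100 connective words.
tasks = {"pdtb_implicit": 4, "pdtb_explicit": 4, "rst": 12, "connective": 100}
model = MultiTaskModel(input_dim=200, shared_dim=128, private_dim=64,
                       task_classes=tasks)
opt = torch.optim.Adagrad(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def get_batch(task):
    # Stand-in for real corpus batching: random pair vectors and labels.
    x = torch.randn(32, 200)
    y = torch.randint(0, tasks[task], (32,))
    return x, y

for step in range(1000):
    task = random.choice(list(tasks))  # alternate among the four tasks
    x, y = get_batch(task)
    loss = loss_fn(model(task, x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Sampling tasks uniformly at random is one simple scheduling choice; weighting batches toward the main implicit-relation task is a common alternative.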
CNNs: Modeling Argument Pairs

Figure 1 illustrates our proposed method of modeling the argument pairs.
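Since Figure 1 itself is not reproduced here, the following minimal PyTorch sketch (our illustration, with assumed vocabulary size, dimensions, and a single filter width) shows the general shape of such a model: each argument is encoded by a convolution with max-pooling over word embeddings, and the two argument vectors are concatenated into a pair representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArgumentPairCNN(nn.Module):
    """Encode each argument with convolution + max-pooling over word
    embeddings, then concatenate the two argument vectors."""
    def __init__(self, vocab_size, emb_dim=50, n_filters=100, width=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=width,
                              padding=width // 2)

    def encode(self, token_ids):                  # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)   # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))              # (batch, n_filters, seq_len)
        return F.max_pool1d(h, h.size(2)).squeeze(2)  # (batch, n_filters)

    def forward(self, arg1_ids, arg2_ids):
        return torch.cat([self.encode(arg1_ids),
                          self.encode(arg2_ids)], dim=1)

enc = ArgumentPairCNN(vocab_size=10000)
a1 = torch.randint(1, 10000, (8, 20))  # 8 argument pairs, 20 tokens each
a2 = torch.randint(1, 10000, (8, 25))
print(enc(a1, a2).shape)               # torch.Size([8, 200])
```

The concatenated vector would then serve as the low-dimensional pair representation fed to the shared and task-specific layers sketched earlier.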