Deep Semantic Role Labeling with Self-Attention

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) Deep Semantic Role Labeling with Self-Attention Zhixing Tan,1 Mingxuan Wang,2 Jun Xie,2 Yidong Chen,1 Xiaodong Shi1∗ 1School of Information Science and Engineering, Xiamen University, Xiamen, China 2Mobile Internet Group, Tencent Technology Co., Ltd, Beijing, China [email protected], {xuanswang, stiffxie}@tencent.com, {ydchen, mandel}@xmu.edu.cn Abstract achieved the state-of-the-art results. He et al., (2017) re- ported further improvements by using deep highway bidirec- Semantic Role Labeling (SRL) is believed to be a crucial step tional LSTMs with constrained decoding. These successes towards natural language understanding and has been widely involving end-to-end models reveal the potential ability of studied. Recent years, end-to-end SRL with recurrent neu- LSTMs for handling the underlying syntactic structure of ral networks (RNN) has gained increasing attention. How- ever, it remains a major challenge for RNNs to handle struc- the sentences. tural information and long range dependencies. In this pa- Despite recent successes, these RNN-based models have per, we present a simple and effective architecture for SRL limitations. RNNs treat each sentence as a sequence of which aims to address these problems. Our model is based words and recursively compose each word with its previ- on self-attention which can directly capture the relationships ous hidden state. The recurrent connections make RNNs between two tokens regardless of their distance. Our sin- applicable for sequential prediction tasks with arbitrary gle model achieves F1 =83.4 on the CoNLL-2005 shared length, however, there still remain several challenges in =82.7 task dataset and F1 on the CoNLL-2012 shared task practice. The first one is related to memory compression dataset, which outperforms the previous state-of-the-art re- problem (Cheng, Dong, and Lapata 2016). As the entire his- sults by 1.8 and 1.0 F1 score respectively. Besides, our model is computationally efficient, and the parsing speed is 50K to- tory is encoded into a single fixed-size vector, the model kens per second on a single Titan X GPU. requires larger memory capacity to store information for longer sentences. The unbalanced way of dealing with sequential information leads the network performing poorly Introduction on long sentences while wasting memory on shorter ones. The second one is concerned with the inherent structure of Semantic Role Labeling is a shallow semantic parsing task, sentences. RNNs lack a way to tackle the tree-structure of whose goal is to determine essentially “who did what to the inputs. The sequential way to process the inputs remains whom”, “when” and “where”. Semantic roles indicate the the network depth-in-time, and the number of nonlinearities basic event properties and relations among relevant entities depends on the time-steps. in the sentence and provide an intermediate level of semantic To address these problems above, we present a deep atten- representation thus benefiting many NLP applications, such tional neural network (DEEPATT) for the task of SRL1. Our as Information Extraction (Bastianelli et al. 2013), Question models rely on the self-attention mechanism which directly Answering (Surdeanu et al. 2003; Moschitti, Morarescu, and draws the global dependencies of the inputs. In contrast to Harabagiu 2003; Dan and Lapata 2007), Machine Trans- RNNs, a major advantage of self-attention is that it con- lation (Knight and Luk 1994; Ueffing, Haffari, and Sarkar ducts direct connections between two arbitrary tokens in a 2007; Wu and Fung 2009) and Multi-document Abstractive sentence. Therefore, distant elements can interact with each Summarization (Genest and Lapalme 2011). other by shorter paths (O(1) v.s. O(n)), which allows unim- Semantic roles are closely related to syntax. Therefore, peded information flow through the network. Self-attention traditional SRL approaches rely heavily on the syntactic also provides a more flexible way to select, represent and structure of a sentence, which brings intrinsic complex- synthesize the information of the inputs and is complemen- ity and restrains these systems to be domain specific. Re- tary to RNN based models. Along with self-attention, DEEP- cently, end-to-end models for SRL without syntactic inputs ATT comes with three variants which uses recurrent (RNN), achieved promising results on this task (Zhou and Xu 2015; convolutional (CNN) and feed-forward (FFN) neural net- Marcheggiani, Frolov, and Titov 2017; He et al. 2017). work to further enhance the representations. As the pioneering work, Zhou and Xu (2015) introduced Although DEEPATT is fairly simple, it gives remarkable a stacked long short-term memory network (LSTM) and empirical results. Our single model outperforms the previ- ∗ Corresponding author. Copyright c 2018, Association for the Advancement of Artificial 1Our source code is available at https://github.com/XMUNLP/ Intelligence (www.aaai.org). All rights reserved. Tagger 4929 P P P P ous state-of-the-art systems on the CoNLL-2005 shared task (BARG0) (IARG0) (BV) (BARG1) dataset and the CoNLL-2012 shared task dataset by 1.8 and 1.0 1 F score respectively. It is also worth mentioning that Softmax on the out-of-domain dataset, we achieve an improvement upon the previous end-to-end approach (He et al. 2017) by Attention 2.0 F1 score. The feed-forward variant of DEEPATT allows Sub-Layer Self-Attention significantly more parallelization, and the parsing speed is 50K tokens per second on a single Titan X GPU. Nonlinear Sub-Layer RNN/CNN/FNN Semantic Role Labeling Given a sentence, the goal of SRL is to identify and clas- Repeat ... ... ... ... sify the arguments of each target verb into semantic roles. N − 2 times For example, for the sentence “Marry borrowed a book from Attention John last week.” and the target verb borrowed, SRL yields Sub-Layer Self-Attention the following outputs: [ARG0 Marry ][V borrowed ][ARG1 a book ] Nonlinear Sub-Layer RNN/CNN/FNN [ARG2 from John ][AM-TMP last week ]. Here ARG0 represents the borrower, ARG1 represents Word & the thing borrowed, ARG2 represents the entity borrowed Predicate from, AM-TMP is an adjunct indicating the timing of the action and V represents the verb. The 0 cats 0 love 1 hats 0 Generally, semantic role labeling consists of two steps: identifying and classifying arguments. The former step Figure 1: An illustration of our deep attentional neural involves assigning either a semantic argument or non- network. Original utterances and corresponding predicate argument for a given predicate, while the latter includes la- masks are taken as the only inputs for our deep model. For beling a specific semantic role for the identified argument. example, love is the predicate and marked as 1, while other It is also common to prune obvious non-candidates before words are marked as 0. the first step and to apply post-processing procedure to fix inconsistent predictions after the second step. Finally, a dy- namic programming algorithm is often applied to find the compute its representation. Self-attention has been success- global optimum solution for this typical sequence labeling fully applied to many tasks, including reading comprehen- problem at the inference stage. sion, abstractive summarization, textual entailment, learning In this paper, we treat SRL as a BIO tagging problem. Our task-independent sentence representations, machine transla- approach is extremely simple. As illustrated in Figure 1, the tion and language understanding (Cheng, Dong, and Lapata original utterances and the corresponding predicate masks 2016; Parikh et al. 2016; Lin et al. 2017; Paulus, Xiong, and are first projected into real-value vectors, namely embed- Socher 2017; Vaswani et al. 2017; Shen et al. 2017). dings, which are fed to the next layer. After that, we design In this paper, we adopt the multi-head attention formula- a deep attentional neural network which takes the embed- tion by Vaswani et al. (2017). Figure 2 depicts the compu- dings as the inputs to capture the nested structures of the tation graph of multi-head attention mechanism. The cen- sentence and the latent dependency relationships among the ter of the graph is the scaled dot-product attention, which is labels. On the inference stage, only the topmost outputs of a variant of dot-product (multiplicative) attention (Luong, attention sub-layer are taken to a logistic regression layer to Pham, and Manning 2015). Compared with the standard make the final decision 2. additive attention mechanism (Bahdanau, Cho, and Bengio 2014) which is implemented using a one layer feed-forward Deep Attentional Neural Network for SRL neural network, the dot-product attention utilizes matrix pro- duction which allows faster computation. Given a matrix of In this section, we will describe DEEPATT in detail. The n Q ∈ Rn×d K ∈ Rn×d N query vectors ,keys and values main component of our deep network consists of iden- V ∈ Rn×d, the scaled dot-product attention computes the tical layers. Each layer contains a nonlinear sub-layer fol- attention scores based on the following equation: lowed by an attentional sub-layer. The topmost layer is the softmax classification layer. QKT Attention(Q, K, V)=softmax( √ )V (1) d Self-Attention d Self-attention or intra-attention, is a special case of atten- where is the number of hidden units of our network. tion mechanism that only requires a single sequence to The multi-head attention mechanism first maps the matrix of input vectors X ∈ Rt×d to queries, keys and values ma- 2In case of BIO violations, we simply treat the argument of the trices by using different linear projections. Then h parallel B tags as the argument of the whole span. heads are employed to focus on different part of channels 4930 Q Softmax Linear Split Matmul K X Linear Split Matmul Concat Linear Y V Linear Split h parallel heads Figure 2: The computation graph of multi-head self-attention mechanism.

Load more