The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Syntax-Aware Neural Semantic Role Labeling∗

Qingrong Xia,1 Zhenghua Li,1 Min Zhang,1 Meishan Zhang,2 Guohong Fu,2 Rui Wang,3 Luo Si3
1 Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University, China
2 School of Computer Science and Technology, Heilongjiang University, China
3 Alibaba Group, China
[email protected], {zhli13, minzhang}@suda.edu.cn, [email protected], [email protected], {masi.wr, luo.si}@alibaba-inc.com

Abstract

Semantic role labeling (SRL), also known as shallow semantic parsing, is an important yet challenging task in NLP. Motivated by the close correlation between syntactic and semantic structures, traditional discrete-feature-based SRL approaches make heavy use of syntactic features. In contrast, deep-neural-network-based approaches usually encode the input sentence as a word sequence without considering the syntactic structures. In this work, we investigate several previous approaches for encoding syntactic trees, and make a thorough study on whether extra syntax-aware representations are beneficial for neural SRL models. Experiments on the benchmark CoNLL-2005 dataset show that syntax-aware SRL approaches can effectively improve performance over a strong baseline with external word representations from ELMo. With the extra syntax-aware representations, our approaches achieve new state-of-the-art 85.6 F1 (single model) and 86.6 F1 (ensemble) on the test data, outperforming the corresponding strong baselines with ELMo by 0.8 and 1.0, respectively. Detailed error analyses are conducted to gain more insights on the investigated approaches.

[Figure 1: Example of dependency and SRL structures. The sentence "Ms. Haag plays Elianti ." is shown with its dependency tree (arcs labeled root, nn, nsubj, dobj, punct) and its SRL labels: A0 for "Ms. Haag", V for "plays", A1 for "Elianti", O for the period.]

Introduction

Semantic role labeling (SRL), also known as shallow semantic parsing, is an important yet challenging task in NLP. Given an input sentence and one or more predicates, SRL aims to determine the semantic roles of each predicate, i.e., who did what to whom, when and where, etc. Semantic knowledge has been proved informative in many downstream NLP applications, such as question answering (Shen and Lapata 2007; Wang et al. 2015), text summarization (Genest and Lapalme 2011; Khan, Salim, and Jaya Kumar 2015), and machine translation (Liu and Gildea 2010; Gao and Vogel 2011).

Depending on how the semantic roles are defined, there are two forms of SRL in the community. The span-based SRL follows the manual annotations in the PropBank (Palmer, Gildea, and Kingsbury 2005) and NomBank (Meyers et al. 2004) and uses a continuous word span as a semantic role. In contrast, the dependency-based SRL fills a role with a single word, which is usually the syntactic or semantic head of the manually annotated span (Surdeanu et al. 2008).

This work follows the span-based formulation. Formally, given an input sentence w = w_1...w_n and a predicate word prd = w_p (1 ≤ p ≤ n), the task is to recognize the semantic roles of prd in the sentence, such as A0, A1, AM-ADV, etc. We denote the whole role set as R. Each role corresponds to a word span w_j...w_k (1 ≤ j ≤ k ≤ n). Taking Figure 1 as an example, "Ms. Haag" is the A0 role of the predicate "plays".

In the past few years, thanks to the success of deep learning, researchers have proposed effective neural-network-based models and improved SRL performance by large margins (Zhou and Xu 2015; He et al. 2017; Tan et al. 2018). Unlike traditional discrete-feature-based approaches that make heavy use of syntactic features, recent deep-neural-network-based approaches mostly work in an end-to-end fashion and give little consideration to syntactic knowledge.

Intuitively, syntax is strongly correlated with semantics. Taking Figure 1 as an example, the A0 role in the SRL structure is also the subject (marked by nsubj) in the dependency tree, and the A1 role is the direct object (marked by dobj). In fact, the semantic A0 or A1 argument of a verb predicate is usually the syntactic subject or object interchangeably, according to the PropBank annotation guideline.

In this work, we investigate several previous approaches for encoding syntactic trees, and make a thorough study on whether extra syntax-aware representations are beneficial for neural SRL models. The four approaches, Tree-GRU, Shortest Dependency Path (SDP), Tree-based Position Feature (TPF), and Pattern Embedding (PE), try to encode useful syntactic information in the input dependency tree from different perspectives. Then, we use the encoded syntax-aware representation vectors as extra input word representations, requiring little change of the architecture of the basic SRL model.

∗ Zhenghua Li is the corresponding author.
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

For the base SRL model, we employ the recently proposed deep highway-BiLSTM model (He et al. 2017). Considering that the quality of the parsing results has great impact on the performance of syntax-aware SRL models, we employ the state-of-the-art biaffine parser to parse all the data in our work, which achieves 94.3% labeled parsing accuracy on the WSJ test data (Dozat and Manning 2017).

We conduct our experiments on the benchmark CoNLL-2005 dataset, comparing our syntax-aware SRL approaches with a strong baseline with external word representations from ELMo. Detailed error analyses also give us more insights on the investigated approaches. The results show that, with the extra syntax-aware representations, our approach achieves new state-of-the-art 85.6 F1 (single model) and 86.6 F1 (ensemble) on the test set, outperforming the corresponding strong baselines with ELMo by 0.8 and 1.0, respectively, demonstrating the usefulness of syntactic knowledge.

The Basic SRL Architecture

Following previous works (Zhou and Xu 2015; He et al. 2017; Tan et al. 2018), we also treat the task as a sequence labeling problem and try to find the highest-scoring tag sequence ŷ:

    ŷ = argmax_{y ∈ Y(w)} score(w, y)    (1)

where y_i ∈ R′ is the tag of the i-th word w_i, and Y(w) is the set of all legal sequences. Please note that R′ = ({B, I} × R) ∪ {O}.

In order to compute score(w, y), which is the score of a tag sequence y for w, we directly adopt the architecture of He et al. (2017), which consists of the following four components, as illustrated in Figure 2.

[Figure 2: The basic SRL architecture: an input layer (word plus predicate-indicator embeddings and extra x^syn inputs, e.g., "Ms./0 Haag/0 plays/1"), a BiLSTM encoder, a classification layer, and a Viterbi decoder producing tags such as B-A0 I-A0 B-V.]

The Input Layer. Given the sentence w = w_1...w_n and the predicate prd = w_p, the input of the network is the combination of the word embeddings and the predicate-indicator embeddings. Specifically, the input vector at the i-th time stamp is

    x_i = emb^word_{w_i} ⊕ emb^prd_{i==p}    (2)

where the predicate-indicator embedding emb^prd_0 is used for non-predicate positions and emb^prd_1 is used for the p-th position, in order to distinguish the predicate word from other words, as shown in Figure 2.

With the predicate-indicator embedding, the encoder component can represent the sentence in a predicate-specific way, leading to superior performance (Zhou and Xu 2015; He et al. 2017; Tan et al. 2018). However, the side effect is that we need to separately encode the sentence for each predicate, dramatically slowing down training and evaluation.

The BiLSTM Encoding Layer. Over the input layer, four stacking layers of BiLSTMs are applied to fully encode long-distance dependencies in the sentence and obtain rich predicate-specific token-level representations.

Moreover, He et al. (2017) propose to use highway connections (Srivastava, Greff, and Schmidhuber 2015; Zhang et al. 2016) to alleviate the vanishing gradient problem, improving the performance by 2% F1. As illustrated in Figure 2, the basic idea is to combine the input and output of an LSTM node in some way, and feed the combined result as the final output of the node into the next LSTM layer and the next time-stamp of the same LSTM layer. We use the outputs of the final (top) backward LSTM layer as the representation of each word, denoted as h_i.

The Classification Layer. With the representation vector h_i of the word w_i, we employ a linear transformation and a softmax operation to compute the probability distribution of different tags, denoted as p(r|w, i) (r ∈ R′).

The Decoder. With the local tag probabilities of each word, the score of a tag sequence is

    score(w, y) = Σ_{i=1}^{n} log p(y_i|w, i)    (3)

Finally, we employ the Viterbi algorithm to find the highest-scoring tag sequence and ensure the resulting sequence does not contain illegal tag transitions such as y_{i−1} = B-A0 and y_i = I-A1.
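To make the decoding step concrete, the following is a minimal sketch of constrained Viterbi decoding under Equations 1-3. It is our own illustration, not the released implementation; the standard BIO constraints assumed here are that an I-X tag may only follow B-X or I-X of the same role, and may not start a sequence.

```python
import numpy as np

def viterbi_decode(log_probs, tags):
    """Find the highest-scoring tag sequence under Equation 3, ruling out
    illegal BIO transitions such as B-A0 followed by I-A1.

    log_probs: (n, T) array with log p(tag | w, i) from the classifier.
    tags: list of T tag strings, e.g. ["B-A0", "I-A0", "B-V", "O", ...].
    """
    n, T = log_probs.shape

    def allowed(prev, cur):
        # I-X may only follow B-X or I-X of the same role X.
        if cur.startswith("I-"):
            return prev in ("B-" + cur[2:], "I-" + cur[2:])
        return True

    back = np.zeros((n, T), dtype=int)
    # Assumption: a sequence may not start with an I-X tag.
    score = np.array([log_probs[0, t] if not tags[t].startswith("I-")
                      else -np.inf for t in range(T)])
    for i in range(1, n):
        prev_score = score
        score = np.full(T, -np.inf)
        for t in range(T):
            # Best legal predecessor for tag t at position i.
            for s in range(T):
                if allowed(tags[s], tags[t]) and prev_score[s] > score[t]:
                    score[t] = prev_score[s]
                    back[i, t] = s
            score[t] += log_probs[i, t]
    # Backtrace from the best final tag.
    best = int(score.argmax())
    path = [best]
    for i in range(n - 1, 0, -1):
        best = back[i, best]
        path.append(best)
    return [tags[t] for t in reversed(path)]
```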
The Syntax-aware SRL Approaches

The previous section introduces the basic model architecture for SRL; in this section, we illustrate how to encode the syntactic features into a dense vector and use it as extra input. Intuitively, dependency syntax has a strong correlation with semantics. For instance, a subject of a verb in dependency trees usually corresponds to the agent or patient of the verb. Therefore, traditional discrete-feature based SRL approaches make heavy use of syntax-related features. In contrast, the state-of-the-art neural network based SRL models usually adopt the end-to-end framework without consulting the syntax.

This work tries to make a thorough investigation on whether integrating syntactic knowledge is beneficial for state-of-the-art neural network based SRL approaches. We investigate and compare four different approaches for encoding syntactic trees.

• The Tree-GRU (tree-structured gated recurrent unit) approach globally encodes an entire dependency tree once for all, and produces unified representations of all words in a predicate-independent way.

• The SDP (shortest dependency path) approach considers the shortest path between the focused word w_i and the predicate w_p, and uses the max-pooling of the syntactic labels in the path as the predicate-specific (different from Tree-GRU) syntax-aware representation of w_i.

• The TPF (tree position feature) approach considers the relative positions of w_i and w_p to their common ancestor word in the parse tree, and uses the embedding of such position features as the representation of w_i.

• The PE (pattern embedding) approach classifies the relationship between w_i and w_p in the parse tree into several pre-defined patterns (or types), and uses the embeddings of the pattern and a few dependency relations as the representation of w_i.

The four approaches encode dependency trees from different perspectives and in different ways. Each approach produces a syntax-aware representation of the focused word. Formally, we denote these representations as x^syn_i for word w_i. We treat these syntactic representations as the external model input, and concatenate them with the basic input x_i. In the following, we introduce the four approaches in detail.

The Tree-GRU Approach

As a straightforward method, the tree-structured recurrent neural network (Tree-RNN) (Tai, Socher, and Manning 2015; Chen et al. 2017) can globally encode a parse tree and return syntax-aware representation vectors of all words in the sentence. Previous works have successfully employed Tree-RNN for exploiting syntactic parse trees in different tasks, such as sentiment classification (Tai, Socher, and Manning 2015), relation extraction (Miwa and Bansal 2016; Feng et al. 2017) and machine translation (Chen et al. 2017; Wang et al. 2018).

Following Chen et al. (2017), we employ a bi-directional Tree-GRU to encode the dependency tree of the input sentence, as illustrated in Figure 3.

[Figure 3: Bi-directional (bottom-up and top-down) Tree-GRU over the dependency tree of "Ms. Haag plays Elianti .".]

The bottom-up Tree-GRU computes the representation vector h↑_i of the word w_i based on its children, as marked by the red lines in Figure 3. The detailed equations are as follows:

    h̄↑_{i,L} = Σ_{j ∈ lchild(i)} h↑_j        h̄↑_{i,R} = Σ_{k ∈ rchild(i)} h↑_k
    r_{i,L} = σ(W^{rL} l_i + U^{rL} h̄↑_{i,L} + V^{rL} h̄↑_{i,R})
    r_{i,R} = σ(W^{rR} l_i + U^{rR} h̄↑_{i,L} + V^{rR} h̄↑_{i,R})
    z_{i,L} = σ(W^{zL} l_i + U^{zL} h̄↑_{i,L} + V^{zL} h̄↑_{i,R})
    z_{i,R} = σ(W^{zR} l_i + U^{zR} h̄↑_{i,L} + V^{zR} h̄↑_{i,R})
    z_i = σ(W^{z} l_i + U^{z} h̄↑_{i,L} + V^{z} h̄↑_{i,R})
    ĥ↑_i = tanh( W l_i + U(r_{i,L} ⊙ h̄↑_{i,L}) + V(r_{i,R} ⊙ h̄↑_{i,R}) )
    h↑_i = z_{i,L} ⊙ h̄↑_{i,L} + z_{i,R} ⊙ h̄↑_{i,R} + z_i ⊙ ĥ↑_i    (4)

where lchild(i)/rchild(i) denote the sets of left-side/right-side children of w_i, l_i is the embedding of the syntactic label between w_i and its head, and the Ws, Us and Vs are all model parameters.

Analogously, the top-down Tree-GRU computes the representation vector h↓_i of the word w_i based on its parent node, as marked by the blue lines in Figure 3. We omit the equations for brevity.

We use the concatenation of the two resulting hidden vectors as the final representation of the word w_i:

    x^syn_i = h↑_i ⊕ h↓_i    (5)
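As a concrete reading of Equation 4, the sketch below implements one bottom-up Tree-GRU node in PyTorch. It is our own illustration under assumed conventions (parameter naming, zero vectors for childless sides), not the authors' released code; the top-down direction is the mirror image over parents.

```python
import torch
import torch.nn as nn

class BottomUpTreeGRUCell(nn.Module):
    """One bottom-up Tree-GRU step (Equation 4). For each gate, W maps the
    label embedding l_i, U maps the summed left-child state, and V maps
    the summed right-child state."""

    GATES = ("rL", "rR", "zL", "zR", "z", "h")

    def __init__(self, label_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.W = nn.ModuleDict({g: nn.Linear(label_dim, hidden_dim, bias=False)
                                for g in self.GATES})
        self.U = nn.ModuleDict({g: nn.Linear(hidden_dim, hidden_dim, bias=False)
                                for g in self.GATES})
        self.V = nn.ModuleDict({g: nn.Linear(hidden_dim, hidden_dim, bias=False)
                                for g in self.GATES})

    def forward(self, l_i, left_children, right_children):
        # h_bar_{i,L}, h_bar_{i,R}: sums of the children's hidden states
        # (using a zero vector for a childless side is our assumption).
        zero = l_i.new_zeros(self.hidden_dim)
        hL = torch.stack(left_children).sum(0) if left_children else zero
        hR = torch.stack(right_children).sum(0) if right_children else zero

        def gate(g, act=torch.sigmoid):
            return act(self.W[g](l_i) + self.U[g](hL) + self.V[g](hR))

        rL, rR = gate("rL"), gate("rR")                # reset gates
        zL, zR, z = gate("zL"), gate("zR"), gate("z")  # update gates
        # Candidate state from the reset-gated child sums.
        h_hat = torch.tanh(self.W["h"](l_i) + self.U["h"](rL * hL)
                           + self.V["h"](rR * hR))
        # Interpolate the two child sums and the candidate.
        return zL * hL + zR * hR + z * h_hat
```

Running the cell over the tree in bottom-up order (children before heads) yields h↑_i for every word; a symmetric cell over parents gives h↓_i, and their concatenation is x^syn_i (Equation 5).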
The SDP Approach

SDP-based features have been extensively used in traditional discrete-feature based approaches for exploiting syntactic knowledge in relation extraction. The idea is to consider the shortest path between two focused words in the dependency tree, and extract path-related features as syntactic clues.

Xu et al. (2015) first adapt the SDP to the neural network setting in the task of relation classification. They treat the SDP as two sub-paths from the two focused entity words to their lowest common ancestor, and run two LSTMs respectively along the two sub-paths to obtain extra syntax-aware representations, leading to improved performance.

Directly employing LSTMs on SDPs leads to a prohibitive efficiency problem in our scenario, because we have O(n) paths given a sentence and a predicate. Therefore, we adopt the max pooling operation to obtain a representation vector over an SDP. Following Xu et al. (2015), we divide the SDP into two parts at the position of the lowest common ancestor, in order to distinguish the directions, as shown in Equation 6 and Figure 4.

[Figure 4: The SDP approach: the blue line marks the path from the focused word "Ms." to the common ancestor "plays", and the red line marks the path from the predicate to the ancestor.]

    x^syn_i = MaxPool_{j ∈ path(i,a)} (l_j) ⊕ MaxPool_{k ∈ path(p,a)} (l_k)    (6)

where path(i, a) is the set of all the words along the path from the focused word w_i to the lowest common ancestor w_a, and path(p, a) is the path from the predicate w_p to w_a; l_j is the embedding of the dependency relation label between w_j and its parent.
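A small sketch of Equation 6 follows, under our own data layout (a head array for the tree and a matrix of label embeddings); whether the ancestor's own arc label is included in the pool is our reading of the equation, not something the paper pins down.

```python
import torch

def ancestors(i, head):
    """Indices on the chain from word i up to the root, inclusive;
    head[j] is the parent index of word j, with -1 for the root."""
    chain = [i]
    while head[chain[-1]] != -1:
        chain.append(head[chain[-1]])
    return chain

def sdp_repr(i, p, head, label_emb):
    """Equation 6: max-pool the label embeddings along the two sub-paths
    (focused word -> ancestor, predicate -> ancestor) and concatenate.
    label_emb is an (n, d) tensor whose row j holds l_j, the embedding of
    the label on the arc from w_j to its head."""
    up_i, up_p = ancestors(i, head), ancestors(p, head)
    lca = next(a for a in up_i if a in up_p)   # lowest common ancestor
    path_i = up_i[:up_i.index(lca) + 1]        # w_i ... w_a
    path_p = up_p[:up_p.index(lca) + 1]        # w_p ... w_a
    pool_i = label_emb[path_i].max(dim=0).values
    pool_p = label_emb[path_p].max(dim=0).values
    return torch.cat([pool_i, pool_p])         # x_i^syn
```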

The TPF Approach

Collobert et al. (2011) first propose position features for the task of SRL. The basic idea is to use the embedding of the distance (as a discrete number) between the predicate word and the focused word as extra input.

In order to use syntactic trees, Yang et al. (2016) extend the position features and propose the tree-based position features (TPF) for the task of relation classification. In this work, we directly adopt the TPF approach of Yang et al. (2016) for encoding syntactic knowledge.¹ Figure 5 gives an example. The number pairs in the parentheses are the TPFs of the corresponding words. The first number is the distance from the predicate in concern to the lowest common ancestor, and the second number is the distance from the focused word to the ancestor. For instance, suppose "Ms." is the focused word and "plays" is the predicate. Their lowest common ancestor is "plays", which is the predicate itself. There are 2 dependencies in the path from "Ms." to "plays". Therefore, the TPF of "Ms." is "(0, 2)".

[Figure 5: Example of the Tree-based Position Feature over the tree of "Ms. Haag plays Elianti .": "plays" has TPF (0, 0); "Haag", "Elianti" and "." have TPF (0, 1); "Ms." has TPF (0, 2). The number-tuples in brackets are the relative positions of TPF.]

Then, we embed the TPF of each word into a dense vector through a lookup operation, and use it as the syntax-related representation:

    x^syn_i = emb^TPF_{f_i}    (7)

where f_i is the TPF of w_i.

¹ Yang et al. (2016) propose two versions of TPF. We directly use the Tree-based Position Feature 2 due to its better performance.

The PE Approach

Jiang et al. (2018) propose the PE approach for the task of treebank conversion, which aims to convert a parse tree following the source-side annotation guideline into another tree following the target-side guideline. The basic idea is to classify the relationship between two given words in the source-side tree into several pre-defined patterns (or types), and use the embedding of the pattern and a few dependency relations as the source-side syntax representation.

In this work, we adopt their PE approach for encoding syntactic knowledge. Figure 6 lists the patterns used in this work. Given a focused word w_i and a predicate w_p, we first decide the pattern type according to the structural relationship between w_i and w_p, denoted as pt(i, p). Taking w_i = "Ms." and w_p = "plays" as an example, since "Ms." is the grandchild of "plays", the pattern is "grand". Then we embed pt(i, p) into a dense vector emb^PE_{pt(i,p)}. We also use the embeddings of three highly related syntactic labels as extra representations, i.e., l_i, l_a, l_p, where l_a is the syntactic label between the lowest common ancestor w_a and its father. Then, the four embeddings are concatenated as x^syn_i to represent the structural information of w_i and w_p in a dependency tree:

    x^syn_i = emb^PE_{pt(i,p)} ⊕ l_i ⊕ l_a ⊕ l_p    (8)

[Figure 6: Patterns used in this work: self (i(p) ← p), child (i ← p), grand (i ← j ← p), sibling (i ← j → p), reverse child (i → p), reverse grand (i → j → p), and else (bucketed as {3; 4−5; 6; ≥ 7}).]
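Since TPF and PE both hinge on the lowest common ancestor, the following sketch derives both from one computation, reusing the ancestors() helper from the SDP sketch above. It is our own illustration: the pattern names follow Figure 6, and bucketing the "else" case by total path length ({3; 4-5; 6; >=7}) is our reading of the figure.

```python
def tpf_and_pattern(i, p, head):
    """Return the TPF pair (Equation 7, Figure 5) and the PE pattern type
    (Figure 6) for focused word i and predicate p."""
    up_i, up_p = ancestors(i, head), ancestors(p, head)
    lca = next(a for a in up_i if a in up_p)
    d_i = up_i.index(lca)   # arcs from the focused word up to the ancestor
    d_p = up_p.index(lca)   # arcs from the predicate up to the ancestor
    tpf = (d_p, d_i)        # e.g. "Ms." vs. "plays" in Figure 5 -> (0, 2)

    if i == p:
        return tpf, "self"
    patterns = {(1, 0): "child", (2, 0): "grand", (1, 1): "sibling",
                (0, 1): "reverse child", (0, 2): "reverse grand"}
    if (d_i, d_p) in patterns:
        return tpf, patterns[(d_i, d_p)]
    # "else" case, bucketed by total path length (our assumption).
    total = d_i + d_p
    bucket = ("3" if total == 3 else "4-5" if total <= 5
              else "6" if total == 6 else ">=7")
    return tpf, "else:" + bucket
```

In a full model, the returned tuple and pattern string would index the emb^TPF and emb^PE lookup tables, respectively.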

Experiments

Settings

Data. Following previous works, we adopt the CoNLL-2005 dataset with the standard data split: sections 02-21 of the Wall Street Journal (WSJ) corpus as the training dataset, section 24 as the development dataset, section 23 as the in-domain test dataset, and sections 01-03 of the Brown corpus as the out-of-domain test dataset (Carreras and Màrquez 2005). Table 1 shows the statistics of the data.

           Train     Dev      WSJ Test   Brown Test
  #Sent    39,832    1,346    2,416      426
  #Tok     950,028   32,853   56,684     7,159
  #Pred    90,750    3,248    5,267      804
  #Arg     239,858   8,346    14,077     2,177

Table 1: Data statistics of the CoNLL-2005 dataset.

                                               WSJ Test                 Brown Test            Combined
            Methods                        P     R     F1    Comp.   P     R     F1    Comp.     F1
  Single    Baseline (He et al., 2017)    83.1  83.0  83.1   64.3   72.9  71.4  72.1   44.8     81.6
            Baseline (our re-impl.)       83.4  83.0  83.2   64.9   72.3  70.8  71.6   44.3     81.6
            Tree-GRU                      83.9  83.6  83.8   65.2   72.9  71.1  72.3   44.9     82.2
            SDP                           84.2  83.9  84.0   65.9   74.0  72.0  73.0   45.2     82.6
            TPF                           84.3  83.8  84.1   65.9   73.7  72.0  72.9   45.5     82.6
            PE                            83.7  83.8  83.8   65.3   73.4  72.5  73.0   46.1     82.3
  Single    Baseline                      86.3  86.2  86.3   69.4   75.2  74.3  74.7   48.1     84.8
  +ELMo     Tree-GRU                      86.2  86.2  86.2   68.9   77.9  75.6  76.7   50.8     84.9
            SDP                           86.9  86.7  86.8   70.2   78.0  76.3  77.1   52.4     85.5
            TPF                           87.0  86.8  86.9   70.4   77.6  75.9  76.8   51.6     85.6
            PE                            86.5  86.3  86.4   69.6   77.4  76.4  76.9   51.5     85.1
  Ensemble  5× Baseline (He et al., 2017) 85.0  84.3  84.6   66.5   74.9  72.4  73.6   46.5     83.2
            5× Baseline (our re-impl.)    84.6  84.0  84.3   66.4   74.9  72.1  73.5   46.1     82.9
            5× TPF                        85.6  85.0  85.3   68.1   75.9  73.4  74.8   48.1     83.9
            4 Syntax-aware Methods        85.8  85.5  85.6   68.7   76.3  74.5  75.4   49.3     84.3
  Ensemble  5× Baseline                   87.2  86.8  87.0   70.8   77.7  75.8  76.7   50.4     85.6
  +ELMo     5× TPF                        87.5  87.0  87.3   71.1   78.6  76.5  77.5   52.5     86.0
            4 Syntax-aware Methods        88.0  87.6  87.8   72.2   79.7  78.0  78.8   53.2     86.6

Table 2: Comparison of the baseline model and all our syntax-aware methods on the CoNLL-2005 dataset. We report precision (P), recall (R), F1, and the percentage of completely correct predicates (Comp.).

Dependency Parsing. In recent years, neural network based dependency parsing has achieved significant progress. We adopt the state-of-the-art biaffine parser proposed by Dozat and Manning (2017) in this work. We use the original phrase-structure Penn Treebank (PTB) data to produce the dependency structures for the CoNLL-2005 SRL data. Following standard practice in the dependency parsing community, the phrase-structure trees are converted into Stanford dependencies using the Stanford Parser v3.3.0.² Since the biaffine parser needs part-of-speech (POS) tags as inputs, we use an in-house CRF-based POS tagger to produce automatic POS tags on all the data. After training, the biaffine parser achieves 94.3% parsing accuracy (LAS) on the WSJ test dataset. Additionally, we use 5-way jackknifing to obtain the automatic POS tagging and dependency parsing results of the training data.

² https://nlp.stanford.edu/software/lex-parser.html

ELMo. Peters et al. (2018) recently propose to produce contextualized word representations (ELMo) with an unsupervised language model learning objective, and show that simply using the learned external word representations as extra inputs can effectively boost performance on a variety of tasks, including SRL. To further investigate the effectiveness of our syntax-aware methods, we build a stronger baseline with the ELMo representations as extra input.

Evaluation. We adopt the official script provided by CoNLL-2005³ for evaluation. We conduct significance tests using Dan Bikel's randomized parsing evaluation comparer.

³ http://www.cs.upc.edu/~srlconll/st05/st05.html

Implementation. We implement the baseline and all the syntax-aware methods with Pytorch 0.3.0.⁴

⁴ github.com/KiroSummer/Syntax-aware-Neural-SRL

Initialization and Hyper-parameters. For the parameter settings, we mostly follow the work of He et al. (2017). We adopt the Adadelta optimizer with ρ = 0.95 and ε = 1e−6, use a batch size of 80, and clip gradients with norm larger than 1.0. All the embedding dimensions (word, predicate-indicator, syntactic label, pattern, and TPF) are set to 100. All models are trained for 500 iterations on the training data, and we select the iteration that has the peak performance on the dev data.
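For readers reproducing the setup, the optimization settings above map onto PyTorch roughly as in the following sketch; it is written against the current PyTorch API rather than the Pytorch 0.3.0 used for the paper, and the model and loss construction are omitted.

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module):
    # Adadelta with rho = 0.95 and eps = 1e-6, as described above.
    return torch.optim.Adadelta(model.parameters(), rho=0.95, eps=1e-6)

def update(model: nn.Module, optimizer, loss: torch.Tensor) -> None:
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients whose global norm exceeds 1.0 before the step.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```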

Main Results

Table 2 shows the main results of different approaches on the CoNLL-2005 dataset. The results are presented in four major rows. For the ensembles of "5× Baseline" and "5× TPF", we randomly sample 4/5 of the training data and train one model each time. For the ensemble of the 4 syntax-aware approaches, each model is trained on the whole training data.

Results of single models are shown in the first major row. First, our re-implemented baseline of He et al. (2017) achieves nearly the same results as those reported in their paper. Second, the four syntax-aware approaches are similarly effective and improve the performance by 0.6 ∼ 1.0 in F1 (combined). All the improvements are statistically significant (p < 0.001). Third, we find that the Tree-GRU approach is slightly inferior to the other three. The possible reason is that Tree-GRU produces unified representations for the sentence without specific consideration of the given predicate, whereas the other three approaches derive predicate-specific representations.

Results of single models with ELMo are shown in the second major row, in which each single model is enhanced using the ELMo representations as extra inputs. We can see that the ELMo representations bring substantial improvements over the corresponding baselines by 2.7 ∼ 3.2 in F1 (combined). Compared with the stronger baseline w/ ELMo, SDP, TPF, and PE w/ ELMo still achieve absolute and significant (p < 0.001) improvements of 0.7, 0.8, and 0.3 in F1 (combined), respectively. Tree-GRU w/ ELMo increases F1 by 2.0 on the out-of-domain Brown test data, but decreases F1 by 0.1 on the in-domain WSJ test data, leading to an overall improvement of 0.1 in combined F1 (p > 0.05). Similarly to the results in the first major row, this again indicates that the predicate-independent Tree-GRU approach may not be suitable for syntax encoding in our SRL task, especially when the baseline is strong.

Results of ensemble models are shown in the third major row. Compared with the baseline single model, the baseline ensemble approaches increase F1 (combined) by 1.3. The F1 score of the "5× Baseline" of He et al. (2017) is 83.2%, whereas our re-implemented "5× Baseline" achieves 82.9% F1. We guess the 0.3 gap may be caused by the random factors in the 5-fold data split and parameter initialization. In our preliminary experiments, we ran the single model of He et al. (2017) several times using different random seeds for parameter initialization, and found about 0.1 ∼ 0.3 F1 variation. The ensemble of five TPF models further improves F1 (combined) over the ensemble of five baselines by 1.0 (p < 0.001). The ensemble of the four syntax-aware methods achieves the best performance in this scenario, outperforming the TPF ensemble by 0.4 F1. This indicates that the four syntax-aware approaches are discrepant and thus more complementary in the ensemble scenario than using a single method.

Results of ensemble models with ELMo are shown in the bottom major row, where each single model is enhanced with ELMo representations before the ensemble operation. Again, using ELMo representations as extra input greatly improves F1 (combined) by 2.1 ∼ 2.7 over the corresponding ensemble methods in the third major row. Similar to the above findings, the ensemble of five TPF models with ELMo improves F1 (combined) by 0.4 (p < 0.001) over the ensemble of five baselines with ELMo, and the ensemble of the four syntax-aware methods with ELMo further increases F1 (combined) by 0.6.

Overall, we can conclude that the syntax-aware approaches consistently improve SRL performance.

Comparison with Previous Works

Table 3 compares our approaches with previous works. In the single-model scenario, the TPF approach outperforms both He et al. (2017) and Strubell et al. (2018), and is only inferior to Tan et al. (2018), which is based on the recently proposed deep self-attention encoder (Vaswani et al. 2017). Using ELMo representations promotes our results to the state-of-the-art. We discuss the work of Strubell et al. (2018), which also makes use of syntactic information, in the related work section.

In the ensemble scenario, the findings are similar, and the ensemble of the four syntax-aware approaches with ELMo reaches new state-of-the-art performance.

                                             WSJ    Brown   Combined
  Single    TPF w/ ELMo                      86.9   76.8    85.6
            TPF                              84.1   72.9    82.6
            Strubell et al. (2018)           83.9   72.6    -
            He et al. (2017)                 83.1   72.1    81.6
            Tan et al. (2018)                84.8   74.1    83.4
  Ensemble  4 Syntax-aware Methods w/ ELMo   87.8   78.8    86.6
            4 Syntax-aware Methods           85.6   75.4    84.3
            He et al. (2017)                 84.6   73.6    83.2
            Tan et al. (2018)                86.1   74.8    84.6
            FitzGerald et al. (2015)         80.3   72.2    -

Table 3: Comparison with previous results.

Analysis

In this section, we conduct detailed analyses to better understand the improvements introduced by the syntax-aware approaches. We use the TPF approach as a case study since it consistently achieves the best performance in all scenarios. We sincerely thank Luheng He for kindly sharing her analysis scripts.

Using Gold-standard Syntax. To understand the upper-bound performance of the syntax-aware approaches, we use the gold-standard dependency trees as input and apply the TPF approach. Table 4 shows the results. The TPF method with gold-standard parse trees brings a very large improvement of 6.7 in F1 (combined) over the baseline without using ELMo, and 5.1 when using ELMo. This shows the usefulness and great potential of syntax-aware SRL approaches.

  Methods             Devel   WSJ    Brown   Combined
  Baseline            81.5    83.2   71.6    81.6
  TPF                 82.5    84.1   72.9    82.6
  TPF-Gold            88.4    89.6   79.8    88.3
  Baseline w/ ELMo    85.5    86.3   74.7    84.8
  TPF w/ ELMo         85.3    86.9   76.8    85.6
  TPF-Gold w/ ELMo    89.8    91.1   82.2    89.9

Table 4: Upper-bound analysis of the TPF method.

Long-distance Dependencies. To analyze the effect of syntactic information with regard to the distances between arguments and predicates, we compute and report F1 scores over different sets of arguments according to their distances from the predicates, as shown in Figure 7. It is clear that larger improvements are obtained for longer-distance arguments, in both scenarios of with and without ELMo representations. This demonstrates that syntactic knowledge effectively captures long-distance dependencies and thus is most beneficial for arguments that are far away from predicates.

[Figure 7: F1 with regard to the surface distance between arguments and predicates (0-1, 2-3, 4-7, ≥8 words), for the Baseline, TPF, Baseline w/ ELMo, and TPF w/ ELMo models.]

Error Type Breakdown. In order to understand the error distributions of different approaches in terms of different types of mistakes, we follow the work of He et al. (2017) and employ a set of oracle transformations on the system outputs to observe the relative F1 improvements from fixing various prediction errors incrementally. "Orig." corresponds to the F1 scores of the original model outputs. First, we fix the label errors in the model outputs, shown as "Fix Labels": if a predicted span matches a gold-standard one but has a wrong label, we assign the correct label to the span in the model outputs. Then, based on the results of "Fix Labels", we perform "Move Core Arg.": if a span is labeled as a core argument (i.e., A0-A5) but its boundaries are wrong, we move the span to its correct boundaries. Third, based on the results of "Move Core Arg.", we perform "Merge Spans": if two predicted spans can be merged to match a gold-standard span, we do so and assign the correct label. Fourth, we perform "Split Spans": if a predicted span can be split into two gold-standard spans, we do so and assign the correct labels to them. Fifth, if a predicted span's label matches an overlapping gold span, we perform "Fix Span Boundary" to correct its boundaries. Sixth, we perform "Drop Arg." to drop the predicted arguments that do not overlap with any gold spans. Finally, if a gold argument does not overlap with any predicted spans, we perform "Add Arg."

Figure 8 shows the results, from which we can see that 1) different approaches have similar error distributions, among which labeling errors account for the largest proportion, and span boundary errors (split, merge, boundary) also have a large share; 2) using automatic parse trees leads to consistent improvements over all error types; 3) using ELMo representations also consistently improves performance by a large margin; 4) using gold parse trees can effectively resolve almost all span boundary (including merge and split) errors.

[Figure 8: Performance of CoNLL-2005 models (Baseline, TPF, TPF w/ ELMo, TPF-Gold w/ ELMo) after performing the oracle transformations in order: Orig., Fix Labels, Move Core Arg., Merge Spans, Split Spans, Fix Span Boundary, Drop Arg., Add Arg.]
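As an illustration of how such oracle transformations operate, here is a sketch of the first one ("Fix Labels") over (start, end, label) span triples. This is our reading of the procedure, not He et al.'s analysis script.

```python
def fix_labels(pred_spans, gold_spans):
    """Oracle "Fix Labels": if a predicted span has the same boundaries as
    a gold span but a different label, replace its label with the gold
    one. Spans are (start, end, label) triples for one predicate."""
    gold_by_bounds = {(s, e): lab for s, e, lab in gold_spans}
    return [(s, e, gold_by_bounds.get((s, e), lab))
            for s, e, lab in pred_spans]
```

The remaining transformations follow the same shape: each takes the output of the previous one, repairs one error type against the gold spans, and the F1 is re-measured after each step.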

Related Work

Traditional discrete-feature based SRL models make heavy use of syntactic information (Swanson and Gordon 2006; Punyakanok, Roth, and Yih 2008). With the rapid development of deep learning in NLP, researchers have proposed several simple yet effective end-to-end neural network models with little consideration of syntactic knowledge (Zhou and Xu 2015; He et al. 2017; Tan et al. 2018).

Meanwhile, inspired by the success of syntactic features in traditional SRL approaches, researchers also try to enhance neural network based SRL approaches with syntax. He et al. (2017) show that a large improvement can be achieved by using gold-standard constituent trees as rule-based constraints during Viterbi decoding. Strubell et al. (2018) propose a syntactically-informed SRL approach based on the self-attention mechanism. The key idea is to introduce an auxiliary training objective that encourages one attention head to attend to each token's syntactic head word. They also use multi-task learning on POS tagging, dependency parsing, and SRL to obtain better encodings of the input sentence. We compare with their results in Table 3. Swayamdipta et al. (2018) propose a multi-task learning framework to incorporate a constituent parsing loss into other semantic-related tasks such as SRL and coreference resolution. They report a +0.8 F1 improvement over their baseline SRL model. Different from Swayamdipta et al. (2018), our work focuses on dependency parsing and tries to explicitly encode parse outputs to help SRL.

The dependency-based SRL task was started in the CoNLL-2008 shared task (Surdeanu et al. 2008), which aims to jointly tackle syntactic and semantic dependencies. There are also a few recent works on exploiting dependency trees for neural dependency-based SRL. Roth and Lapata (2016) propose an effective approach to obtain dependency path embeddings and use them as extra features in traditional discrete-feature based SRL. Marcheggiani and Titov (2017) propose a dependency tree encoder based on a graph convolutional network (GCN), which has a similar function as Tree-GRU, and stack the tree encoder over a sentence encoder based on multi-layer BiLSTMs. He et al. (2018) exploit dependency trees for dependency-based SRL by 1) using dependency label embeddings as extra inputs, and 2) employing a tree-based k-th order algorithm for argument pruning. Cai et al. (2018) propose a strong end-to-end neural approach for dependency-based SRL based on deep BiLSTM encoding and biaffine attention. They use dependency trees for argument pruning and find no improvement over the syntax-agnostic counterpart.

Conclusions

This paper makes a thorough investigation and comparison of four different approaches, i.e., Tree-GRU, SDP, TPF, and PE, for exploiting syntactic knowledge in neural SRL. The experimental results show that syntax is consistently helpful for improving SRL performance, even when the models are enhanced with external ELMo representations. By utilizing both ELMo and syntax-aware representations, our final models achieve new state-of-the-art performance in both single and ensemble scenarios on the benchmark CoNLL-2005 dataset. Detailed analyses show that syntax helps the most on arguments that are far away from predicates, due to the long-distance dependencies captured by syntactic trees. Moreover, there is still a large performance gap between using gold-standard and automatic parse trees, indicating that there is still large room for further research on syntax-aware SRL.

Acknowledgments

We thank our anonymous reviewers for their helpful comments. This work was supported by National Natural Science Foundation of China (Grant No. 61525205, 61876116, 61432013), and was partially supported by the joint research project of Alibaba and Soochow University.

References

Cai, J.; He, S.; Li, Z.; and Zhao, H. 2018. A full end-to-end semantic role labeler, syntax-agnostic over syntax-aware? In Proceedings of COLING, 2753–2765.

Carreras, X., and Màrquez, L. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of CoNLL, 152–164.

Chen, H.; Huang, S.; Chiang, D.; and Chen, J. 2017. Improved neural machine translation with a syntax-aware encoder and decoder. In Proceedings of ACL, 1936–1945.

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Dozat, T., and Manning, C. D. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of ICLR.

Feng, Y.; Zhang, H.; Hao, W.; and Chen, G. 2017. Joint extraction of entities and relations using reinforcement learning and deep learning. Computational Intelligence and Neuroscience 2017:1–11.

FitzGerald, N.; Täckström, O.; Ganchev, K.; and Das, D. 2015. Semantic role labeling with neural network factors. In Proceedings of EMNLP, 960–970.

Gao, Q., and Vogel, S. 2011. Corpus expansion for statistical machine translation with semantic role label substitution rules. In Proceedings of ACL, 294–298.

Genest, P.-E., and Lapalme, G. 2011. Framework for abstractive summarization using text-to-text generation. In Proceedings of MTTG, 64–73.

He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of ACL, 473–483.

He, S.; Li, Z.; Zhao, H.; and Bai, H. 2018. Syntax for semantic role labeling, to be, or not to be. In Proceedings of ACL, 2061–2071.

Jiang, X.; Li, Z.; Zhang, B.; Zhang, M.; Li, S.; and Si, L. 2018. Supervised treebank conversion: Data and approaches. In Proceedings of ACL, 2705–2715.

Khan, A.; Salim, N.; and Jaya Kumar, Y. 2015. A framework for multi-document abstractive summarization based on semantic role labelling. Applied Soft Computing 30(C):737–747.

Liu, D., and Gildea, D. 2010. Semantic role features for machine translation. In Proceedings of COLING, 716–724.

Marcheggiani, D., and Titov, I. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP, 1506–1515.

Meyers, A.; Reeves, R.; Macleod, C.; Szekely, R.; Zielinska, V.; Young, B.; and Grishman, R. 2004. The NomBank project: An interim report. In Proceedings of HLT-NAACL.

Miwa, M., and Bansal, M. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of ACL, 1105–1116.

Palmer, M.; Gildea, D.; and Kingsbury, P. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics 31(1):71–106.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, 2227–2237.

Punyakanok, V.; Roth, D.; and Yih, W.-t. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34(2):257–287.

Roth, M., and Lapata, M. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of ACL, 1192–1202.

Shen, D., and Lapata, M. 2007. Using semantic roles to improve question answering. In Proceedings of EMNLP-CoNLL, 12–21.

Srivastava, R. K.; Greff, K.; and Schmidhuber, J. 2015. Training very deep networks. In Proceedings of NIPS, 2377–2385.

Strubell, E.; Verga, P.; Andor, D.; Weiss, D.; and McCallum, A. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of EMNLP, 5027–5038.

Surdeanu, M.; Johansson, R.; Meyers, A.; Màrquez, L.; and Nivre, J. 2008. The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL, 159–177.

Swanson, R., and Gordon, A. S. 2006. A comparison of alternative parse tree paths for labeling semantic roles. In Proceedings of COLING/ACL, 811–818.

Swayamdipta, S.; Thomson, S.; Lee, K.; Zettlemoyer, L.; Dyer, C.; and Smith, N. A. 2018. Syntactic scaffolds for semantic structures. In Proceedings of EMNLP, 3772–3782.

Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL, 1556–1566.

Tan, Z.; Wang, M.; Xie, J.; Chen, Y.; and Shi, X. 2018. Deep semantic role labeling with self-attention. In Proceedings of AAAI, 4929–4936.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Proceedings of NIPS, 5998–6008.

Wang, H.; Bansal, M.; Gimpel, K.; and McAllester, D. 2015. Machine comprehension with syntax, frames, and semantics. In Proceedings of ACL-IJCNLP, 700–706.

Wang, Y.; Li, S.; Yang, J.; Sun, X.; and Wang, H. 2018. Tag-enhanced tree-structured neural networks for implicit discourse relation classification. In Proceedings of IJCNLP, 496–505.

Xu, Y.; Mou, L.; Li, G.; Chen, Y.; Peng, H.; and Jin, Z. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of EMNLP, 1785–1794.

Yang, Y.; Tong, Y.; Ma, S.; and Deng, Z.-H. 2016. A position encoding convolutional neural network based on dependency tree for relation classification. In Proceedings of EMNLP, 65–74.

Zhang, Y.; Chen, G.; Yu, D.; Yao, K.; Khudanpur, S.; and Glass, J. 2016. Highway long short-term memory RNNs for distant speech recognition. In Proceedings of ICASSP, 5755–5759.

Zhou, J., and Xu, W. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of ACL-IJCNLP, 1127–1137.
