
Encoder-Decoder Based Unified Semantic Role Labeling with Label-Aware Syntax

Hao Fei, Fei Li, Bobo Li, Donghong Ji* Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan, China {hao.fei, boboli, dhji}@whu.edu.cn, [email protected]

Abstract

Currently the unified semantic role labeling (SRL), which achieves predicate identification and argument role labeling in an end-to-end manner, has received growing interest. Recent works show that leveraging syntax knowledge significantly enhances SRL performance. In this paper, we investigate a novel unified SRL framework based on the sequence-to-sequence architecture, with enhancements on both the encoder and decoder sides. On the encoder side, we propose a novel label-aware graph convolutional network (LA-GCN) to encode both the syntactic dependency arcs and labels into BERT-based representations. On the decoder side, we design a pointer-network-based model for detecting predicates, arguments and roles jointly. Our pointer-net decoder makes each decision by consulting all the input elements in a global view, and is meanwhile syntax-aware through incorporating the syntax information from the LA-GCN. Besides, a high-order interacted attention is introduced into the decoder for leveraging previously recognized triplets to help the current decision. Empirical experiments show that our framework significantly outperforms all existing graph-based methods on the CoNLL09 and Universal Proposition Bank datasets. In-depth analysis demonstrates that our model can effectively capture the correlations between syntactic and SRL structures.

[Figure 1: An example of unified SRL over the sentence "... that you can click here to view it online". Above is the SRL structure (roles such as A0, A1, AM-LOC, AM-GOL) and below the dependency structure (arcs labeled nsubj, aux, advmod, mark, obj, advcl, etc., under the Root); the same color refers to the propositions under the same predicate.]

1 Introduction

Semantic role labeling (SRL), a shallow semantic parsing task that extracts the predicate-argument structure of sentences, has long been a fundamental natural language processing (NLP) task (Gildea and Jurafsky 2000; Pradhan et al. 2005; Zhao et al. 2009; Fei, Ren, and Ji 2020). Traditional SRL is divided into two pipeline subtasks: predicate identification (Scheible 2010) and argument role labeling (Pradhan et al. 2005). Recently, great efforts have been devoted to constructing unified SRL systems that solve the two pipeline steps in one shot via one unified model (He et al. 2018), as illustrated in Figure 1. Most current unified SRL models are graph-based, exhaustively enumerating all potential predicates and their corresponding arguments jointly (He et al. 2018; Cai et al. 2018; Li et al. 2019). Graph-based models improve SRL performance by reducing the error propagation of the pipeline scheme, meanwhile achieving an overall simplified procedure.

Moreover, much prior work has revealed that syntax features are extraordinarily effective for SRL (Roth and Lapata 2016; Marcheggiani and Titov 2017; Zhang, Wang, and Si 2019). Intuitively, the predicate-argument structure in SRL shares many connections with the underlying syntactic structure (i.e., the dependency tree), as illustrated in Figure 1. Nevertheless, most prior syntactic dependency-aware SRL models are limited to the mere use of dependency arcs (e.g., click-you), while neglecting dependency labels (e.g., nsubj) (Jin, Kawahara, and Kurohashi 2015; Roth and Lapata 2016; Strubell et al. 2018; Xia et al. 2019). Actually, dependency labels carry crucial evidence for SRL, since the information from neighboring nodes under distinct types of arcs contributes in different degrees (Kasai et al. 2019). In Figure 1, dependency arcs with nominal attributes (e.g., nsubj, obj) are more associated with core roles (e.g., A0-A5) in SRL, while arcs with modifying attributes (e.g., advmod) relate more to modifier roles (e.g., AM-*). On this basis, another line of work considers both dependency arcs and labels via graph convolutional networks (GCN) (Marcheggiani and Titov 2017).

*Corresponding author.
Copyright (c) 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In this work, we propose a novel unified SRL framework, which is fully orthogonal to the existing graph-based methods. The system leverages the encoder-decoder architecture (cf. Figure 2) and is able to jointly produce all the possible predicate-argument-role triplets. On the encoder side, we newly propose a label-aware GCN (namely LA-GCN) for encoding word representations, dependency arcs and labels (cf. Figure 2, left). The syntax GCN encoders in existing SRL works (Marcheggiani and Titov 2017) model dependency labels by creating label-specific GCN parameters, i.e., each dependency label owns a set of GCN parameters. Such an implicit modeling method could be inefficient and problematic to re-harness, while our LA-GCN encoder formalizes arcs and labels unitedly, encoding them simultaneously and normalizing them into connecting-strength distributions (cf. Figure 3).

On the decoder side, we design a novel decoding system for SRL based on the pointer network (Vinyals, Fortunato, and Jaitly 2015), which picks the element with the highest probability from the input tokens at each decoding step (cf. Figure 2, right). Pointer networks have been exploited for a wide range of NLP tasks, such as text summarization (See, Liu, and Manning 2017), syntactic parsing (Ma et al. 2018) and information extraction (Li, Ye, and Shang 2019), and can make decisions in consultation of all the input elements in a global scope (See, Liu, and Manning 2017; Ma et al. 2018; Li, Ye, and Shang 2019). Our decoder detects all possible arguments for each predicate in an incremental, close-first fashion, meanwhile the semantic role labeler assigns roles for the determined predicate-argument pairs. Besides the pointer-based decoding, we further propose two enhancements on the decoder side. First, a high-order interacted attention mechanism, where each decoding input is fused with the prior recognized triplets as high-order information (cf. Figure 4). At the meantime, we render the pointer to be fully syntax-aware by incorporating the previously yielded dependency information from the LA-GCN (cf. Figure 4).

We conduct evaluations on two SRL benchmark datasets, including CoNLL09 English (Hajič et al. 2009) and Universal Proposition Bank (Akbik et al. 2015; Akbik and Li 2016) for eight languages in total. Experimental results show that our framework outperforms all baselines significantly and achieves new state-of-the-art performances on unified SRL, in terms of both predicate identification and argument role labeling. Ablation studies are performed for comprehensively understanding the contribution of each proposed mechanism. Further analysis demonstrates that our model can effectively capture the correlations between dependency labels and semantic role labels. At last, we summarize our contributions in this study as below:
• To our knowledge, we are the first to present a unified SRL framework based on pointer networks, as an alternative to the current graph-based methods.
• We introduce a novel label-aware GCN encoder for simultaneously modeling syntactic dependency arcs and labels.
• The pointer at the current decision is encouraged to interact with previously recognized triplets via a proposed high-order interacted attention mechanism. We further render the pointer to be syntax-aware by re-harnessing the syntactic dependency label distribution yielded from the label-aware GCN.
• Our framework gives new state-of-the-art performances on SRL benchmarks. In-depth analysis uncovers that our model can well capture the correlations between syntactic and SRL structures.

[Figure 2: Our sequence-to-sequence framework for unified SRL. The encoder module (BERT encoder plus LA-GCN encoder) reads the input sentence headed by a sentinel token '<S>'; the decoder module (LSTM decoder with pointer and role labeler) outputs predicate-argument pairs (e.g., <click, you>, <view, it>, <view, online>) together with their role labels. If the pointer directs to a real word, a predicate-argument pair is generated and its role label is given by the role labeler; if the pointer directs to '<S>', the decoder moves on to the next word.]

2 Related Work

The task of semantic role labeling (SRL), pioneered by Gildea and Jurafsky (2000), can be roughly grouped into two schemes: the pipeline scheme and the unified scheme. Prior works traditionally separate SRL into two pipeline subtasks, i.e., predicate disambiguation and argument role labeling, mainly conducting argument role labeling based on the pre-identified predicate oracle. Earlier works employ hand-crafted features with machine learning classifiers (Pradhan et al. 2005; Punyakanok, Roth, and Yih 2008; Lei et al. 2015), while later researchers take advantage of the automatic distributed features of neural networks (FitzGerald et al. 2015; Roth and Lapata 2016; Marcheggiani and Titov 2017; Strubell et al. 2018). Recent efforts are paid to unified SRL, i.e., an end-to-end solution that handles both subtasks by one model. All the current unified SRL systems employ graph-based neural models, exhaustively enumerating all possible predicates and arguments, as well as the roles (He et al. 2018; Cai et al. 2018; Li et al. 2019).

Our work is also closely related to the application of pointer networks (Vinyals, Fortunato, and Jaitly 2015). Pointer networks have been extensively exploited for many NLP tasks, such as text summarization (See, Liu, and Manning 2017), syntactic parsing (Ma et al. 2018; Fernández-González and Gómez-Rodríguez 2020), information extraction (Li, Ye, and Shang 2019), etc. On the one hand, based on the encoder-decoder architecture (Sutskever, Vinyals, and Le 2014), the pointer network is able to yield results with lower model complexity. On the other hand, the pointer network can make decisions in consultation of all the input elements in a global viewpoint (See, Liu, and Manning 2017; Ma et al. 2018; Li, Ye, and Shang 2019). In this work, we present the first pointer-network-based solution for end-to-end SRL, as an alternative to the current graph-based methods.

In addition, our work also falls into syntax-aware SRL. Syntactic features, e.g., dependency structures, have long been proven effective for SRL tasks (Marcheggiani, Frolov, and Titov 2017; Zhang, Wang, and Si 2019; Fei et al. 2020). Yet most prior work merely utilized the dependency arcs, leaving the dependency type information unexploited, which however is equally crucial for the task. On the other hand, some works encode such dependency labels within a GCN model (Marcheggiani and Titov 2017), i.e., by creating label-specific GCN parameters. Unfortunately, such implicit integration of dependency labels could be less effective, since the dependency arcs and labels are intuitively closely related and should be modeled in a more unified manner. In this paper, we propose a label-aware GCN encoder for effectively modeling these two kinds of syntax information. We also improve our framework with a high-order interacted attention mechanism and a syntax-aware pointer mechanism, which further facilitate the procedure.

3 Preliminary

SRL Formalization. Following the current line of unified SRL studies (He et al. 2018; Cai et al. 2018; Li et al. 2019), we model the task as predicate-argument-role triplet prediction. Given an input sentence S = {w_1, ..., w_n}, our framework predicts a set of triplets Y = {..., <p_k, a_k, r_k>, ... | p_k ∈ P, a_k ∈ A, r_k ∈ R}, where P, A and R are all possible predicate tokens, argument tokens, and role labels, respectively. In Figure 2, we show the format of the final outputs for the corresponding input sentence. In this work, we mainly consider dependency-based SRL, detecting the dependency head of each argument (Li et al. 2019). Note that our framework can be flexibly extended to span-based SRL by further predicting the end boundary of an argument.

Pointer Network. The pointer mechanism functions by learning the conditional probability of an output at a position corresponding to an input token. The vanilla pointer network employs the attention mechanism (Bahdanau, Cho, and Bengio 2015; Luong, Pham, and Manning 2015) to accomplish the selection of a target from the input sequence.[1] Technically, given the encoding representations of input tokens [e_1, ..., e_n] and the current decoding representation s_t, we calculate the relatedness between s_t and each e_i, and normalize it over the inputs as below:

  v_{t,i} = Score(s_t, e_i) = Tanh(W_0 [s_t; e_i] + b),
  o_{t,i} = Softmax(v_{t,i}),  i = 1, ..., n.    (1)

We then take the position P_t with the maximum o_{t,i} over the input tokens w_i as the output of the t-th decoding step:

  P_t = Argmax_i (o_{t,i}).    (2)

Each pointing decision is made in consultation of all input tokens, which can be seen as a type of global interaction.

[1] We denote the vanilla pointer mechanism as the base pointer.
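For concreteness, a minimal PyTorch sketch of this base pointer (Eqs. 1-2) could look as follows; the tensor shapes are our own assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class BasePointer(nn.Module):
    """Vanilla pointer of Eqs. (1)-(2): score each input token against
    the current decoder state, then point at the argmax position."""
    def __init__(self, enc_dim: int, dec_dim: int):
        super().__init__()
        # W_0 maps the concatenation [s_t; e_i] to a scalar score.
        self.w0 = nn.Linear(dec_dim + enc_dim, 1)

    def forward(self, s_t: torch.Tensor, enc: torch.Tensor):
        # s_t: (batch, dec_dim); enc: (batch, n, enc_dim)
        n = enc.size(1)
        s_exp = s_t.unsqueeze(1).expand(-1, n, -1)          # (batch, n, dec_dim)
        v = torch.tanh(self.w0(torch.cat([s_exp, enc], dim=-1))).squeeze(-1)  # Eq. (1)
        o = torch.softmax(v, dim=-1)                        # distribution over inputs
        return o, o.argmax(dim=-1)                          # Eq. (2): pointed position
```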

4 Our Framework for Unified SRL

Overview. In Figure 2, we give an overall picture of our framework. In the encoder module, we first employ the pre-trained BERT language model (Devlin et al. 2019) to provide word representations for the input tokens. Then, the LA-GCN fuses rich syntax features and learns contextualized word representations. Finally, we take the shortcut-connected representations between the BERT representations and the LA-GCN representations as the final encoder representations. As shown in Figure 2, a sentinel token '<S>' is inserted at the head of the input sentence, which informs the system that the current decoding input (a candidate predicate) has no further argument.

In the decoder module, the decoder first takes the encoder representation e_j of the input token w_j and outputs the corresponding decoding representation s_t at each step t. Then the pointer directs to the position of an input token w_i, and if w_i is not '<S>', the decoder generates a predicate(w_j)-argument(w_i) pair. Further, the labeler assigns the role label for the pair. Note that the decoder takes tokens sequentially in the order of the input, but if a token w_j is determined to be a predicate, it will be re-input at the next decoding step, until no further argument is detected. The decoding process terminates once all the input tokens have been examined, and it then outputs all the triplets Y.
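The decoding protocol can be summarized in a short sketch (our own illustration, not the authors' code; `point` and `label_role` are hypothetical stand-ins for the pointer and the role labeler):

```python
def decode(n_tokens, point, label_role, sentinel=0):
    """Sketch of the unified decoding protocol described above.

    point(t, j)      -> pointed position at step t for current input w_j
    label_role(j, i) -> role label for the predicate w_j / argument w_i pair
    Position 0 holds the sentinel token '<S>'.
    """
    triplets, t, j = [], 0, 1           # j walks the real tokens in input order
    while j < n_tokens:
        i = point(t, j)                 # pointer consults all inputs globally
        t += 1
        if i == sentinel:
            j += 1                      # w_j has no (further) argument: next token
        else:                           # predicate w_j gains argument w_i
            triplets.append((j, i, label_role(j, i)))
            # w_j is re-input at the next step, until '<S>' is pointed at
    return triplets
```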

Encoder Module

Word Representation. We employ BERT (Devlin et al. 2019) to yield a contextualized word representation r_i^b for each token w_i:

  {r_1^b, ..., r_n^b} = BERT({x_1, ..., x_n}),    (3)

where r_i^b is the BERT output representation. We concatenate word-piece embeddings and position embeddings, i.e., x_i = x_i^w ⊕ x_i^P, as the BERT input embeddings.

Label-Aware GCN. We propose a label-aware GCN (LA-GCN) encoder for simultaneously modeling the dependency arcs and labels (Figure 3). We denote a_{i,j} = 1 if there is an arc between w_i and w_j, and a_{i,j} = 0 otherwise. π_{i,j} represents the dependency label between w_i and w_j. Besides the pre-defined dependency labels, we additionally add a 'self' label as the self-loop arc π_{i,i} for w_i, and a 'none' label for representing no arc between w_i and w_j. We use the embedding x_{i,j}^y for the label π_{i,j}. Our LA-GCN consists of L layers, and we denote the resulting hidden representation of w_i at the l-th layer as h_i^(l):

  h_i^(l) = ReLU( Σ_{j=1}^{n} β_{i,j}^(l) (W_1 · h_j^(l-1) + W_2 · x_{i,j}^y + b) ),    (4)

where β_{i,j}^(l) is the neighbor connecting-strength distribution:

  β_{i,j}^(l) = ( a_{i,j} · exp(r_i^n · r_j^n) ) / ( Σ_{j'=1}^{n} a_{i,j'} · exp(r_i^n · r_{j'}^n) ),    (5)

where r_i^n is the element-wise addition h_i^(l-1) + x_{i,j}^y. The neighbor connecting-strength distribution β_{i,j} encodes the information from both the dependency arcs and labels, and thus comprehensively reflects the syntactic attributes. The initial input representation of the first LA-GCN layer is h_i^(0) = r_i^b ⊕ x_i^pos, where x_i^pos is the POS tag embedding of w_i. We note that after L layers of information aggregation, much high-order information between neighbors in the graph can be fully retained, which will facilitate the SRL.

We then impose a residual connection between BERT and the last layer of the LA-GCN. The final encoder representations are the concatenation of the representations from the two encoders, which ensures minimum information loss from the BERT contextualized source:

  e_i = h_i^(L) ⊕ r_i^b.    (6)

[Figure 3: Illustration of the process of generating the encoder representation h_7 for the 7th word 'view' in Figure 2, using our proposed label-aware GCN: the neighbors '... to view it online' are aggregated through their dependency labels (self, mark, obj, advmod), label-aware inputs x_{7,6}, ..., x_{7,9}, connecting strengths β_{7,6}, ..., β_{7,9}, and Linear plus Add&Norm layers.]
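As a minimal sketch, one LA-GCN layer (Eqs. 4-5) might be written in PyTorch as below; this is an illustrative re-implementation under our own assumptions (one sentence at a time, label embeddings sharing the hidden size, and a dense 0/1 arc matrix with self-loops):

```python
import torch
import torch.nn as nn

class LAGCNLayer(nn.Module):
    """One label-aware GCN layer (Eqs. 4-5): dependency arcs a_{i,j} and label
    embeddings x^y_{i,j} are fused into a single connecting-strength
    distribution beta, instead of per-label GCN parameters."""

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)  # W_1: transforms neighbor states
        self.w2 = nn.Linear(dim, dim, bias=True)   # W_2 (and bias b): label embeddings

    def forward(self, h: torch.Tensor, x_lab: torch.Tensor, adj: torch.Tensor):
        # h: (n, dim) states h^{(l-1)}; x_lab: (n, n, dim) embeddings of labels
        # pi_{i,j} (incl. 'self' on the diagonal, 'none' for non-arcs);
        # adj: (n, n) 0/1 arc indicators a_{i,j} with self-loops on the diagonal.
        r_i = h.unsqueeze(1) + x_lab                # (n, n, dim): h_i + x^y_{i,j}
        r_j = h.unsqueeze(0) + x_lab                # (n, n, dim): h_j + x^y_{i,j}
        strength = adj * torch.exp((r_i * r_j).sum(-1))       # masked, unnormalized
        beta = strength / strength.sum(-1, keepdim=True).clamp(min=1e-9)  # Eq. (5)
        msg = self.w1(h).unsqueeze(0) + self.w2(x_lab)        # W_1 h_j + W_2 x^y + b
        return torch.relu((beta.unsqueeze(-1) * msg).sum(dim=1))          # Eq. (4)
```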
Decoder Module

We employ an LSTM (Hochreiter and Schmidhuber 1997) as our decoder. Suppose the system has already confirmed several triplets Y' = {..., <p, a, r>, ...}, with the corresponding representations r_k^trpl for these established triplets (elaborated later in Eq. 12), at decoding step t. The decoding input representation is u_t = e_* ⊕ e_*^†, where e_* is the encoder representation of the current (*-th) input token (from Eq. 6), and e_*^† is a high-order representation obtained via the high-order interacted attention introduced below. The LSTM cell then produces the decoding representation based on u_t:

  s_t = LSTMCell(u_t).    (7)

[Figure 4: Illustration of the input and output at the decoding step for 'view' in Figure 2: t is the current decoding step, i the index of an input word, 7 the index of the current input word, and r_k^trpl the previously recognized triplet representations. The high-order interacted attention fuses e_7 with the triplet representations into e_7^†, and the syntax-aware pointer scores s_t against each e_i^*.]

High-Order Interacted Attention. The representation e_*^† from the high-order interacted attention is calculated as:

  e_*^† = Σ_j α_j r_j^trpl,
  α_j = Softmax(µ_j),
  µ_j = Tanh(W_3 r_j^trpl + W_4 e_* + b).    (8)

By interacting with previously recognized triplets, the current decision can capture higher-order information. Otherwise, the decoding process for detecting the arguments of a predicate would be restricted to local features, which can be seen as a first-order model. Taking the example in Figure 4, for the predicate 'view', being informed by the prior triplet 'view-it-A1', the pointer tends to assign another token as argument rather than 'it' repeatedly.

Syntax-Aware Pointer Mechanism. Based on the decoding representation s_t, we can obtain the pointing result via the base pointer in Eq. (1). We now consider making full use of the previously yielded syntactic information in the LA-GCN, i.e., rendering the pointer syntax-aware. Specifically, for each encoder representation e_i, we calculate its corresponding syntax-aware counterpart e_i^‡:

  e_i^‡ = Σ_{j≠i} β_{i,j}^(L) e_j,    (9)

where β_{i,j}^(L) is the syntax connecting distribution at the last layer of the LA-GCN. We then concatenate e_i and e_i^‡ as the syntax-aware representation e_i^*, which together with s_t is fed into the pointer in Eq. (1) for producing the pointing position. The reuse of β_{i,j} can be understood as a way to further leverage the second-order syntactic structures concerning the current e_i for the pointer.

Role Labeler. Once a predicate-argument pair has been determined, the role labeler assigns a role for the predicate-argument pair via a biaffine layer (Dozat and Manning 2017):

  r^lb = g_1(r^p) W_5 g_2(r^a) + U_1 r^p + U_2 r^a + b,
  p_r = Softmax(r^lb),    (10)

where g_1(·) and g_2(·) are two feedforward layers, and r^p and r^a are the predicate and argument representations, given by:

  r^p = e^p ⊕ s,  r^a = e^a ⊕ s,    (11)

where e^p and e^a are the corresponding encoder representations (cf. Eq. 6) of the predicate and argument, and s is the decoding representation (cf. Eq. 7). For each established triplet <p, a, r>, we construct the corresponding representation r_k^trpl by applying element-wise addition on the representations r_k^a, r_k^p and r_k^lb of the corresponding predicate, argument and role:

  r_k^trpl = r_k^a + r_k^p + r_k^lb,    (12)

where r_k^trpl is used in the high-order interacted attention.
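The biaffine scorer of Eqs. (10)-(11) follows Dozat and Manning (2017); a compact sketch under our own shape assumptions (dim is the size of r^p and r^a, i.e., encoder dimension plus decoder dimension):

```python
import torch
import torch.nn as nn

class BiaffineRoleLabeler(nn.Module):
    """Role labeler of Eqs. (10)-(11): biaffine scoring between the
    predicate and argument representations r^p and r^a."""
    def __init__(self, dim: int, n_roles: int, mlp_dim: int = 300):
        super().__init__()
        self.g1 = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU())  # g_1
        self.g2 = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU())  # g_2
        self.w5 = nn.Parameter(torch.empty(n_roles, mlp_dim, mlp_dim))  # W_5
        nn.init.xavier_uniform_(self.w5)
        self.u1 = nn.Linear(dim, n_roles, bias=False)  # U_1
        self.u2 = nn.Linear(dim, n_roles, bias=True)   # U_2 (and bias b)

    def forward(self, e_p, e_a, s_t):
        # Eq. (11): fuse encoder representations with the decoder state s_t.
        r_p = torch.cat([e_p, s_t], dim=-1)            # (batch, dim)
        r_a = torch.cat([e_a, s_t], dim=-1)            # (batch, dim)
        gp, ga = self.g1(r_p), self.g2(r_a)            # (batch, mlp_dim)
        # Eq. (10): bilinear term g1(r^p) W_5 g2(r^a) per role + linear terms.
        bilin = torch.einsum('bi,rij,bj->br', gp, self.w5, ga)
        scores = bilin + self.u1(r_p) + self.u2(r_a)
        return torch.softmax(scores, dim=-1)           # p_r over role labels
```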

Learning

As described earlier, each decoding input in our framework depends on the last prediction. Following previous encoder-decoder studies, we adopt the teacher-forcing strategy (Williams and Zipser 1989) during training, maximizing the likelihood of the oracle pointer at each decoding frame. Besides, we change the order of the oracles during learning so that the process of detecting the arguments of a predicate is in a close-first manner. For example, for the predicate 'view' in Figure 1, the outputs will follow the from-near-to-far order 'it -> online -> you', instead of 'you -> it -> online' or other orders. This is more intuitive than the traditional reading order (from left to right), as humans tend to first grasp the core ideas and then dig into more details. It also facilitates the high-order interacted attention mechanism.

The training target is to minimize the cross-entropy loss between the predicted pointer distribution and the gold one:

  L_pointer = − Σ_{t=1}^{N} ô_t log o_t,    (13)

where N is the total number of decoding steps and ô_t is the gold pointing. There is also the role label loss:

  L_role = − Σ_{k=1}^{M} p̂_k log p_k,    (14)

where M is the total number of triplets and p̂_k is the k-th gold role label. We summarize them into the total loss:

  L = L_pointer + L_role + (λ/2) ||Θ||²,    (15)

where λ is the regularization coefficient for the ℓ2 norm term and Θ denotes the overall parameters.
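The close-first oracle ordering itself is a one-liner; a small sketch (ours, assuming the 1-based token indices of Figure 3, where 'view' is token 7):

```python
def close_first_order(pred_idx, arg_indices):
    """Reorder a predicate's gold arguments from near to far. E.g., for the
    predicate 'view' (idx 7) with arguments you (2), it (8), online (9),
    the oracle pointer targets become it -> online -> you, i.e. [8, 9, 2]."""
    return sorted(arg_indices, key=lambda i: (abs(i - pred_idx), i))

# usage: the oracle pointer targets follow this order under teacher forcing
assert close_first_order(7, [2, 8, 9]) == [8, 9, 2]
```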
5 Experiment and Evaluation

Setup

We train and evaluate all models on two SRL benchmarks: CoNLL09 (English) and Universal Proposition Bank (eight languages). We employ the official training, development and test sets of each dataset. The gold-standard syntactic features (POS tags and dependency trees) are also offered. In terms of hyper-parameters, since BERT is used, the size of word representations is 768.[2] The size of POS tag embeddings is 50. We use a 3-layer LA-GCN with 350 hidden units, and the output size of the LSTM decoder is 300. We adopt the Adam optimizer with an initial learning rate of 2e-5, mini-batch size 16 and regularization weight 0.12.

We consider two SRL setups: unified and pipeline SRL. For unified SRL, we make comparisons with He et al. (2018), Cai et al. (2018) and Li et al. (2019), all of which are graph-based neural models. We equip all the baselines with a label-parameter GCN encoder (denoted LP-GCN) for modeling syntactic dependency features (including arcs and labels). We also equip them with the vanilla GCN which does not encode dependency labels (denoted GCN*). For pipeline SRL, we use gold predicates for argument role labeling. We use precision (P), recall (R) and F1 scores for measuring argument role labeling (Arg). For predicate disambiguation (Prd), only F1 is presented. All models are trained and evaluated 5 times and the averaged values are reported.

[2] https://github.com/google-research/bert, base-cased version.

Table 1: Unified SRL results on CoNLL09 English data.

                                 In-domain (WSJ)         Out-of-domain (Brown)
                               P     R     F1    Prd     P     R     F1    Prd
Without BERT
  He et al. (2018)   LP-GCN  89.4  88.7  89.4  93.1   79.8  78.1  79.0  79.6
  Cai et al. (2018)  GCN*    89.7  89.0  89.5  94.2   80.0  78.3  79.0  80.0
                     LP-GCN  90.2  89.3  89.9  94.9   80.4  78.9  79.5  80.5
  Li et al. (2019)   GCN*    89.8  89.1  89.4  94.7   79.4  78.2  78.7  80.7
                     LP-GCN  90.2  89.5  89.9  95.2   79.9  78.5  79.2  81.5
  Ours               GCN*    89.6  89.0  89.3  94.5   79.0  78.3  78.6  80.2
                     LP-GCN  90.6  89.6  90.2  95.4   80.5  79.0  79.7  81.6
                     LA-GCN  90.8  90.0  90.4  95.5   80.7  79.3  80.2  81.8
With BERT
  Cai et al. (2018)  LP-GCN  91.2  92.0  91.7  95.7   82.1  82.8  83.9  85.6
                     LA-GCN  91.0  92.3  91.9  95.8   82.5  83.1  84.3  85.8
  Li et al. (2019)   LP-GCN  91.6  92.2  92.0  95.5   84.7  84.4  84.6  87.0
                     LA-GCN  92.0  92.4  92.2  95.9   85.4  85.1  85.2  87.8
  Ours               LA-GCN  92.5  92.5  92.5  96.2   85.6  85.3  85.4  90.0

Table 2: F1s of argument role labeling for unified SRL on UPB. All models use BERT and the LA-GCN encoder.

Table 3: Argument role labeling with gold predicates on CoNLL09. Values with * are from prior papers.

                                     P      R      F1
Pipeline Model
  Zhao et al. (2009)                 -      -     85.4*
  Björkelund et al. (2010)         87.1*  84.5*  85.8*
  FitzGerald et al. (2015)           -      -     86.7*
  Roth and Lapata (2016)           88.1*  85.3*  86.7*
  Marcheggiani and Titov (2017)    89.1*  86.8*  88.0*
Unified Model
  He et al. (2018)                  89.8   89.6   89.7
  Cai et al. (2018)                 89.9   90.2   90.0
  Li et al. (2019) +LA-GCN +BERT    92.3   92.6   92.4
  Ours                              92.8   92.9   92.8

Table 4: Ablation results (F1) on CoNLL09.

                                     Arg    Prd
Ours                                 93.2   97.4
  w/o POS                            92.6   97.0
  w/o BERT                           90.8   95.7
  w/o LA-GCN                         91.2   95.9
    w/o dependency label (GCN*)      91.9   96.2
    LP-GCN                           92.7   97.1
  w/o syntax-aware pointer           92.2   96.9
  w/o HI attention (Eq. 8)           92.0   96.5
  base pointer (Eq. 1)               91.7   96.1
  left-to-right parsing scheme       92.5   97.2
  w/o residual connection (Eq. 6)    91.9   96.4

Result and Analysis

Results on CoNLL09 English. In Table 1, we have the following observations. First, all the models with BERT-enhanced representations are much stronger than those without BERT, which is consistent with prior BERT-based studies (He, Li, and Zhao 2019; Fei, Zhang, and Ji 2020). Second, the methods modeling both syntactic dependency labels and arcs (i.e., LP-GCN and LA-GCN) consistently outperform the models using only dependency arcs (i.e., GCN*). Meanwhile, our proposed LA-GCN encoder presents clearly improved results compared with the LP-GCN encoder.

Most importantly, our framework significantly outperforms all the baselines on both test sets, for both argument role labeling and predicate disambiguation. For example, when equipped with LP-GCN and no BERT representations, ours wins 0.3% (90.2 vs. 89.9) and 0.5% (79.7 vs. 79.2) F1 gaps over Li et al. (2019) for 'Arg' on WSJ and Brown, respectively. With the LA-GCN, our model further increases the winning gaps over the baselines. The improvements largely lie in that the proposed syntax-aware pointer mechanism becomes available when equipped with the LA-GCN encoder, which additionally brings enhancements. Also, with our proposed LA-GCN plus BERT, the advantages are still retained.

Results on UPB. Table 2 shows the comparisons between different models on UPB. Above all, our pointer network wins the best overall results against all the graph-based baselines, with an average F1 score of 84.9%. The improvements from our model demonstrate its effectiveness on unified SRL over all strong baselines.

Argument Role Labeling with Gold Predicates. In Table 3, we show the performances of pipeline SRL on the CoNLL09 (WSJ) data. Compared with the traditional pipeline baselines, the unified systems perform better. More importantly, with the help of the LA-GCN encoder and BERT, our framework outperforms the two best baselines with 92.8% F1, maintaining its advances on standalone argument role labeling.

Ablation Study. In Table 4, we list the influence of the different proposed mechanisms on our model, based on the development set of CoNLL09 (WSJ). For the input features, the BERT representations are of the most prominent helpfulness. The dependency features (both arcs and labels) also show significant impact, judging from the removal of the LA-GCN. Looking into the dependency features, either modeling only the dependency arcs (via GCN*) or replacing our LA-GCN with LP-GCN causes the performance to drop.

Besides, both the syntax-aware pointer mechanism and the high-order interacted attention mechanism play important roles in the resulting framework. Especially when both mechanisms are unavailable (degrading into the base pointer), the performance decreases crucially. Finally, the results fall if the close-first parsing scheme is replaced with a vanilla left-to-right direction. Also, canceling the residual connection of the BERT representation results in substantial drops, as the utilities from the rich BERT contexts are hurt.

Improvement for Argument Roles. In Table 5, we present the results on high-frequency arguments on the UPB English test set. We compare with the best baseline, Li et al. (2019), under different GCN encoders. We note that when our model takes the LP-GCN encoder, the syntax-aware pointer mechanism is unusable. Both models use BERT representations for fair comparison.

First of all, when modeling the dependency type information, all argument roles can receive improvements. Next, using the same LP-GCN encoder, our pointer network achieves overall better scores than Li et al. (2019). Most prominently, our full framework (with LA-GCN) achieves the overall best weighted-average F1. Especially for those less occurring arguments, we find that the improvements are more significant. We give the credit to the stronger capability of our pointer network in collaboratively learning between the syntactic dependency structure and semantic role labeling, as well as to the label-aware encoder for unifiedly modeling the dependency arcs and labels, where the yielded syntax connecting-strength distribution combined with the syntax-aware pointer mechanism further contributes to the process.

[Table 5: Improvement for argument roles on UPB English: per-role F1 of Li et al. (2019) (GCN*, LP-GCN) and ours (LP-GCN, LA-GCN) over A0-A4, AM-ADJ, AM-ADV, AM-DIR, AM-DIS, AM-LOC, AM-MNR, AM-MOD, AM-NEG, AM-PRR, AM-TMP, C-A1, R-A0, R-A1, plus the weighted average. The recoverable weighted-average values include 72.5% and 73.3%, but the cell-level values are garbled in this preliminary copy and are not reproduced here.]

Correlations between Dependency Labels and Semantic Roles. Here we investigate the correlations between dependency labels and semantic roles, as discovered by our framework. Technically, when a predicate-argument pair receives its argument role label (by Eq. 10), we first collect, for the predicate and argument tokens (represented by e^p and e^a in Eq. 11, respectively), the syntactic weights (i.e., β_{i,j}^(L)) of each dependency label. For each argument role, we then normalize these weights over each dependency label into a distribution. We visualize the resulting correlations between dependency labels and semantic roles in Figure 5.

[Figure 5: Correlations between dependency labels (Universal Dependencies v1.4) and argument roles on UPB English, shown as a heatmap over the argument role labels (A0-A4, AM-*, etc.) and the dependency labels (nsubj, obj, advmod, obl, nmod, mark, acl, conj, ...), with values normalized to [0.0, 1.0].]
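Such a heatmap can be assembled with a simple aggregation; the sketch below is our own reading of the procedure above, with hypothetical record bookkeeping:

```python
from collections import defaultdict

def role_label_correlation(records):
    """Aggregate LA-GCN connecting strengths into a role-vs-dependency-label
    heatmap (our reading of the procedure described above).

    records: iterable of (role, dep_label, beta) tuples, one per neighbor of a
             predicate/argument token in a predicted triplet, where beta is the
             last-layer connecting strength beta^(L)_{i,j} toward that neighbor.
    Returns {role: {dep_label: weight}} with weights normalized per role.
    """
    acc = defaultdict(lambda: defaultdict(float))
    for role, dep_label, beta in records:
        acc[role][dep_label] += beta              # collect weight per (role, label)
    heat = {}
    for role, row in acc.items():                 # normalize each role's row
        z = sum(row.values()) or 1.0
        heat[role] = {lab: w / z for lab, w in row.items()}
    return heat
```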

We can learn that our framework has captured the underlying inter-dependency between the syntactic structure and the SRL structure from the diversified visualizations. Specifically, some interesting patterns can be observed. First, for each semantic role, only a small subset of dependency labels contributes; for example, the A4 role almost only depends on a few dependency labels, while the A1 role majorly relates to one label only. Such findings may provide a crucial foundation for the direction of unsupervised semantic role labeling that relies on the syntactic structures. Besides, we find that the framework tends to master such correlations accurately. This explains why our model can still achieve prominent improvements over the baseline model on those minority role labels, such as A4, AM-TMP, etc.

6 Conclusion

We proposed a unified SRL framework based on the pointer network. We further proposed a label-aware GCN (LA-GCN) encoder for modeling both syntactic dependency arcs and labels in a unified manner. Besides, a high-order interacted attention mechanism was introduced for leveraging prior recognized triplets in the current decision. Lastly, a syntax-aware pointer was proposed, which fully incorporates the syntactic dependency information yielded from the LA-GCN. Our framework significantly outperforms existing graph-based SRL models on CoNLL09 and Universal Proposition Bank. Further analysis demonstrates that our model effectively captures the correlations between dependency labels and semantic roles.
Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61772378), the National Key Research and Development Program of China (No. 2017YFC1200500), the Research Foundation of the Ministry of Education of China (No. 18JZD015), and the Major Projects of the National Social Science Foundation of China (No. 11&ZD189).

References

Akbik, A.; Chiticariu, L.; Danilevsky, M.; Li, Y.; Vaithyanathan, S.; and Zhu, H. 2015. Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. In Proceedings of the ACL, 397–407.

Akbik, A.; and Li, Y. 2016. Polyglot: Multilingual Semantic Role Labeling with Unified Labels. In Proceedings of the ACL, 1–6.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the ICLR.

Björkelund, A.; Bohnet, B.; Hafdell, L.; and Nugues, P. 2010. A High-Performance Syntactic and Semantic Dependency Parser. In Proceedings of the COLING, 33–36.

Cai, J.; He, S.; Li, Z.; and Zhao, H. 2018. A Full End-to-End Semantic Role Labeler, Syntactic-agnostic Over Syntactic-aware? In Proceedings of the COLING, 2753–2765.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL, 4171–4186.

Dozat, T.; and Manning, C. D. 2017. Deep Biaffine Attention for Neural Dependency Parsing. In Proceedings of the ICLR.

Fei, H.; Ren, Y.; and Ji, D. 2020. High-order Refining for End-to-end Chinese Semantic Role Labeling. In Proceedings of the AACL, 100–105.

Fei, H.; Zhang, M.; and Ji, D. 2020. Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus. In Proceedings of the ACL, 7014–7026.

Fei, H.; Zhang, M.; Li, F.; and Ji, D. 2020. Cross-lingual Semantic Role Labeling with Model Transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28: 2427–2437.

Fernández-González, D.; and Gómez-Rodríguez, C. 2020. Transition-based Semantic Dependency Parsing with Pointer Networks. In Proceedings of the ACL, 7035–7046.

FitzGerald, N.; Täckström, O.; Ganchev, K.; and Das, D. 2015. Semantic Role Labeling with Neural Network Factors. In Proceedings of the EMNLP, 960–970.

Gildea, D.; and Jurafsky, D. 2000. Automatic Labeling of Semantic Roles. In Proceedings of the ACL, 512–520.

Hajič, J.; Ciaramita, M.; Johansson, R.; Kawahara, D.; Martí, M. A.; Màrquez, L.; Meyers, A.; Nivre, J.; Padó, S.; Štěpánek, J.; Straňák, P.; Surdeanu, M.; Xue, N.; and Zhang, Y. 2009. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of the CoNLL, 1–18.

He, L.; Lee, K.; Levy, O.; and Zettlemoyer, L. 2018. Jointly Predicting Predicates and Arguments in Neural Semantic Role Labeling. In Proceedings of the ACL, 364–369.

He, S.; Li, Z.; and Zhao, H. 2019. Syntax-aware Multilingual Semantic Role Labeling. In Proceedings of the EMNLP, 5353–5362.

Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735–1780.

Jin, G.; Kawahara, D.; and Kurohashi, S. 2015. Chinese Semantic Role Labeling using High-quality Syntactic Knowledge. In Proceedings of the IJCNLP, 120–127.

Kasai, J.; Friedman, D.; Frank, R.; Radev, D.; and Rambow, O. 2019. Syntax-aware Neural Semantic Role Labeling with Supertags. In Proceedings of the NAACL, 701–709.

Lei, T.; Zhang, Y.; Màrquez, L.; Moschitti, A.; and Barzilay, R. 2015. High-Order Low-Rank Tensors for Semantic Role Labeling. In Proceedings of the NAACL, 1150–1160.

Li, J.; Ye, D.; and Shang, S. 2019. Adversarial Transfer for Named Entity Boundary Detection with Pointer Networks. In Proceedings of the IJCAI, 5053–5059.

Li, Z.; He, S.; Zhao, H.; Zhang, Y.; Zhang, Z.; Zhou, X.; and Zhou, X. 2019. Dependency or Span, End-to-End Uniform Semantic Role Labeling. In Proceedings of the AAAI, 6730–6737.

Luong, T.; Pham, H.; and Manning, C. D. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the EMNLP, 1412–1421.

Ma, X.; Hu, Z.; Liu, J.; Peng, N.; Neubig, G.; and Hovy, E. 2018. Stack-Pointer Networks for Dependency Parsing. In Proceedings of the ACL, 1403–1414.

Marcheggiani, D.; Frolov, A.; and Titov, I. 2017. A Simple and Accurate Syntax-Agnostic Neural Model for Dependency-based Semantic Role Labeling. In Proceedings of the CoNLL, 411–420.

Marcheggiani, D.; and Titov, I. 2017. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. In Proceedings of the EMNLP, 1506–1515.

Pradhan, S.; Ward, W.; Hacioglu, K.; Martin, J.; and Jurafsky, D. 2005. Semantic Role Labeling Using Different Syntactic Views. In Proceedings of the ACL, 581–588.

Punyakanok, V.; Roth, D.; and Yih, W. 2008. The Importance of Syntactic Parsing and Inference in Semantic Role Labeling. Computational Linguistics 34(2): 257–287.

Roth, M.; and Lapata, M. 2016. Neural Semantic Role Labeling with Dependency Path Embeddings. In Proceedings of the ACL, 1192–1202.

Scheible, C. 2010. An Evaluation of Predicate Argument Clustering using Pseudo-Disambiguation. In Proceedings of the LREC.

See, A.; Liu, P. J.; and Manning, C. D. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the ACL, 1073–1083.

Strubell, E.; Verga, P.; Andor, D.; Weiss, D.; and McCallum, A. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of the EMNLP, 5027–5038.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of the NeurIPS, 3104–3112.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is All you Need. In Proceedings of the NeurIPS, 5998–6008.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer Networks. In Proceedings of the NeurIPS, 2692–2700.

Williams, R. J.; and Zipser, D. 1989. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1(2): 270–280.

Xia, Q.; Li, Z.; Zhang, M.; Zhang, M.; Fu, G.; Wang, R.; and Si, L. 2019. Syntax-Aware Neural Semantic Role Labeling. In Proceedings of the AAAI, 7305–7313.

Zhang, Y.; Wang, R.; and Si, L. 2019. Syntax-Enhanced Self-Attention-Based Semantic Role Labeling. In Proceedings of the EMNLP, 616–626.

Zhao, H.; Chen, W.; Kazama, J.; Uchimoto, K.; and Torisawa, K. 2009. Multilingual Dependency Learning: Exploiting Rich Features for Tagging Syntactic and Semantic Dependencies. In Proceedings of the CoNLL, 61–66.