arXiv:1605.07515v2 [cs.CL] 18 Jul 2016 a inee yGle n uasy(02.In (2002). Jurafsky and Gildea by pioneered multi-document was 2012; al., et (Osman 2015). and al., et 2012), (Khan al., summarization et abstractive Xiong detection 2015), Jamal, and Paul 2011; al., et plagiarism (Aziz trans- of recent machine lation range with statistical wide including tasks, a applications processing that for language shown useful nomi- has natural are to work roles Previous applied semantic be syntactic predicates. also to nal can insensitive and are alternations dependency labels provided layer syntactic as the a such beyond provide relations, roles “what” abstraction Semantic of did of set “who” a “whom”). (e.g., to semanticto according relations of sentence arguments predefined a the in label predicates and identify h olof goal The Introduction 1 h ako eatcrl aeig(SRL) labeling role semantic of task The et bandb u method. our by obtained ments improve- qualitative showcase and la- belers, role semantic state-of-the-art previous over results improve can embeddings such that demonstrate experimentally We tions. representa- embedding suitable learns sub- and as paths dependency instances lexicalized of such sequences models. treats existing model by Our well handled predicates, not nested nominal are as and such subordinations phenomena, structures related syntactic and obser- complex the that by vation motivated is use approach techniques. makes Our modeling sequence that neural labeling of for model role novel semantic a introduces paper This erlSmni oeLbln ihDpnec ahEmbeddi Path Dependency with Labeling Role Semantic Neural eatcrl labeling role semantic Abstract subject colo nomtc,Uiest fEdinburgh of University Informatics, of School and 0Ciho tet dnug H 9AB EH8 Edinburgh Street, Crichton 10 ihe Roth Michael object { mroth,mlap SL sto is (SRL) nthat in , and } @inf.ed.ac.uk etrsaental omdlitrcin trig- interactions model to able indicator not syntactic and are lexical features simple difficulty The that in predicates. lies nominal rare as structures, well recursive as other or verbs, conjunctions raising nested and control include Examples training at time. la- infrequently appear the but for decision relevant beling are of that instances involves phenomena labeled short linguistic be fall to input often the results when However, achieve features accuracy. such high on based 2005), models al., statistical et (Palmer annotated PropBank as large such of resources, availability the fea- Through indicator tures. syntactic and lexical on heavily rely arguments to labels rep- 2008). of role Nugues, and form (Johansson assign better to a depen- provide resentation that trees found parse importance 2005; dency and al., et 2008) the (Pradhan al., et features confirmed Punyakanok valu- parse work syntactic relation- most Later of -argument as con- labeling ships. identified syntactic for were on able based trees features stituent work, their sentence the for systems SRL of Outputs 1: Table ehdtoberiigfunds raising trouble had He esrR H a trouble had *He TensorSRL akflwdaaye htms h argument the miss that analyses flawed Asterisks mark money). A1: money; Prop- of in getter defined (A0: Bank as roles predicted with shown are aetos*ehd[trouble had *He mate-tools hswr [He work This iel Lapata Mirella ayR H a trouble had *He easySRL aels*ehd[trouble had *He mateplus ytmAnalysis System otsmni oelbln prahst date to approaches labeling role semantic Most A0 a trouble had ] ruet of Arguments . raising raising A0 A0 ] ] raising raising raising [funds [funds [funds ngs [funds [funds A1 A1 raise He ]. ]. A1 A1 A1 . ]. ]. ]. gered by such phenomena. For instance, consider ROOT the sentence He had trouble raising funds and the analyses provided by four publicly available tools SBJ had OBJ in Table 1 (mate-tools, Bj¨orkelund et al. (2010); he trouble NMOD mateplus, Roth and Woodsend (2014); Ten- A0 raising OBJ sorSRL, Lei et al. (2015); and easySRL, raise.01 funds Lewis et al. (2015)). Despite all systems claiming A1 state-of-the-art or competitive performance, none of them is able to correctly identify He as the Figure 1: Dependency path (dotted) between the agent argument of the predicate raise. Given the predicate raising and the argument he. complex dependency path relation between the predicate and its argument, none of the systems actually identifies He as an argument at all. instance, are typically computed by forwarding a In this paper, we develop a new neural network one-hot vector representation from the input model that can be applied to the task of seman- layer of a neural network to its first hidden layer, tic role labeling. The goal of this model is to bet- usually by means of matrix multiplication and an ter handle control predicates and other phenomena optional non-linear function whose parameters are that can be observed from the dependency struc- learned during neural network training. ture of a sentence. In particular, we aim to model Here, we seek to compute real-valued vector the semantic relationships between a predicate and representations for dependency paths between a its arguments by analyzing the dependency path pair of hwi,w ji. We define a dependency between the predicate word and each argument path to be the sequence of nodes (representing head word. We consider lexicalized paths, which words) and edges (representing relations between we decompose into sequences of individual items, words) to be traversed on a dependency namely the words and dependency relations on to get from node wi to node w j. In the example in a path. We then apply long-short term memory Figure 1, the dependency path from raising to he networks (Hochreiter and Schmidhuber, 1997) to is raising −−−→NMOD trouble −−→OBJ had←−SBJ he. find a recurrent composition function that can re- Analogously to how word embeddings are com- construct an appropriate representation of the full puted, the simplest way to embed paths would path from its individual parts (Section 2). To en- be to represent each sequence as a one-hot vec- sure that representations are indicative of seman- tor. However, this is suboptimal for two reasons: tic relationships, we use semantic roles as target Firstly, we expect only a subset of dependency labels in a supervised setting (Section 3). paths to be attested frequently in our data and By modeling dependency paths as sequences of therefore many paths will be too sparse to learn words and dependencies, we implicitly address the reliable embeddings for them. Secondly, we hy- data sparsity problem. This is the case because we pothesize that dependency paths which share the use single words and individual dependency rela- same words, word categories or dependency re- tions as the basic units of our model. In contrast, lations should impact SRL decisions in similar previous SRL work only considered full syntactic ways. Thus, the words and relations on the path paths. Experiments on the CoNLL-2009 bench- should drive representation learning, rather than mark dataset show that our model is able to out- the full path on its own. The following sections perform the state-of-the-art in English (Section 4), describe how we address representation learning and that it improves SRL performance in other by means of modeling dependency paths as se- languages, including Chinese, German and Span- quences of items in a recurrent neural network. ish (Section 5). 2.1 Recurrent Neural Networks 2 Dependency Path Embeddings The recurrent model we use in this work is a vari- In the context of neural networks, the term embed- ant of the long-short term memory (LSTM) net- ding refers to the output of a function f within the work. It takes a sequence of items X = x1,...,xn network, which transforms an arbitrary input into as input, recurrently processes each item xt ∈ X at a real-valued vector output. Word embeddings, for a time, and finally returns one embedding state en next layer a part-of-speech tag, a word form, or a dependency relation. In the context of semantic role labeling, en we define each path as a sequence from a predicate to its potential argument.1 Specifically, we define

xn the first item x1 to correspond to the part-of-speech x1 tag of the predicate word wi, followed by its actual he N had V trouble N raising V ... word form, and the relation to the next word wi+1. The embedding of a dependency path corresponds subj obj nmod to the state en returned by the LSTM layer after the input of the last item, xn, which corresponds to Figure 2: Example input and embedding compu- the word form of the argument head word w j. An tation for the path from raising to he, given the example is shown in Figure 2. sentence he had trouble raising funds. LSTM time The main idea of this model and representation steps are displayed from right to left. is that word forms, word categories and depen- dency relations can all influence role labeling de- for the complete input sequence. For each time cisions. The word category and word form of the step t, the LSTM model updates an internal mem- predicate first determine which roles are plausible and what kinds of path configurations are to be ex- ory state mt that depends on the current input as pected. The relations and words seen on the path well as the previous memory state mt−1. In or- der to capture long-term dependencies, a so-called can then manipulate these expectations. In Fig- gating mechanism controls the extent to which ure 2, for instance, the verb raising complements each component of a memory cell state will be the phrase had trouble, which makes it likely that modified. In this work, we employ input gates i, the subject he is also the logical subject of raising. output gates o and (optional) forget gates f. We By using word forms, categories and depen- formalize the state of the network at each time dency relations as input items, we ensure that spe- step t as follows: cific words (e.g., those which are part of com- plex predicates) as well as various relation types (e.g., subject and object) can appropriately influ- σ mi xi i it = ([W mt−1]+ W xt + b ) (1) ence the representation of a path. While learn- mf xf f ft = σ([W mt−1]+ W xt + b ) (2) ing corresponding interactions, the network is also xm m able to determine which phrases and dependency mt = it ⊙ (W xt )+ ft ⊙ mt−1 + b (3) relations might not influence a role assignment de- o = σ([Wmom ]+ Wxox + bo) (4) t t t cision (e.g., coordinations). et = ot ⊙ σ(mt ) (5) 2.3 Joint Embedding and Feature Learning Our SRL model consists of four components de- In each equation, W describes a matrix of picted in Figure 3: (1) an LSTM component takes weights to project information between two lay- lexicalized dependency paths as input, (2) an ad- ers, b is a layer-specific vector of bias terms, and ditional input layer takes binary features as input, σ is the logistic function. Superscripts indicate (3) a hidden layer combines dependency path em- the corresponding layers or gates. Some models beddings and binary features using rectified linear described in Section 3 do not make use of forget units, and (4) a softmax classification layer pro- gates or memory-to-gate connections. In case no duces output based on the hidden layer state as in- forget gate is used, we set f = 1. If no memory- t put. We therefore learn path embeddings jointly to-gate connections are used, the terms in square with feature detectors based on traditional, binary brackets in (1), (2), and (4) are replaced by zeros. indicator features. 2.2 Embedding Dependency Paths Given a dependency path X, with steps xk ∈ x ,...,x , and a set of binary features B as in- We define the embedding of a dependency path { 1 n} put, we use the LSTM formalization from equa- to be the final memory output state of a recur- rent LSTM layer that takes a path as input, with 1We experimented with different sequential orders and each input step representing a binary indicator for found this to lead to the best validation set results. Input from path X x x PREDICATE 1 ... n He had trouble raising funds. IDENTIFICATION

i 1 ... sense PREDICATE m0 m1 h class f1 o1 e1 en s DISAMBIGUATION raise.01 (1) LSTM cell label c Vector of binary indicator 1st best arg (4) ARGUMENT features B (3) 2nd best arg IDENTIFICATION (2) ARG? ARG?

best label 2nd Figure 3: Neural model for joint learning of path ARGUMENT best label best embeddings and higher-order features: The path CLASSIFICATION A0 A1 A0 sequence x1 ...xn is fed into a LSTM layer, a hid- den layer h combines the final embedding en and binary input features B, and an output layer s as- raise.01 raise.01 raise.01 RERANKER he funds funds funds signs the highest probable class label c. A0 A1 score score best overall scoring structure tions (1–5) to compute the embedding en at time step n and formalize the state of the hidden layer h OUTPUT HeA0 had trouble raising fundsA1. and softmax output sc for each class category c as follows: Figure 4: Pipeline architecture of our SRL system. Bh eh h h = max(0,W B + W en + b ) (6) Wese + Whsh + bs (e.g., prepositions; Srikumar and Roth (2013)). s = c n c c (7) c Σ es hs s For both identification and disambiguation steps, i(Wi en + Wi h + bi ) we apply the same logistic regression classi- 3 System Architecture fiers used in the SRL components of mate-tools (Bj¨orkelund et al., 2010). The classifiers for both The overall architecture of our SRL sys- tasks make use of a range of lexico-syntactic indi- tem closely follows that of previous work cator features, including predicate word form, its (Toutanova et al., 2008; Bj¨orkelund et al., 2009) predicted part-of-speech tag as well as dependency and is depicted in Figure 4. We use a pipeline that relations to all syntactic children. consists of the following steps: predicate identi- fication and disambiguation, argument identifica- 3.2 Argument Identification and tion, argument classification, and re-ranking. The Classification neural-network components introduced in Sec- tion 2 are used in the last three steps. The follow- Given a sentence and a set of sense-disambiguated ing sub-sections describe all components in more predicates in it, the next two steps of our SRL sys- detail. tem are to identify all arguments of each pred- icate and to assign suitable role labels to them. 3.1 Predicate Identification and For both steps, we train several LSTM-based neu- Disambiguation ral network models as described in Section 2. In Given a syntactically analyzed sentence, the particular, we train separate networks for nomi- first two steps in an end-to-end SRL system nal and verbal predicates and for identification and are to identify and disambiguate the seman- classification. Following the findings of earlier tic predicates in the sentence. Here, we fo- work (Xue and Palmer, 2004), we assume that dif- cus on verbal and nominal predicates but note ferent feature sets are relevant for the respective that other syntactic categories have also been tasks and hence different embedding representa- construed as predicates in the NLP literature tions should be learned. As binary input features, we use the following sets from the SRL literature https://github.com/microth/PathLSTM. (Bj¨orkelund et al., 2010). Model selection We train argument iden- Lexico-syntactic features Word form and word tification and classification models us- category of the predicate and candidate argument; ing the XLBP toolkit for neural networks dependency relations from predicate and argument (Monner and Reggia, 2012). The hyperparam- to their respective syntactic heads; full depen- eters for each step were selected based on the dency path sequence from predicate to argument. CoNLL 2009 development set. For direct com- parison with previous work, we use the same Local context features Word forms and word preprocessing models and predicate-specific categories of the candidate argument’s and pred- SRL components as provided with mate-tools icate’s syntactic siblings and children words. (Bohnet, 2010; Bj¨orkelund et al., 2010). The Other features Relative position of the candi- types and ranges of hyperparameters considered date argument with respect to the predicate (left, are as follows: learning rate α ∈ [0.00006,0.3], self, right); sequence of part-of-speech tags of all dropout rate d ∈ [0.0,0.5], and hidden layer sizes words between the predicate and the argument. |e| ∈ [0,100], |h| ∈ [0,500]. In addition, we experimented with different gating mechanisms 3.3 Reranker (with/without forget gate) and memory access As all argument identification (and classification) settings (with/without connections between all decisions are independent of one another, we gates and the memory layer, cf. Section 2). apply as the last step of our pipeline a global The best parameters were chosen using the reranker. Given a predicate p, the reranker takes Spearmint hyperparameter optimization toolkit as input the n best sets of identified arguments as (Snoek et al., 2012), applied for approx. 200 well as their n best label assignments and predicts iterations, and are summarized in Table 2. the best overall argument structure. We implement Results The results of our in- and out-of-domain the reranker as a logistic regression classifier, with experiments are summarized in Tables 3 and 5, re- hidden and embedding layer states of identified spectively. We present results for different system arguments as features, offset by the argument la- configurations: ‘local’ systems make classification bel, and a binary label as output (1: best predicted decisions independently, whereas ‘global’ systems structure, 0: any other structure). At test time, we include a reranker or other global inference mech- select the structure with the highest overall score, anisms; ‘single’ refers to one model and ‘ensem- which we compute as the geometric mean of the ble’ refers to combinations of multiple models. global regression and all argument-specific scores. In the in-domain setting, our PathLSTM model 4 Experiments achieves 87.7% (single) and 87.9% (ensemble) F1- score, outperforming previously published best re- In this section, we demonstrate the usefulness of sults by 0.4 and 0.2 percentage points, respec- dependency path embeddings for semantic role la- tively. At a F1-score of 86.7%, our local model beling. Our hypotheses are that (1) modeling de- (using no reranker) reaches the same performance pendency paths as sequences will lead to better as state-of-the-art local models. Note that dif- representations for the SRL task, thus increasing ferences in results between systems might origi- labeling precision overall, and that (2) embeddings nate from the application of different preprocess- will address the problem of data sparsity, lead- ing techniques as each system comes with its own ing to higher recall. To test both hypotheses, we syntactic components. For direct comparison, we experiment on the in-domain and out-of-domain evaluate against mate-tools, which use the same test sets provided in the CoNLL-2009 shared task preprocessing techniques as PathLSTM. In com- (Hajiˇcet al., 2009) and compare results of our sys- parison, we see improvements of +0.8–1.0 per- tem, henceforth PathLSTM, with systems that centage points absolute in F1-score. do not involve path embeddings. We compute In the out-of-domain setting, our system precision, recall and F1-score using the official achieves new state-of-the-art results of 76.1% CoNLL-2009 scorer.2 The code is available at nal predicates or dependency annotations. We do not list any 2Some recently proposed SRL models are only evaluated results from those models here. on the CoNLL 2005 and 2012 data sets, which lack nomi- 3Results are taken from Lei et al. (2015). Argument labeling step forget gate memory→gates |e| |h| alpha dropout rate Identification (verb) − + 25 90 0.0006 0.42 Identification (noun) − + 16 125 0.0009 0.25 Classification (verb) + − 5 300 0.0155 0.50 Classification (noun) − − 88 500 0.0055 0.46

Table 2: Hyperparameters selected for best models and training procedures

System (local, single) PRF1 System (local, single) PRF1 Bj¨orkelund et al. (2010) 87.1 84.5 85.8 Bj¨orkelund et al. (2010) 75.7 72.2 73.9 Lei et al. (2015) − − 86.6 Lei et al. (2015) − − 75.6 FitzGerald et al. (2015) − − 86.7 FitzGerald et al. (2015) − − 75.2 PathLSTM w/o reranker 88.1 85.3 86.7 PathLSTM w/o reranker 76.9 73.8 75.3

System (global, single) PRF1 System (global, single) PRF1 Bj¨orkelund et al. (2010) 88.6 85.2 86.9 Bj¨orkelund et al. (2010) 77.9 73.6 75.7 Roth and Woodsend (2014)3 − − 86.3 Roth and Woodsend (2014)3 − − 75.9 FitzGerald et al. (2015) − − 87.3 FitzGerald et al. (2015) − − 75.2 PathLSTM 90.0 85.5 87.7 PathLSTM 78.6 73.8 76.1

System (global, ensemble) PRF1 System (global, ensemble) PRF1 FitzGerald et al. 10 models − − 87.7 FitzGerald et al. 10 models − − 75.5 PathLSTM 3 models 90.3 85.7 87.9 PathLSTM 3 models 79.7 73.6 76.5

Table 3: Results on the CoNLL-2009 in-domain Table 5: Results on the CoNLL-2009 out-of- test set. All numbers are in percent. domain test set. All numbers are in percent.

PathLSTM P (%) R (%) F (%) 1 identify and label arguments with high precision. w/o path embeddings 65.7 87.3 75.0 Figure 5 shows the effect of dependency path w/o binary features 73.2 33.3 45.8 embeddings at mitigating sparsity: if the path be- tween a predicate and its argument has not been Table 4: Ablation tests in the in-domain setting. observed at training time or only infrequently, conventional methods will often fail to assign a role. This is represented by the recall curve (single) and 76.5% (ensemble) F1-score, out- performing the previous best system by of mate-tools, which converges to zero for argu- Roth and Woodsend (2014) by 0.2 and 0.6 ments with unseen paths. The higher recall curve absolute points, respectively. In comparison to for PathLSTM demonstrates that path embeddings mate-tools, we observe absolute improvements in can alleviate this problem to some extent. For un- seen paths, we observe that PathLSTM improves F1-score of +0.4–0.8%. over mate-tools by an order of magnitude, from Discussion To determine the sources of indi- 0.9% to 9.6%. The highest absolute gain, from vidual improvements, we test PathLSTM models 12.8% to 24.2% recall, can be observed for depen- without specific feature types and directly com- dency paths that occurred between 1 and 10 times pare PathLSTM and mate-tools, both of which use during training. the same preprocessing methods. Table 4 presents Figure 7 plots role labeling performance for in-domain test results for our system when spe- sentences with varying number of words. There cific feature types are omitted. The overall low are two categories of sentences in which the im- results indicate that a combination of dependency provements of PathLSTM are most noticeable: path embeddings and binary features is required to Firstly, it better handles short sentences that con- (Nested) subject control Coordinations involving A0 KeatingA0 has conceded attempted to buy

Complements of nominal predicates

treasuryA0 ’s threat to trash

Coordinations involving A1 Relative clauses tradingA1 was stopped the firmA1 , which was involved and did not resume ◦ A0 • A1

Figure 6: Dots correspond to the path representation of a predicate-argument instance in 2D space. White/black color indicates A0/A1 gold argument labels. Dotted ellipses denote instances exhibiting related syntactic phenomena (see rectangles for a description and dotted rectangles for linguistic exam- ples). Example phrases show actual output produced by PathLSTM (underlined).

100 76.4 90.1 (+0.8) 90 36.4 88.3 89 24.2 49.6 89.3 87.9 (+0.4) 30.1 88 87.1 (+1.0) 10 87 87.5

12.8 -score (%)

9.6 1 F

Recall (%) 86 86.1 PathLSTM 85 PathLSTM 0.9 mate-tools 1 mate-tools 0 2 3 4 4 10 10 10 10 1–10 ≤ ≤ ≤ > 31– 1–10 11–15 16–20 21–25 26–30 Number of training instances with same path Number of words

Figure 5: Results on in-domain test instances, Figure 7: Results by sentence length. Improve- grouped by the number of training instances that ments over mate-tools shown in parentheses. have an identical (unlexicalized) dependency path.

learned dependency path representations in the tain expletives and/or nominal predicates (+0.8% embedding space (see Figure 6). We obtain absolute in F1-score). This is probably due to the a projection onto two dimensions using t-SNE fact that our learned dependency path representa- (Van der Maaten and Hinton, 2008). Interestingly, tions are lexicalized, making it possible to model we can see that different syntactic configurations argument structures of different nominals and dis- are clustered together in different parts of the tinguishing between expletive occurrences of ‘it’ space and that most instances of the PropBank and other subjects. Secondly, it improves perfor- roles A0 and A1 are separated. Example phrases mance on longer sentences (up to +1.0% absolute in the figure highlight predicate-argument pairs in F1-score). This is mainly due to the handling of that are correctly labeled by PathLSTM but not dependency paths that involve complex structures, by mate-tools. Path embeddings are essential for such as coordinations, control verbs and nominal handling these cases as indicator features do not predicates. generalize well enough. We collect instances of different syntactic phe- Finally, Table 6 shows results for nominal and nomena from the development set and plot the verbal predicates as well as for different (gold) Predicate POS Improvement Chinese P R F1 & Role Label PathLSTM over mate-tools PathLSTM 83.2 75.9 79.4 P(%) R(%) P(%) R(%) Bj¨orkelund et al. (2009) 82.4 75.1 78.6 Zhao et al. (2009) 80.4 75.2 77.7 verb / A0 90.8 89.2 −0.4 +1.8

verb / A1 91.0 91.9 +0.0 +1.1 German P R F1 verb / A2 84.3 76.9 +1.5 +0.0 PathLSTM 81.8 78.5 80.1 verb / AM 82.2 72.4 +2.9 −2.0 Bj¨orkelund et al. (2009) 81.2 78.3 79.7 noun / A0 86.9 78.2 +0.8 +3.3 Che et al. (2009) 82.1 75.4 78.6 noun / A1 87.5 84.4 +2.6 +2.2 noun / A2 82.4 76.8 +1.0 +2.1 Spanish P R F1 noun / AM 79.5 69.2 +0.9 −2.8 Zhao et al. (2009) 83.1 78.0 80.5 PathLSTM 83.2 77.4 80.2 Table 6: Results by word category and role label. Bj¨orkelund et al. (2009) 78.9 74.3 76.5 role labels. In comparison to mate-tools, we can Table 7: Results (in percentage) on the CoNLL- see that PathLSTM improves precision for all ar- 2009 test sets for Chinese, German and Spanish. gument types of nominal predicates. For ver- bal predicates, improvements can be observed in 6 Related Work terms of recall of proto-agent (A0) and proto- patient (A1) roles, with slight gains in precision Neural Networks for SRL for the A2 role. Overall, PathLSTM does slightly Collobert et al. (2011) pioneered neural net- worse with respect to modifier roles, which it la- works for the task of semantic role labeling. They bels with higher precision but at the cost of recall. developed a feed-forward network that uses a convolution function over windows of words 5 Path Embeddings in other Languages to assign SRL labels. Apart from constituency boundaries, their system does not make use of any In this section, we report results from additional syntactic information. Foland and Martin (2015) experiments on Chinese, German and Spanish extended their model and showcased significant data. The underlying question is to which extent improvements when including binary indicator the improvements of our SRL system for English features for dependency paths. Similar features also generalize to other languages. To answer this were used by FitzGerald et al. (2015), who in- question, we train and test separate SRL mod- clude role labeling predictions by neural networks els for each language, using the system architec- as factors in a global model. ture and hyperparameters discussed in Sections 3 These approaches all make use of binary fea- and 4, respectively. tures derived from syntactic parses either to indi- We train our models on data from the cate constituency boundaries or to represent full CoNLL-2009 shared task, relying on the same dependency paths. An extreme alternative has features as one of the participating systems been recently proposed in Zhou and Xu (2015), (Bj¨orkelund et al., 2009), and evaluate with the of- who model SRL decisions with a multi-layered ficial scorer. For direct comparison, we rely on LSTM network that takes word sequences as in- the (automatic) syntactic preprocessing informa- put but no syntactic parse information at all. tion provided with the CoNLL test data and com- Our approach falls in between the two extremes: pare our results with the best two systems for each we rely on syntactic parse information but rather language that make use of the same preprocessing than solely making using of sparse binary features, information. we explicitly model dependency paths in a neural The results, summarized in Table 7, indicate network architecture. that PathLSTM performs better than the system by Bj¨orkelund et al. (2009) in all cases. For German Other SRL approaches Within the SRL lit- and Chinese, PathLSTM achieves the best overall erature, recent alternatives to neural network F1-scores of 80.1% and 79.4%, respectively. architectures include sigmoid belief networks (Henderson et al., 2013) as well as low-rank ten- 7 Conclusions sor models (Lei et al., 2015). Whereas Lei et al. We introduced a neural network architecture for only make use of dependency paths as binary in- semantic role labeling that jointly learns embed- dicator features, Henderson et al. propose a joint dings for dependency paths and feature combina- model for syntactic and semantic that tions. Our experimental results indicate that our learns and applies incremental dependency path model substantially increases classification perfor- representations to perform SRL decisions. The lat- mance, leading to new state-of-the-art results. In a ter form of representation is closest to ours, how- qualitive analysis, we found that our model is able ever, we do not build syntactic parses incremen- to cover instances of various linguistic phenomena tally. Instead, we take syntactically preprocessed that are missed by other methods. text as input and focus on the SRL task only. Beyond SRL, we expect dependency path em- Apart from more powerful models, most recent beddings to be useful in related tasks and down- progress in SRL can be attributed to novel fea- stream applications. For instance, our represen- tures. For instance, Deschacht and Moens (2009) tations may be of direct benefit for semantic and and Huang and Yates (2010) use latent variables, discourse parsing tasks. The jointly learned fea- learned with a hidden markov model, as fea- ture space also makes our model a good starting tures for representing words and word se- point for cross-lingual transfer methods that rely quences. Zapirain et al. (2013) propose dif- on feature representation projection to induce new ferent selection preference models in order models (Kozhevnikov and Titov, 2014). to deal with the sparseness of lexical fea- tures. Roth and Woodsend (2014) address the Acknowledgements We thank the three anony- same problem with word embeddings and compo- mous ACL referees whose feedback helped to sitions thereof. Roth and Lapata (2015) recently substantially improve the present paper. The introduced features that model the influence of dis- support of the Deutsche Forschungsgemeinschaft course on role labeling decisions. (Research Fellowship RO 4848/1-1; Roth) and Rather than coming up with completely new the European Research Council (award number features, in this work we proposed to revisit some 681760; Lapata) is gratefully acknowledged. well-known features and represent them in a novel way that generalizes better. Our proposed model References is inspired both by the necessity to overcome the problems of sparse lexico-syntactic features and Wilker Aziz, Miguel Rios, and Lucia Specia. 2011. by the recent success of SRL models based on neu- Shallow semantic trees for smt. In Proceedings of the Sixth Workshop on Statistical Machine Transla- ral networks. tion, pages 316–322, Edinburgh, Scotland. Dependency-based embeddings The idea of Anders Bj¨orkelund, Love Hafdell, and Pierre Nugues. embedding dependency structures has previously 2009. Multilingual semantic role labeling. In Pro- ceedings of the Thirteenth Conference on Compu- been applied to tasks such as relation classifica- tational Natural Language Learning: Shared Task, tion and . Xu et al. (2015) and pages 43–48, Boulder, Colorado. Liu et al. (2015) use neural networks to embed de- Anders Bj¨orkelund, Bernd Bohnet, Love Hafdell, and pendency paths between entity pairs. To identify Pierre Nugues. 2010. A high-performance syn- the relation that holds between two entities, their tactic and semantic dependency parser. In Coling approaches make use of pooling layers that de- 2010: Demonstration Volume, pages 33–36, Beijing, tect parts of a path that indicate a specific rela- China. tion. In contrast, our work aims at modeling an Bernd Bohnet. 2010. Top accuracy and fast depen- individual path as a complete sequence, in which dency parsing is not a contradiction. In Proceedings every item is of relevance. Tai et al. (2015) and of the 23rd International Conference on Computa- tional Linguistics, pages 89–97, Beijing, China. Ma et al. (2015) learn embeddings of dependency structures representing full sentences, in a senti- Wanxiang Che, Zhenghua Li, Yongqiang Li, Yuhang ment classification task. In our model, embed- Guo, Bing Qin, and Ting Liu. 2009. Multilin- gual dependency-based syntactic and semantic pars- dings are learned jointly with other features, and as ing. In Proceedings of the Thirteenth Conference on a result problems that may result from erroneous Computational Natural Language Learning: Shared parse trees are mitigated. Task, pages 49–54, Boulder, Colorado. Ronan Collobert, Jason Weston, L´eon Bottou, Michael Mikhail Kozhevnikov and Ivan Titov. 2014. Cross- Karlen, Koray Kavukcuoglu, and Pavel Kuksa. lingual model transfer using feature representation 2011. Natural language processing (almost) from projection. In Proceedings of the 52nd Annual scratch. The Journal of Machine Learning Re- Meeting of the Association for Computational Lin- search, 12:2493–2537. guistics, pages 579–585, Baltimore, Maryland.

Koen Deschacht and Marie-Francine Moens. 2009. Tao Lei, Yuan Zhang, Llu´ıs M`arquez, Alessandro Mos- Semi-supervised semantic role labeling using the chitti, and Regina Barzilay. 2015. High-order low- Latent Words Language Model. In Proceedings of rank tensors for semantic role labeling. In Proceed- the 2009 Conference on Empirical Methods in Nat- ings of the 2015 Conference of the North Ameri- ural Language Processing, pages 21–29, Singapore. can Chapter of the Association for Computational Linguistics: Human Language Technologies, pages Nicholas FitzGerald, Oscar T¨ackstr¨om, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role 1150–1160, Denver, Colorado. labeling with neural network factors. In Proceed- ings of the 2015 Conference on Empirical Methods Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. in Natural Language Processing, pages 960–970, Joint A* CCG parsing and semantic role labelling. Lisbon, Portugal. In Proceedings of the 2015 Conference on Empiri- cal Methods in Natural Language Processing, pages William Foland and James Martin. 2015. 1444–1454, Lisbon, Portugal. Dependency-based semantic role labeling us- ing convolutional neural networks. In Proceedings Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, of the Fourth Joint Conference on Lexical and and Houfeng Wang. 2015. A dependency-based , pages 279–288, Denver, neural network for relation classification. In Pro- Colorado. ceedings of the 53rd Annual Meeting of the Associ- ation for Computational Linguistics and the 7th In- Daniel Gildea and Daniel Jurafsky. 2002. Automatic ternational Joint Conference on Natural Language labeling of semantic roles. Computational Linguis- Processing, pages 285–290, Beijing, China. tics, 28(3):245–288. Mingbo Ma, Liang Huang, Bowen Zhou, and Bing Xi- Jan Hajiˇc, Massimiliano Ciaramita, Richard Johans- ang. 2015. Dependency-based convolutional neural son, Daisuke Kawahara, Maria Ant`onia Mart´ı, Llu´ıs networks for sentence embedding. In Proceedings M`arquez, Adam Meyers, Joakim Nivre, Sebastian ˇ of the 53rd Annual Meeting of the Association for Pad´o, Jan Stˇep´anek, et al. 2009. The CoNLL-2009 Computational Linguistics and the 7th International shared task: Syntactic and semantic dependencies Joint Conference on Natural Language Processing, in multiple languages. In Proceedings of the Thir- pages 174–179, Beijing, China. teenth Conference on Computational Natural Lan- guage Learning: Shared Task, pages 1–18, Boulder, Derek Monner and James A Reggia. 2012. A general- Colorado. ized LSTM-like training algorithm for second-order James Henderson, Paola Merlo, Ivan Titov, and recurrent neural networks. Neural Networks, 25:70– Gabriele Musillo. 2013. Multilingual joint pars- 83. ing of syntactic and semantic dependencies with a latent variable model. Computational Linguistics, Ahmed Hamza Osman, Naomie Salim, Mo- 39(4):949–998. hammed Salem Binwahlan, Rihab Alteeb, and Albaraa Abuobieda. 2012. An improved plagiarism Sepp Hochreiter and J¨urgen Schmidhuber. 1997. detection scheme based on semantic role labeling. Long short-term memory. Neural Computation, Applied Soft Computing, 12(5):1493–1502. 9(8):1735–1780. Martha Palmer, Daniel Gildea, and Paul Kingsbury. Fei Huang and Alexander Yates. 2010. Open-domain 2005. The Proposition bank: An annotated cor- semantic role labeling by modeling word spans. In pus of semantic roles. Computational Linguistics, Proceedings of the 48th Annual Meeting of the As- 31(1):71–106. sociation for Computational Linguistics, pages 968– 978, Uppsala, Sweden. Merin Paul and Sangeetha Jamal. 2015. An im- Richard Johansson and Pierre Nugues. 2008. The ef- proved SRL based plagiarism detection technique fect of syntactic representation on semantic role la- using sentence ranking. Procedia Computer Sci- beling. In Proceedings of the 22nd International ence, 46:223–230. Conference on Computational Linguistics, pages 393–400, Manchester, United Kingdom. Sameer Pradhan, Kadri Hacioglu, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Se- Atif Khan, Naomie Salim, and Yogan Jaya Kumar. mantic role chunking combining complementary 2015. A framework for multi-document abstrac- syntactic views. In Proceedings of the Ninth Confer- tive summarization based on semantic role labelling. ence on Computational Natural Language Learning, Applied Soft Computing, 30:737–747. pages 217–220, Ann Arbor, Michigan. Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. Be˜nat Zapirain, Eneko Agirre, Llu´ıs M`arquez, and Mi- The importance of syntactic parsing and inference in hai Surdeanu. 2013. Selectional preferences for se- semantic role labeling. Computational Linguistics, mantic role classification. Computational Linguis- 34(2):257–287. tics, 39(3):631–663.

Michael Roth and Mirella Lapata. 2015. Context- Hai Zhao, Wenliang Chen, Jun’ichi Kazama, Kiyotaka aware frame-semantic role labeling. Transactions Uchimoto, and Kentaro Torisawa. 2009. Multi- of the Association for Computational Linguistics, lingual dependency learning: Exploiting rich fea- 3:449–460. tures for tagging syntactic and semantic dependen- cies. In Proceedings of the Thirteenth Confer- Michael Roth and Kristian Woodsend. 2014. Com- ence on Computational Natural Language Learning position of word representations improves seman- (CoNLL 2009): Shared Task, pages 61–66, Boulder, tic role labelling. In Proceedings of the 2014 Con- Colorado. ference on Empirical Methods in Natural Language Jie Zhou and Wei Xu. 2015. End-to-end learning of Processing, pages 407–413, Doha, Qatar. semantic role labeling using recurrent neural net- works. In Proceedings of the 53rd Annual Meet- Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. ing of the Association for Computational Linguistics 2012. Practical bayesian optimization of machine and the 7th International Joint Conference on Nat- learning algorithms. In Advances in Neural Infor- ural Language Processing, pages 1127–1137, Bei- mation Processing Systems, pages 2951–2959, Lake jing, China. Tahoe, Nevada.

Vivek Srikumar and Dan Roth. 2013. Modeling se- mantic relations expressed by prepositions. Trans- actions of the Association for Computational Lin- guistics, 1:231–242.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representa- tions from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meet- ing of the Association for Computational Linguistics and the 7th International Joint Conference on Nat- ural Language Processing, pages 1556–1566, Bei- jing, China.

Kristina Toutanova, Aria Haghighi, and Christopher Manning. 2008. A global joint model for se- mantic role labeling. Computational Linguistics, 34(2):161–191.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Deyi Xiong, Min Zhang, and Haizhou Li. 2012. Mod- eling the translation of predicate-argument structure for smt. In Proceedings of the 50th Annual Meet- ing of the Association for Computational Linguis- tics, pages 902–911, Jeju Island, Korea.

Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest depen- dency paths. In Proceedings of the 2015 Confer- ence on Empirical Methods in Natural Language Processing, pages 1785–1794, Lisbon, Portugal.

Nianwen Xue and Martha Palmer. 2004. Calibrat- ing features for semantic role labeling. In Proceed- ings of the 2004 Conference on Empirical Meth- ods in Natural Language Processing, pages 88–94, Barcelona, Spain.