Syntax Helps ELMo Understand Semantics: Is Syntax Still Relevant in a Deep Neural Architecture for SRL?
Emma Strubell    Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, mccallum}@cs.umass.edu

arXiv:1811.04773v1 [cs.CL] 12 Nov 2018

Abstract

Do unsupervised methods for learning rich, contextualized token representations obviate the need for explicit modeling of linguistic structure in neural network models for semantic role labeling (SRL)? We address this question by incorporating the massively successful ELMo embeddings (Peters et al., 2018) into LISA (Strubell et al., 2018), a strong, linguistically-informed neural network architecture for SRL. In experiments on the CoNLL-2005 shared task we find that though ELMo out-performs typical word embeddings, beginning to close the gap in F1 between LISA with predicted and gold syntactic parses, syntactically-informed models still out-perform syntax-free models when both use ELMo, especially on out-of-domain data. Our results suggest that linguistic structures are indeed still relevant in this golden age of deep learning for NLP.

1 Introduction

Many state-of-the-art NLP models are now "end-to-end" deep neural network architectures which eschew explicit linguistic structures as input in favor of operating directly on raw text (Ma and Hovy, 2016; Lee et al., 2017; Tan et al., 2018). Recently, Peters et al. (2018) proposed a method for unsupervised learning of rich, contextually-encoded token representations which, when supplied as input word representations in end-to-end models, further increased these models' performance by up to 25% across many NLP tasks. The immense success of these linguistically-agnostic models brings into question whether linguistic structures such as syntactic parse trees still provide any additional benefits in a deep neural network architecture for e.g. semantic role labeling (SRL).

In this work, we aim to begin to answer this question by experimenting with incorporating the ELMo embeddings of Peters et al. (2018) into LISA (Strubell et al., 2018), a "linguistically-informed" deep neural network architecture for SRL which, when given weaker GloVe embeddings as inputs (Pennington et al., 2014), has been shown to leverage syntax to out-perform a state-of-the-art, linguistically-agnostic end-to-end SRL model.

In experiments on the CoNLL-2005 English SRL shared task, we find that, while the ELMo representations out-perform GloVe and begin to close the performance gap between LISA with predicted and gold syntactic parses, syntactically-informed models still out-perform syntax-free models, especially on out-of-domain data. Our results suggest that with the right modeling, incorporating linguistic structures can indeed further improve strong neural network models for NLP.

2 Models

We are interested in assessing whether linguistic information is still beneficial in addition to deep, contextualized ELMo word embeddings in a neural network model for SRL. Towards this end, our base model for experimentation is Linguistically-Informed Self-Attention (LISA) (Strubell et al., 2018), a deep neural network model which uses multi-head self-attention in the style of Vaswani et al. (2017) for multi-task learning (Caruana, 1993) across SRL, predicate detection, part-of-speech tagging and syntactic parsing. Syntax is incorporated by training one self-attention head to attend to each token's syntactic head, allowing it to act as an oracle providing syntactic information to further layers used to predict semantic roles. We summarize the key points of LISA in §2.1.

Strubell et al. (2018) showed that LISA out-performs syntax-free models when both use GloVe word embeddings as input, which, due to their availability, size and large training corpora, are typically used as input to end-to-end NLP models. In this work, we replace those token representations with ELMo representations to assess whether ELMo embeddings are sufficiently rich to obviate the need for explicit representations of syntax, or whether the model still benefits from syntactic information in addition to the rich ELMo encodings. In §2.2 and §2.3 we summarize how GloVe and ELMo embeddings, respectively, are incorporated into this model.

2.1 LISA SRL model

2.1.1 Neural network token encoder

The input to LISA is a sequence X of T token representations x_t. The exact form of these representations when using GloVe embeddings is described in §2.2, and for ELMo in §2.3. Following Vaswani et al. (2017), we add a sinusoidal positional encoding to these vectors, since the self-attention has no inherent mechanism for modeling token position.

These token representations are supplied to a series of J multi-head self-attention layers similar to those that make up the encoder model of Vaswani et al. (2017). We denote the j-th layer with the function S^{(j)}(·) and the output of that layer for token t as s_t^{(j)}:

    s_t^{(j)} = S^{(j)}(s_t^{(j-1)})    (1)

Each S^{(j)}(·) consists of two components: (a) multi-head self-attention and (b) a convolutional layer. For brevity, we detail (a) here, as it is how we incorporate syntax into the model, and refer the reader to Strubell et al. (2018) for more details on (b).

The multi-head self-attention consists of H attention heads, each of which learns a distinct attention function to attend to all of the tokens in the sequence. This self-attention is performed for each token for each head, and the results of the H self-attentions are concatenated to form the final self-attended representation for each token.

Specifically, consider the matrix S^{(j-1)} of T token representations at layer j-1. For each attention head h, we project this matrix into distinct key, value and query representations K_h^{(j)}, V_h^{(j)} and Q_h^{(j)} of dimensions T×d_k, T×d_v and T×d_q, respectively. We can then multiply Q_h^{(j)} by K_h^{(j)} to obtain a T×T matrix of attention weights A_h^{(j)} between each pair of tokens in the sentence. Following Vaswani et al. (2017) we perform scaled dot-product attention: We scale the weights by the inverse square root of their embedding dimension and normalize with the softmax function to produce a distinct distribution for each token over all the tokens in the sentence:

    A_h^{(j)} = softmax(d_k^{-0.5} Q_h^{(j)} K_h^{(j)⊤})    (2)

These attention weights are then multiplied by V_h^{(j)} for each token to obtain the self-attended token representations M_h^{(j)}:

    M_h^{(j)} = A_h^{(j)} V_h^{(j)}    (3)

Row t of M_h^{(j)}, the self-attended representation for token t at layer j, is thus the weighted sum with respect to t (given by A_h^{(j)}) over the token representations in V_h^{(j)}. The representations for each attention head are concatenated, and this representation is fed through a convolutional layer to produce s_t^{(j)}. In all of our models, we use J = 4, H = 8 and d_k = d_q = d_v = 64.
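To make Equations (2) and (3) concrete, the following is a minimal NumPy sketch of one multi-head self-attention layer. It is an illustration rather than LISA's implementation: the projection matrices, the model dimension d_model = 512, and the random inputs are placeholders we introduce here, and the positional encoding and the convolutional layer that produces s_t^{(j)} are omitted.

    import numpy as np

    def softmax(x, axis=-1):
        # Row-wise softmax: one distribution per token over all tokens.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention_layer(S_prev, heads, d_k=64):
        """S_prev: T x d_model representations from layer j-1.
        heads: one (W_k, W_v, W_q) projection triple per attention head."""
        outputs = []
        for W_k, W_v, W_q in heads:
            K = S_prev @ W_k                          # T x d_k
            V = S_prev @ W_v                          # T x d_v
            Q = S_prev @ W_q                          # T x d_q
            A = softmax((Q @ K.T) * d_k ** -0.5)      # Eq. (2): T x T attention weights
            outputs.append(A @ V)                     # Eq. (3): self-attended tokens M_h
        # Concatenate the H heads; LISA would then apply a convolutional layer.
        return np.concatenate(outputs, axis=-1)

    # Toy usage with T = 5 tokens and the paper's H = 8, d_k = d_q = d_v = 64.
    rng = np.random.default_rng(0)
    T, d_model, H, d_k = 5, 512, 8, 64
    heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(H)]
    print(self_attention_layer(rng.normal(size=(T, d_model)), heads).shape)  # (5, 512)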
2.1.2 Incorporating syntax

LISA incorporates syntax by training one attention head to attend to each token's parent in a syntactic dependency parse tree. At layer j_p, H − 1 heads are left to learn on their own to attend to relevant tokens in the sentence, while one head h_p is trained with an auxiliary objective which encourages the head to put all attention weight on each token's syntactic parent. Denoting the entry of A_{h_p}^{(j_p)} corresponding to the attention from token t to token q as a_{tq}, we model the probability that q is the head of t as: P(q = head(t) | X) = a_{tq}. Trained in this way, A_{h_p}^{(j_p)} emits a directed graph, where each token's syntactic parent is that which is assigned the highest attention weight. During training, this head's attention weights are set to match the gold parse: A_{h_p}^{(j_p)} is set to the adjacency matrix of the parse tree,[1] allowing downstream layers to learn to use the parse information throughout training. In our experiments we set j_p = 3.

In this way, LISA is trained to use A_{h_p}^{(j_p)} as an oracle providing parse information to downstream layers. This representation is flexible, allowing LISA to use its own predicted parse, or a parse produced by another dependency parser. Since LISA is trained to leverage gold parse information, as higher-accuracy dependency parses become available, they can be provided to LISA to improve SRL without requiring re-training of the SRL model.

[1] Roots are represented by self-loops.
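The sketch below is our own simplification of this mechanism, not the authors' code: it builds the gold adjacency matrix (with root self-loops, per footnote 1) and computes one plausible form of the auxiliary parent-prediction objective, the negative log of a_{tq} at each token's gold parent; the exact objective is specified in Strubell et al. (2018).

    import numpy as np

    def gold_adjacency(parents):
        """One-hot attention targets for head h_p; roots are self-loops (footnote 1).
        parents[t] is token t's parent index, or -1 for the root."""
        T = len(parents)
        A = np.zeros((T, T))
        for t, p in enumerate(parents):
            A[t, p if p >= 0 else t] = 1.0
        return A

    def parse_loss(A_hp, parents):
        """Auxiliary objective (one plausible form): mean -log a_tq at each
        token's gold parent, i.e. -log P(q = head(t) | X)."""
        T = len(parents)
        gold = [p if p >= 0 else t for t, p in enumerate(parents)]
        return -np.log(A_hp[np.arange(T), gold] + 1e-12).mean()

    # Toy 4-token sentence: token 1 is the root; tokens 0 and 2 attach to 1, token 3 to 2.
    parents = [1, -1, 1, 2]
    A_pred = np.full((4, 4), 0.25)        # head h_p currently attends uniformly
    print(parse_loss(A_pred, parents))    # ~1.386; training pushes weight onto parents

    # During training the gold parse is injected: downstream layers receive
    # gold_adjacency(parents) in place of A_pred as the matrix multiplied by V in Eq. (3);
    # at test time A_pred, or an external parser's adjacency matrix, is used instead.
    A_train = gold_adjacency(parents)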
2.1.3 Predicting POS and predicates

LISA is also trained to predict parts-of-speech and predicates using hard parameter sharing (Caruana, 1993). At layer j_pos, the token representation s_t^{(j_pos)} is provided as features for a multi-class classifier into the joint label space of part-of-speech and (binary) predicate labels: For each part-of-speech tag which is the tag for a predicate in the training data, we add a tag of the form TAG:PREDICATE.

2.2 GloVe embedding model

The GloVe word embedding model (Pennington et al., 2014), like the skip-gram and CBOW (Mikolov et al., 2013) algorithms, is a shallow, log-bilinear embedding model for learning unsupervised representations of words based on the intuition that words which occur in similar contexts should have similar representations. GloVe vectors are learned for each word in a fixed vocabulary by regressing on entries in the word co-occurrence matrix constructed from a large corpus: The dot product between two words' embeddings should equal the log probability of the words' co-occurrence in the data. We refer the reader to Pennington et al. (2014) for a more detailed description of the model.

We incorporate pre-trained GloVe embeddings into our model following Strubell et al. (2018): We fix the pre-trained embeddings and add a learned word embedding representation to the pre-trained word vectors, following the intuition that fixing the pre-trained embeddings and learning a residual word representation keeps words observed during
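A minimal sketch of this fixed-plus-residual input scheme follows, under the assumption that the learned residual table has the same shape as the frozen GloVe matrix and is zero-initialized; the toy vocabulary, the 100-dimensional size, and the random stand-in vectors are placeholders, and the training loop that updates only the residual table is omitted.

    import numpy as np

    class ResidualWordEmbedding:
        """Fixed pre-trained vectors plus a learned residual embedding per word."""
        def __init__(self, pretrained, vocab):
            self.pretrained = pretrained                  # |V| x d, never updated
            self.residual = np.zeros_like(pretrained)     # |V| x d, updated during training
            self.index = {w: i for i, w in enumerate(vocab)}

        def lookup(self, tokens):
            ids = [self.index[w] for w in tokens]
            # Token representation = frozen pre-trained vector + learned residual.
            return self.pretrained[ids] + self.residual[ids]

    # Toy usage: a 3-word vocabulary with random stand-ins for 100-d GloVe vectors.
    vocab = ["the", "cat", "sat"]
    glove = np.random.default_rng(0).normal(size=(len(vocab), 100))
    emb = ResidualWordEmbedding(glove, vocab)
    x_t = emb.lookup(["the", "cat", "sat"])   # 3 x 100 token inputs x_t to LISA's encoder
    print(x_t.shape)                          # (3, 100)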