An Improved Neural Network Model for Joint POS Tagging and Dependency Parsing
Dat Quoc Nguyen and Karin Verspoor
School of Computing and Information Systems
The University of Melbourne, Australia
{dqnguyen, [email protected]}

Abstract

We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn treebank, our model obtains strong UAS and LAS scores of 94.51% and 92.87%, respectively, producing 1.5+% absolute improvements over the BIST graph-based parser, and also obtaining a state-of-the-art POS tagging accuracy of 97.97%. Furthermore, experimental results on parsing 61 "big" Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with a 0.8% higher average POS tagging score and a 3.6% higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models at: https://github.com/datquocnguyen/jPTDP.

1 Introduction

Dependency parsing – a key research topic in natural language processing (NLP) in the last decade (Buchholz and Marsi, 2006; Nivre et al., 2007a; Kübler et al., 2009) – has also been demonstrated to be extremely useful in many applications such as relation extraction (Culotta and Sorensen, 2004; Bunescu and Mooney, 2005), semantic parsing (Reddy et al., 2016) and machine translation (Galley and Manning, 2009). In general, dependency parsing models can be categorized as graph-based (McDonald et al., 2005) or transition-based (Yamada and Matsumoto, 2003; Nivre, 2003). Most traditional graph- or transition-based models define a set of core and combined features (McDonald and Pereira, 2006; Nivre et al., 2007b; Bohnet, 2010; Zhang and Nivre, 2011), while recent state-of-the-art models propose neural network architectures to handle the feature engineering (Dyer et al., 2015; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Ma and Hovy, 2017).

Most traditional and neural network-based parsing models use automatically predicted POS tags as essential features. However, POS taggers are not perfect, resulting in error propagation problems. Some work has attempted to avoid using POS tags for dependency parsing (Dyer et al., 2015; Ballesteros et al., 2015; de Lhoneux et al., 2017); however, to achieve the strongest parsing scores these methods still require automatically assigned POS tags. Alternatively, joint POS tagging and dependency parsing has also attracted a lot of attention in the NLP community, as it can help improve both tagging and parsing results over independent modeling (Li et al., 2011; Hatori et al., 2011; Lee et al., 2011; Bohnet and Nivre, 2012; Zhang et al., 2015; Zhang and Weiss, 2016; Yang et al., 2018).

In this paper, we present a novel neural network-based model for jointly learning POS tagging and dependency parsing. Our joint model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) with an additional lower-level BiLSTM-based tagging component. In particular, this tagging component generates predicted POS tags for the parser component.
Evaluated on the benchmark English Penn treebank (test Section 23), our model produces a 1.5+% absolute improvement over the BIST graph-based parser, with a strong UAS score of 94.51% and LAS score of 92.87%, while also obtaining a state-of-the-art POS tagging accuracy of 97.97%. In addition, multilingual parsing experiments from raw texts on 61 "big" Universal Dependencies treebanks (Zeman et al., 2018) show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with a 0.8% higher average POS tagging score, a 3.1% higher UAS and a 3.6% higher LAS. Furthermore, experimental results on downstream task applications (Fares et al., 2018) show that our joint model helps produce state-of-the-art scores for biomedical event extraction and opinion analysis.

Figure 1: Illustration of our new model for joint POS tagging and graph-based dependency parsing.

2 Our joint model

This section presents our model for joint POS tagging and graph-based dependency parsing. Figure 1 illustrates the architecture of our joint model, which can be viewed as a two-component mixture of a tagging component and a parsing component. Given the word tokens in an input sentence, the tagging component uses a BiLSTM to learn "latent" feature vectors representing these word tokens. The tagging component then feeds these feature vectors into a multilayer perceptron with one hidden layer (MLP) to predict POS tags. The parsing component then uses another BiLSTM to learn another set of latent feature representations, based on both the input word tokens and the predicted POS tags. These latent feature representations are fed into one MLP to decode dependency arcs and into another MLP to label the predicted dependency arcs.

2.1 Word vector representation

Given an input sentence s consisting of n word tokens w_1, w_2, ..., w_n, we represent each i-th word w_i in s by a vector e_i. We obtain e_i by concatenating the word embedding e^{(W)}_{w_i} and the character-level word embedding e^{(C)}_{w_i}:

e_i = e^{(W)}_{w_i} \circ e^{(C)}_{w_i}    (1)

Here, each word type w in the training data is represented by a real-valued word embedding e^{(W)}_{w}. Given a word type w consisting of k characters, w = c_1 c_2 ... c_k, where each j-th character in w is represented by a character embedding c_j, we use a sequence BiLSTM (BiLSTM_seq) to learn its character-level vector representation (Ballesteros et al., 2015; Plank et al., 2016). The input to BiLSTM_seq is the sequence of k character embeddings c_{1:k}, and the output is the concatenation of the outputs of a forward LSTM (LSTM_f) reading the input in its regular order and a reverse LSTM (LSTM_r) reading the input in reverse:

e^{(C)}_{w} = \mathrm{BiLSTM_{seq}}(c_{1:k}) = \mathrm{LSTM_f}(c_{1:k}) \circ \mathrm{LSTM_r}(c_{k:1})
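As a concrete illustration, the following is a minimal PyTorch-style sketch of this word vector representation: a word embedding lookup concatenated with a character-level BiLSTM encoding, as in Equation (1). The module names and dimensions here are illustrative assumptions; the released jPTDP implementation is written in DyNet and differs in detail.

```python
# Minimal sketch of Section 2.1 (word + character-level embeddings).
# Dimensions (word_dim, char_dim, char_lstm_dim) are illustrative assumptions.
import torch
import torch.nn as nn


class WordRepresentation(nn.Module):
    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=100, char_dim=50, char_lstm_dim=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)        # e^(W)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)   # c_j
        # Sequence BiLSTM over characters (BiLSTM_seq); the final forward and
        # backward states are concatenated to give the character-level embedding e^(C).
        self.char_lstm = nn.LSTM(char_dim, char_lstm_dim,
                                 bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids_per_word):
        # word_ids: (n,) tensor of word indices for one sentence
        # char_ids_per_word: list of n tensors, each holding the character indices of one word
        word_vecs = self.word_emb(word_ids)                        # (n, word_dim)
        char_vecs = []
        for chars in char_ids_per_word:
            c = self.char_emb(chars).unsqueeze(0)                  # (1, k, char_dim)
            _, (h_n, _) = self.char_lstm(c)                        # h_n: (2, 1, char_lstm_dim)
            # Concatenate forward and backward final states: e^(C)_w
            char_vecs.append(torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1))
        char_vecs = torch.stack(char_vecs)                         # (n, 2 * char_lstm_dim)
        # Equation (1): e_i = e^(W)_{w_i} ∘ e^(C)_{w_i}
        return torch.cat([word_vecs, char_vecs], dim=-1)
```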
2.2 Tagging component

We feed the sequence of vectors e_{1:n}, with an additional context position index i, into another BiLSTM (BiLSTM_pos), resulting in latent feature vectors v^{(pos)}_i, each representing the i-th word w_i in s:

v^{(pos)}_i = \mathrm{BiLSTM_{pos}}(e_{1:n}, i)    (2)

We use an MLP with softmax output (MLP_pos) on top of BiLSTM_pos to predict the POS tag of each word in s. The number of nodes in the output layer of this MLP_pos is the number of POS tags. Given v^{(pos)}_i, we compute an output vector as:

\vartheta_i = \mathrm{MLP_{pos}}(v^{(pos)}_i)    (3)

Based on the output vectors \vartheta_i, we then compute the cross-entropy objective loss L_POS(\hat{t}, t), in which \hat{t} and t are the sequence of predicted POS tags and the sequence of gold POS tags of the words in the input sentence s, respectively (Goldberg, 2016). Our tagging component can thus be viewed as a simplified version of the POS tagging model proposed by Plank et al. (2016), without their additional auxiliary loss for rare words.

2.3 Parsing component

Assume that p_1, p_2, ..., p_n are the predicted POS tags produced by the tagging component for the input words. We represent each i-th predicted POS tag by a vector embedding e^{(P)}_{p_i}. Each i-th word is then represented by the concatenation x_i = e_i \circ e^{(P)}_{p_i}, and the sequence x_{1:n} is fed into another BiLSTM (BiLSTM_dep), resulting in latent feature vectors:

v_i = \mathrm{BiLSTM_{dep}}(x_{1:n}, i)    (5)

Following a standard arc-factored parsing approach (Kübler et al., 2009), we score an arc by using an MLP with a one-node output layer (MLP_arc) on top of BiLSTM_dep:

\mathrm{score_{arc}}(i, j) = \mathrm{MLP_{arc}}\big(v_i \circ v_j \circ (v_i \ast v_j) \circ |v_i - v_j|\big)    (6)

where (v_i \ast v_j) and |v_i - v_j| denote the element-wise product and the absolute element-wise difference, respectively, and v_i and v_j are the latent feature vectors associated with the i-th and j-th words in s, computed by Equation 5. Given the arc scores, we use the decoding algorithm of Eisner (1996) to find the highest-scoring projective parse tree:

\mathrm{score}(s) = \operatorname*{argmax}_{\hat{y} \in \mathcal{Y}(s)} \sum_{(h,m) \in \hat{y}} \mathrm{score_{arc}}(h, m)    (7)

where \mathcal{Y}(s) is the set of all possible dependency trees for the input sentence s, while score_arc(h, m) measures the score of the arc between the h-th (head) word and the m-th (modifier) word in s.

Following Kiperwasser and Goldberg (2016), we compute a margin-based hinge loss L_ARC with loss-augmented inference to maximize the margin between the gold unlabeled parse tree and the highest-scoring incorrect tree.

For predicting the dependency relation type of a head-modifier arc, we use another MLP with softmax output (MLP_rel) on top of BiLSTM_dep. Here, the number of nodes in the output layer of this MLP_rel is the number of dependency relation types. Given an arc (h, m), we compute an output vector as:

v_{(h,m)} = \mathrm{MLP_{rel}}\big(v_h \circ v_m \circ (v_h \ast v_m) \circ |v_h - v_m|\big)    (8)
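To make the joint architecture concrete, below is a minimal PyTorch-style sketch of the tagging component (Section 2.2) together with the pairwise arc and relation scorers of Equations (6) and (8). It is an illustrative reimplementation under assumed dimensions, not the authors' DyNet code; the margin-based hinge loss, loss-augmented inference and the Eisner decoder are omitted.

```python
# Sketch of Sections 2.2-2.3: BiLSTM tagger feeding predicted-tag embeddings
# into a second BiLSTM, whose states are scored pairwise by MLP_arc / MLP_rel.
import torch
import torch.nn as nn


class JointTaggerParserScorer(nn.Module):
    def __init__(self, input_dim, n_pos_tags, n_rel_types,
                 pos_emb_dim=50, lstm_dim=128, mlp_dim=100):
        super().__init__()
        # Tagging component: BiLSTM_pos followed by MLP_pos with softmax output.
        self.bilstm_pos = nn.LSTM(input_dim, lstm_dim,
                                  bidirectional=True, batch_first=True)
        self.mlp_pos = nn.Sequential(nn.Linear(2 * lstm_dim, mlp_dim), nn.Tanh(),
                                     nn.Linear(mlp_dim, n_pos_tags))
        # Embeddings e^(P) of the predicted POS tags, consumed by the parser.
        self.pos_emb = nn.Embedding(n_pos_tags, pos_emb_dim)
        # Parsing component: BiLSTM_dep over the concatenation [e_i ∘ e^(P)_{p_i}].
        self.bilstm_dep = nn.LSTM(input_dim + pos_emb_dim, lstm_dim,
                                  bidirectional=True, batch_first=True)
        # Pair features of Equations (6)/(8): v_i ∘ v_j ∘ (v_i * v_j) ∘ |v_i - v_j|.
        pair_dim = 4 * (2 * lstm_dim)
        # MLP_arc has a one-node output layer; MLP_rel has one node per relation type.
        self.mlp_arc = nn.Sequential(nn.Linear(pair_dim, mlp_dim), nn.Tanh(),
                                     nn.Linear(mlp_dim, 1))
        self.mlp_rel = nn.Sequential(nn.Linear(pair_dim, mlp_dim), nn.Tanh(),
                                     nn.Linear(mlp_dim, n_rel_types))

    def forward(self, e):
        # e: (1, n, input_dim) word vectors e_1..e_n for a single sentence.
        v_pos, _ = self.bilstm_pos(e)                    # Equation (2)
        pos_logits = self.mlp_pos(v_pos)                 # Equation (3); softmax applied in the loss
        pred_pos = pos_logits.argmax(dim=-1)             # predicted tags fed to the parser
        x = torch.cat([e, self.pos_emb(pred_pos)], dim=-1)
        v, _ = self.bilstm_dep(x)                        # latent vectors v_i (Equation 5)
        n = v.size(1)
        # Build the pair-feature tensor for every head/modifier candidate (i, j).
        vi = v[0].unsqueeze(1).expand(n, n, -1)
        vj = v[0].unsqueeze(0).expand(n, n, -1)
        pair = torch.cat([vi, vj, vi * vj, (vi - vj).abs()], dim=-1)
        arc_scores = self.mlp_arc(pair).squeeze(-1)      # (n, n): score_arc(i, j), Eq. (6)
        rel_scores = self.mlp_rel(pair)                  # (n, n, n_rel_types), Eq. (8)
        return pos_logits, arc_scores, rel_scores
```

In training, the POS cross-entropy loss over pos_logits and the parsing losses over arc_scores and rel_scores would be combined into a single joint objective, which is the design choice that lets the tagging and parsing components share errors and improve together.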