
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Improving Sequence-to-Sequence Constituency Parsing

Lemao Liu, Muhua Zhu, Shuming Shi Tencent AI Lab, Shenzhen, China {redmondliu,muhuazhu, shumingshi}@tencent.com

Abstract

Sequence-to-sequence constituency parsing casts the tree-structured prediction problem as a general sequential problem by top-down tree linearization, and thus it is very easy to train in parallel with distributed facilities. Despite its success, it relies on a probabilistic attention mechanism for a general purpose, which cannot guarantee that the selected context is informative in the specific parsing scenario. Previous work introduced a deterministic attention to select informative context for sequence-to-sequence parsing, but it is based on the bottom-up linearization, even though top-down linearization has been observed to be better than bottom-up linearization for standard sequence-to-sequence constituency parsing. In this paper, we therefore extend the deterministic attention to operate directly on the top-down tree linearization. Intensive experiments show that our parser delivers substantial improvements over the bottom-up linearization in accuracy, and it achieves 92.3 F-score on the Penn English Treebank section 23 and 85.4 F-score on the Penn Chinese Treebank test dataset, without reranking or semi-supervised training.

Figure 1: Top-down and bottom-up linearization of a parse tree in sequence-to-sequence constituency parsing. The input sequence x is the leaves of the parse tree at the top, and the output is the linearized sequence y at the bottom. A dashed line indicates the relation between x_i and "XX".

Introduction

Constituency parsing is a fundamental task in natural language processing, and it plays an important role in downstream applications such as machine translation (Galley et al. 2004; 2006) and semantic analysis (Rim, Seo, and Simmons 1990; Manning, Schütze, and others 1999). Over the decades, feature-rich linear models had been dominant in constituency parsing (Petrov and Klein 2007; Zhu et al. 2013), but they are not good at capturing long-distance dependencies due to feature sparsity. Recurrent neural networks have the advantage of addressing this issue, and recently there has been much work on recurrent neural models for constituency parsing (Vinyals et al. 2015; Watanabe and Sumita 2015; Dyer et al. 2016; Cross and Huang 2016).

In particular, sequence-to-sequence parsing (Vinyals et al. 2015) has been increasingly popular. Its basic idea is to linearize a parse tree into a sequence in a top-down manner (see Figure 1) and then transform parsing into a standard sequence-to-sequence learning task. The main technique inside sequence-to-sequence parsing is a probabilistic attention mechanism, which aligns an output token to input tokens to select relevant context for better prediction, as shown in Figure 2(a). This parsing model is general and easy to understand; in particular, it runs in a sequential manner and thus is easy to parallelize with GPUs. However, the probabilistic attention cannot guarantee that the selected context is informative enough to yield satisfactory outputs. As a result, its accuracy is only comparable to the feature-rich linear models (Petrov and Klein 2007; Zhu et al. 2013), especially given that it utilizes global context.

Ma et al. (2017) proposed a deterministic attention for sequence-to-sequence parsing, which defines the alignments between output and input tokens in a deterministic manner to select the relevant context. This method was able to select better context than probabilistic attention for parsing. However, their approach was conducted on the bottom-up linearization (see its linearized sequence in Figure 1), and it requires binarizing a parse tree, which induces the issue of ambiguity: different binarized trees may lead to the same tree. In addition, the bottom-up linearization lacks top-down guidance such as lookahead information, which has been proved to be useful for better prediction (Roark and Johnson 1999; Liu and Zhang 2017b). As a result, their parser is still worse than the state-of-the-art parsers in accuracy.
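To make the top-down linearization of Figure 1 concrete, the sketch below produces the bracketed output sequence from a small nested-tuple tree. The helper name, the (label, children) tree encoding, and the example tree are ours for illustration; the token format — labeled opening and closing brackets plus the placeholder "XX" for every input word — follows the description in this paper.

```python
def linearize_top_down(tree):
    """Top-down linearization, e.g. '(S (NP XX )NP ... )S'.

    `tree` is either a word (str) or a (label, children) pair such as
    ("S", [("NP", ["John"]), ...]).  Every word is emitted as the
    placeholder token "XX", mirroring Figure 1.
    """
    if isinstance(tree, str):            # a leaf, i.e. an input word
        return ["XX"]
    label, children = tree
    tokens = ["(" + label]               # open the constituent
    for child in children:
        tokens.extend(linearize_top_down(child))
    tokens.append(")" + label)           # close the constituent
    return tokens

# A hypothetical tree over "John has a dog ." (structure assumed):
tree = ("S", [("NP", ["John"]),
              ("VP", ["has", ("NP", ["a", "dog"])]),
              "."])
print(" ".join(linearize_top_down(tree)))
# (S (NP XX )NP (VP XX (NP XX XX )NP )VP XX )S
```

Note that the k-th "XX" in the output always corresponds to the k-th input word, which is the property the deterministic attention described below exploits.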


(a). Probabilistic attention (b). Deterministic attention

Figure 2: The figures for probabilistic attention (a) and deterministic attention (b). At timestep t = 5, y_{<5} is available but y_t is not and will be predicted next using context selected by attention. Probabilistic attention aligns y_5 to all input tokens according to a distribution α_t, shown with dotted arrow lines, while deterministic attention aligns y_5 to the phrase P(1, 3), the semi-phrase P(3, ?) and x_3 in a deterministic manner, shown with solid arrow lines.
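The two panels differ only in how the context for predicting y_5 is chosen. For the probabilistic side (panel (a)), the sketch below is a minimal NumPy version of soft attention over the encoder states; the dot-product score is one common choice and is an assumption here, not necessarily the score function used by the baseline parsers.

```python
import numpy as np

def probabilistic_attention(Ex, h_t):
    """Soft attention over all encoder states, as in Figure 2(a).

    Ex:  (|x|, d) matrix of encoder states E^x_1 ... E^x_|x|.
    h_t: (d,)    decoder hidden state at timestep t.
    Returns the alignment distribution alpha_t and the context vector c_t.
    """
    scores = Ex @ h_t                          # one score per input position
    scores = scores - scores.max()             # numerical stability
    alpha_t = np.exp(scores) / np.exp(scores).sum()
    c_t = alpha_t @ Ex                         # weighted sum of encoder states
    return alpha_t, c_t

# toy usage: 5 input positions, hidden dimension 4
Ex = np.random.randn(5, 4)
h_t = np.random.randn(4)
alpha_t, c_t = probabilistic_attention(Ex, h_t)
```

The deterministic attention of panel (b) instead picks out specific positions (a phrase, a semi-phrase, and a word) rather than a soft mixture over all of them.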

In this paper, therefore, we aim to explore the deterministic attention directly on top of the top-down linearization, with the expectation of improving sequence-to-sequence constituency parsing. The proposed deterministic attention is inspired by the following intuition. When linearizing a parse tree in a top-down manner, it is clear that each token "XX" represents a known word on the input side, indicated by a dashed line in Figure 1, and thus this output token might deterministically align to that specific token on the input side rather than stochastically align to all input tokens. Respecting this intuition, we analyze the ideal alignment situations at each decoding timestep and propose a general deterministic attention criterion for context selection. Under this criterion, we propose some simple instances to deterministically specify the alignments between input and output tokens (§3). Since our deterministic attention sequentially represents alignments for a given parse tree, its training still proceeds in a sequential manner and thus is easy to parallelize, as standard sequence-to-sequence parsing is. Empirical experiments demonstrate that the resulting deterministic attention on top-down linearization achieves substantial gains over the model in Ma et al. (2017). Furthermore, with the help of ensembling, the proposed parser is competitive with state-of-the-art RNN parsers (Dyer et al. 2016; Stern, Andreas, and Klein 2017), which require maintaining tree structures and thus are not easy to parallelize for training.

This paper makes the following contributions:

• It analyzes the deterministic attention for sequence-to-sequence parsing on top of top-down linearization, and …

Sequence-to-Sequence Parsing

Suppose x = x_1, x_2, ..., x_{|x|} denotes an input sequence with length |x|, and y = y_1, y_2, ..., y_{|y|} denotes the output sequence, which represents a linearized parse tree of x via a linearization method such as top-down linearization. In Figure 1, x is the sequence John, has, a, dog, . and y is its linearized tree sequence (S, (NP, XX, ..., XX, )S.

Generally, sequence-to-sequence constituency parsing directly maps an input sequence to its linearized parse tree sequence by using a neural machine translation (NMT) model (Vinyals et al. 2015; Bahdanau, Cho, and Bengio 2014). NMT relies on recurrent neural networks under the encode-decode framework, which includes two stages. In the encoding stage, it applies recurrent neural networks to represent x as a sequence of vectors:

    h_i = f(e_{x_i}, h_{i-1}),

where the hidden unit h_i is a vector with dimension d at timestep i, f is a recurrent function such as an LSTM or GRU, and e_{x_i} denotes the word embedding of x_i. Suppose the encoding sequence is denoted by E^x = E^x_1, E^x_2, ..., E^x_{|x|}. Then each E^x_i can be set as h_i from a reversed recurrent neural network (Vinyals et al. 2015) or as the concatenation of the hidden units from bidirectional recurrent neural networks (Ma et al. 2017).

In the decoding stage, it generates a linearized sequence from the conditional probability distribution defined by a recurrent neural network as follows:

    p(y | x; θ) = ∏_{t=1}^{|y|} p(y_t | y_{<t}, x; θ),

… is a projection function mapping into a vector with the dimension of the output vocabulary size V; [i] denotes the i-th component of a vector; c_t is a context vector at timestep t; and h_t is a hidden unit specified by an RNN unit f′, similar to the f defined in the encoding stage.

As shown in Eq.(1), c_t is used not only to update the hidden unit h_t in f′ but also to predict y_t in the softmax, and thus it is critical for NMT. In order to capture informative information in the context c_t, NMT employs a probabilistic attention, which aligns the token y_t to tokens in x. In detail, at each timestep …

… starting at "has", and thus y_5 may align to this noun phrase. As the stopping position of this phrase is unknown, we call it a semi-phrase, represented by P(b_t, ?) throughout this paper. In this case, b_5 = 2.

The above analysis describes the exact case in the training stage, where the entire tree sequence y is given as in Figure 1. However, in the testing stage the entire y is not available but is generated incrementally, starting from the beginning timestep t = 1, as in Figure 2(b). Suppose in the testing stage the decoder has already generated the tokens y_{<t} …

The context function φ in Eq.(4) is actually defined on x_{b_t}, x_{s_t-1} and x_{s_t}. As x_{s_t} and x_{s_t-1} are adjacent in the E^x derived from RNNs, E^x_{s_t} may contain some information from x_{s_t-1}. We thereby reduce the context function so that it depends on the two words x_{b_t} and x_{s_t} as follows:

    φ = θ_c · [E^x_{b_t}; E^x_{s_t}]    (5)

Additionally, since s_t is the stopping position of P(b_t, s_t) and the beginning position of P(s_t, ?), E^x_{s_t} encodes some information of both. We further reduce the dependency of φ to only one word x_{s_t} as follows:

    φ = θ_c · E^x_{s_t}    (6)

Please note that our representations for a phrase and a semi-phrase are defined on their boundaries rather than their entire structures; in particular, we do not consider the stopping boundary for a semi-phrase at all. Thus there may be further improvements possible in their representations. For example, Cross and Huang (2016) and Stern, Andreas, and Klein (2017) employ more sophisticated methods to represent a phrase in syntactic parsing. In addition, we simply employ a linear layer to define φ, and it would be promising to adopt a GRU layer to summarize this context, as shown in Tu et al. (2017a). As the above simple method works well in our experiments, we leave these advanced methods as future work.

Detailed Model

In the encoding of our model, we define the representation of x_i using three different types of word embeddings, similar to Dyer et al. (2015) and Liu and Zhang (2017a). We define e_{x_i} as the concatenation of three embeddings:

    e_{x_i} = [ê_{x_i}; ê_{p_i}; ē_{x_i}],

where ê_{x_i} is a word embedding of x_i, ê_{p_i} is an embedding of the POS tag of x_i, and ē_{x_i} is a pretrained embedding of x_i. Both ê_{x_i} and ê_{p_i} are randomly initialized and tuned on the training data, while ē_{x_i} is fixed as a constant during training. As for the word embedding of y_t on the output side, we simply consider it a variable and tune it during training.

Until now, we have completed the definition of our parsing model: if one substitutes Eq.(3), with φ specified by Eq.(4), Eq.(5) or Eq.(6), into Eq.(1), one obtains its entire definition.

Note that our model is built on deterministic attention, and it requires figuring out, at each timestep, the b_t and s_t to which our context function attends. Fortunately, we can still maintain a complexity similar to the standard sequence-to-sequence parsing model (Vinyals et al. 2015) in both the training and testing stages. In the training stage, for a given x and y pair, we can calculate the b_t and s_t at each timestep t in advance and then store them as an additional input sequence to calculate the log probability p(y | x). In this way, our neural architecture does not maintain the tree structure, and our training procedure still proceeds in a sequential manner, which is easy to parallelize. This advantage distinguishes it from the models in Dyer et al. (2016) and Stern, Andreas, and Klein (2017), which require explicitly maintaining the tree structure in the neural networks. In the testing stage¹, although we have to calculate both b_t and s_t on the fly, the additional complexity is negligible compared with the calculation of neural models involving large matrix multiplications.

¹ Compared with the testing stage, training efficiency is the bottleneck for neural models in practice, because training usually takes several days.
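The exact criteria defining b_t and s_t come from the alignment analysis above. As a minimal sketch, under our own reading of Figure 2(b) — s_t is the position of the next input word still to be covered, i.e. one plus the number of "XX" tokens already generated, and b_t is the input position at which the innermost still-open constituent was opened — the alignments can be precomputed from a gold top-down sequence as follows; the function name and details are illustrative, not the authors' exact procedure.

```python
def deterministic_alignments(y, num_words):
    """Pre-compute (b_t, s_t) for every timestep from a top-down sequence y.

    Assumptions (our reading of Figure 2(b)): s_t is the position of the next
    input word still to be covered, i.e. 1 + number of "XX" tokens in y_{<t};
    b_t is the input position at which the innermost still-open constituent
    was opened.  Stored up front, these become an extra input sequence, so
    training stays purely sequential.
    """
    b, s = [], []
    open_stack = []      # start position of each currently open constituent
    covered = 0          # number of input words already emitted as "XX"
    for token in y:
        s_t = min(covered + 1, num_words)
        b_t = open_stack[-1] if open_stack else 1
        b.append(b_t)
        s.append(s_t)
        if token == "XX":
            covered += 1
        elif token.startswith("("):
            open_stack.append(s_t)
        elif token.startswith(")"):
            open_stack.pop()
    return b, s

# illustrative sequence whose prefix matches Figure 2(b)
y = ["(S", "(NP", "XX", "XX", ")NP", "(VP", "XX", ")VP", "XX", ")S"]
b, s = deterministic_alignments(y, num_words=5)
print(b[4], s[4])   # timestep t = 5: -> 1 3
# the reduced context of Eq.(6) would then simply pick E^x at position s[t]
```

Because b and s depend only on the given tree, they can be computed once per training pair and stored alongside x and y, which is exactly what keeps the training loop the same shape as standard sequence-to-sequence training.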
Experiments

Datasets

The experiments are conducted on both English and Chinese constituency parsing tasks. For the English task, we use the benchmark WSJ sections of the Penn Treebank (PTB) (Marcus, Marcinkiewicz, and Santorini 1993), and we follow the standard splits: the training data ranges from section 2 to section 21; the development data is section 24; and the test data is section 23. For the Chinese parsing task, we use the Penn Chinese Treebank (CTB) version 5.1 (Xue et al. 2005). The training data includes articles 001-270 and 440-1151; the development data is articles 301-325; and the test data is articles 271-300. The statistics of these datasets are shown in Table 1. We pretrained the English word embeddings on the AFP portion of English Gigaword and the Chinese word embeddings on a Wikipedia corpus, using the skip-gram model in the word2vec toolkit with the default settings (Mikolov et al. 2013).

        English (PTB)              Chinese (CTB)
        train     dev     test     train     dev    test
Sents   39,831    1,346   2,416    18,072    352    348
Toks    2,430k    84k     145k     1,611k    22k    25k

Table 1: Corpus statistics of the Penn English Treebank (PTB) and the Penn Chinese Treebank (CTB).

Settings

The part-of-speech tags for both the English and Chinese datasets are obtained using the Stanford tagger (Toutanova and Manning 2000) with 10-way jackknifing. In both the English and Chinese tasks, words with frequency less than 9 are replaced by UNK, leading to an English vocabulary size of 6870 and a Chinese vocabulary size of 4967.

The training objective is optimized by stochastic gradient descent (SGD) on the training data, and the best epoch is selected on the development data. The SGD algorithm uses a learning rate schedule with exponential decay. The parameters θ of our models are randomly initialized from a uniform distribution as suggested by Glorot and Bengio (2010). The overall hyper-parameters are shown in Table 2 and were not further tuned in our experiments.

As there is randomness in our training algorithm, we independently run the algorithm five times and report the final results as the average over these runs. The ensemble model includes these five independent models as its individual models.

Parameter                                Value
LSTM layers                              1
Hidden unit dim                          512
Trained word embedding dim               256
English pretrained word embedding dim    100
Chinese pretrained word embedding dim    200
POS tag embedding dim                    128
Dropout rate                             0.3
English batch size                       10
Chinese batch size                       5
Beam size in search                      10

Table 2: Hyper-parameters used in our models.
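For concreteness, the Table 2 settings can be collected into a single configuration object; the sketch below simply transcribes the table, and the field names are ours.

```python
# Hyper-parameters from Table 2 (field names are illustrative).
HYPERPARAMS = {
    "lstm_layers": 1,
    "hidden_dim": 512,
    "trained_word_emb_dim": 256,
    "pretrained_word_emb_dim": {"english": 100, "chinese": 200},
    "pos_tag_emb_dim": 128,
    "dropout": 0.3,
    "batch_size": {"english": 10, "chinese": 5},
    "beam_size": 10,
}
```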

Results on PTB

Model   F1
DA1     90.2
DA2     90.2
DA3     90.3

Table 3: F1 score of different context functions on the PTB development data. DAi denotes the model with the i-th context function in Section 3.

Figure 3: F1 score (dashed curve, left y-axis) and model score (solid curve, right y-axis) with respect to different beam sizes during decoding on the development set.

In Table 3, we compare the proposed models based on the different context functions φ presented in Section 3. From this table, we can see that there are no significant differences among these three models. Therefore, in the rest of this paper, we only report the results of DA3 for simplicity. Note that DA1 includes more information than the other two models, since its context function φ is defined on top of both the phrase P(b_t, s_t) and the semi-phrase P(s_t, ?). However, this does not lead to more gains accordingly. One possible reason is that the simple linear layer in Eq.(4) cannot make full use of this additional information; another reason may be that the phrase representation method is not strong enough to differentiate them.

Figure 3 shows the F1 score and model score for different beam sizes on the development dataset. From this figure, we can see that our parser achieves a decent F1 score of 89.8 even with greedy decoding, i.e., beam size 1. As the beam size increases, there are modest improvements, and the accuracy peaks at a beam size of 80, achieving an F1 score of 90.3; the case is similar for the model score. In addition, we found that our accuracy stays at almost the same level when using a large beam size. This result is very different from the findings for NMT systems on the machine translation task, where translation accuracy decreases considerably with a large beam size, as reported in Tu et al. (2017b) and Koehn and Knowles (2017). This may be explained by our deterministic attention, which is designed for parsing rather than for MT, where probabilistic attention is used. To trade off parsing accuracy and efficiency, in the rest of the experiments the beam size is set to 10.

Model                             F1
Vinyals et al. (2015) (single)    88.3
NMT-top-down (single)             87.2
Ma et al. (2017) (single)         88.6
Liu and Zhang (2017a)             90.5
This paper (single)               91.2

Table 4: Comparison between our model and other sequence-to-sequence parsers on the English PTB test dataset.

Table 4 shows the comparison of the proposed model with other sequence-to-sequence parsing models under the single-model setting. Our model achieves significant gains over the models with probabilistic attention on the top-down linearization, i.e., the Vinyals et al. (2015) and NMT baselines. Similarly, compared with Ma et al. (2017), which uses deterministic attention on the bottom-up linearization, our model obtains substantial improvements as well. In addition, compared with the fine-grained probabilistic attention model (Liu and Zhang 2017a), our model still obtains a gain of 0.7 F1 score. Note that both (Liu and Zhang 2017a) and our model include POS tag information and external word embeddings, while the other three models do not involve this additional information, and thus one might argue that their comparison is unfair.

To make an apples-to-apples comparison, we omit the POS tags and external word embeddings from our model and compare it with these sequence-to-sequence models again.

Model                     F1
Vinyals et al. (2015)     88.3
Ma et al. (2017)          88.6
This paper (ablated)      90.1

Table 5: F1 of single sequence-to-sequence parsing models on the English PTB test data without POS tags and external embeddings.

As shown in Table 5, without the pretrained word embeddings and POS tags, the accuracy of our model drops by about 1 point, from 91.2 to 90.1. Fortunately, our model still obtains substantial improvements over the baselines. Our model outperforms Vinyals et al. (2015) by 1.8 F1 score, which demonstrates that deterministic attention is better than probabilistic attention for constituency parsing. In addition, our model achieves an improvement of 1.5 F1 score over Ma et al. (2017). This shows that top-down linearization is better than bottom-up linearization in the sequence-to-sequence parsing setting.

Note that Vinyals et al. (2015) tried to introduce POS tags in the decoding phase, by replacing "XX" with its corresponding POS tag, but no performance improvement was observed. In this paper, we introduce the POS tags in the encoding phase and found them useful in our experiments. This finding is consistent with state-of-the-art neural parsers (Zhu et al. 2013; Liu and Zhang 2017b; Dyer et al. 2016; Stern, Andreas, and Klein 2017), all of which involve POS tag information. Furthermore, to the best of our knowledge, it is the first time that a single model without POS tags or any other external resources achieves competitive accuracy on the PTB dataset.

Ensembling is an effective technique for recurrent neural networks, and thus we build an ensemble of five models for further improvements. For comparison, we report the ensemble results of Vinyals et al. (2015), NMT-top-down, and Ma et al. (2017). Note that the first two baselines include five individual models, as ours does, and the third baseline includes ten individual models. The results are shown in Table 6. From this table, we can see that the ensemble boosts our accuracy from 91.2 to 92.3 F1 score, and it still outperforms the ensemble models that include at least the same number of individual models.

Model                               F1
Vinyals et al. (2015) (ensemble)    90.5
NMT-top-down (ensemble)             89.8
Ma et al. (2017) (ensemble)         90.6
This paper (ensemble)               92.3
This paper (single)                 91.2

Table 6: Comparison with direct sequence-to-sequence parsers on the English PTB section 23 dataset.
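The paper does not spell out how the five models are combined at decoding time, so the sketch below assumes the common strategy of averaging the per-step output distributions before beam search; treat it as one plausible reading rather than the authors' exact recipe.

```python
import numpy as np

def ensemble_step(per_model_probs):
    """Combine next-token distributions from several independently trained
    models by simple averaging (an assumed, standard ensembling strategy).

    per_model_probs: list of (V,) arrays, one distribution per model.
    Returns a single (V,) distribution used to score beam candidates.
    """
    return np.mean(np.stack(per_model_probs, axis=0), axis=0)

# toy usage with 5 models over a vocabulary of size 4
probs = [np.full(4, 0.25) for _ in range(5)]
print(ensemble_step(probs))   # -> [0.25 0.25 0.25 0.25]
```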
We also compare our models with various state-of-the-art parsers, as shown in Table 7. Several observations can be made from the table. First, our single model outperforms the fully supervised feature-rich models with gains of up to 0.8 F1 score, and it can match the performance of several neural models such as Cross and Huang (2016). Second, our model is better than the other sequence-to-sequence baselines except Vaswani et al. (2017), which relies on probabilistic attention but a different model architecture; it is possible and would be interesting to integrate our idea into their architecture. Third, our single model is competitive with the best fully supervised neural models, such as Dyer et al. (2016), Stern, Andreas, and Klein (2017) and Liu and Zhang (2017b), in parsing accuracy. Since our model is sequential and does not require the tree-structured networks of Dyer et al. (2016) and Stern, Andreas, and Klein (2017), its training is easier to parallelize and to scale to large training datasets. In addition, as the parser proposed by Liu and Zhang (2017b) is essentially a feature-rich model with an additional RNN feature, our model is, in this sense, a promising component to integrate into a feature-rich model for further improvements. Finally, our ensemble model delivers a decent result compared to the reranking and semi-supervised models. On the other hand, our model is orthogonal to both of them: for instance, our model can provide better k-best candidates for reranking with the notable generative models of Dyer et al. (2016) and Choe and Charniak (2016).

Model                                F1
fully-supervised
Petrov and Klein (2007)¶             90.1
Socher et al. (2013)‡                90.4
Zhu et al. (2013)¶                   90.4
Liu and Zhang (2017a)†               90.5
Watanabe and Sumita (2015)†          90.7
Durrett and Klein (2015)†            91.1
Cross and Huang (2016)†              91.3
Dyer et al. (2016)†                  91.7
Stern, Andreas, and Klein (2017)†    91.7
Liu and Zhang (2017b)‡               91.7
Petrov (2010) (ensemble)¶            91.9
Vaswani et al. (2017)†               91.3
Vinyals et al. (2015) (ensemble)†    90.5
Ma et al. (2017) (ensemble)†         90.7
This paper (ensemble)†               92.3
This paper (single)†                 91.2
reranking
Charniak and Johnson (2005)¶         91.5
Huang (2008)¶                        91.7
Choe and Charniak (2016)†            92.6
Dyer et al. (2016)†                  93.3
Kuncoro et al. (2017)†               93.6
semi-supervised
Wang, Mi, and Xue (2015)†            90.7
Zhu et al. (2013)¶                   91.3
Vinyals et al. (2015)†               92.8
Choe and Charniak (2016)†            93.8

Table 7: Comparison with state-of-the-art parsers on English PTB test data. ¶, † and ‡ denote feature-rich models, neural models, and hybrid feature-rich and neural models, respectively.

Results on CTB

We additionally train parsers on the CTB and test their accuracy on its test data. We list our final results and compare them to several other papers in Table 8. Our single model is competitive with Watanabe and Sumita (2015). Although it falls short of the scores achieved by the structured neural networks of Dyer et al. (2016) and the hybrid model of Liu and Zhang (2017b), this gap can be narrowed by a simple ensemble technique.

Model                            F1
fully-supervised
Petrov and Klein (2007)¶         83.3
Zhu et al. (2013)¶               83.2
Wang et al. (2015)‡              83.2
Watanabe and Sumita (2015)†      84.3
Dyer et al. (2016)†              84.6
Liu and Zhang (2017b)‡           85.5
This paper (single)†             84.1
This paper (ensemble)†           85.4
reranking
Charniak and Johnson (2005)¶     82.3
Dyer et al. (2016)†              86.9
semi-supervised
Zhu et al. (2013)¶               85.6
Wang and Xue (2014)¶             86.3
Wang et al. (2015)‡              86.6

Table 8: Comparison with state-of-the-art parsers on CTB test data. ¶, † and ‡ denote feature-rich models, neural models, and hybrid feature-rich and neural models, respectively.

Errors on PTB

We used the toolkit of Kummerfeld et al. (2012) to analyze the parsing errors by error type.² Table 9 lists the error results for a single model from Ma et al. (2017) and a single model of ours with POS tags and external word embeddings. Generally, both sequence-to-sequence models suffer most of their errors from the PP attachment, 1-word span and unary types. Compared with the parser of Ma et al. (2017), our improved parser largely reduces these errors, and the coordination errors in particular are reduced by about 40%.

Error type        Bottom-up   Top-down   Reduction
PP Attach         850         668        182 (21%)
1-word Span       687         489        198 (29%)
Unary             555         412        143 (25%)
NP Int            464         335        129 (28%)
Clause Att        376         311        65 (17%)
Different label   374         269        105 (28%)
Mod Attach        317         264        53 (17%)
Co-ordination     379         225        154 (41%)
UNSET add         291         216        75 (26%)
NP Attachment     216         161        55 (25%)
Other             413         304        109 (26%)

Table 9: Error types between bottom-up and top-down deterministic attentions on PTB.

² https://code.google.com/p/berkeley-parser-analyser/

Related Work

Constituency parsing is a traditional NLP task, and there have been many attempts to model it. Over the decades, feature-rich linear models have been widely employed in constituency parsing. For example, Petrov and Klein (2007) proposed a log-linear model with hand-crafted features based on the CKY parsing agenda. Zhu et al. (2013) and Wang and Xue (2014) integrated more abundant features into the shift-reduce parsing agenda. These feature-rich models are efficient to train, but they are unable to make full use of long-distance features due to feature sparsity.

Neural models are appealing because they avoid feature sparsity by mapping discrete features into dense vectors, and they provide a natural way to make use of large context for parsing action predictions (Vinyals et al. 2015; Watanabe and Sumita 2015; Cross and Huang 2016; Dyer et al. 2016). For instance, Dyer et al. (2016) proposed a stack-LSTM to represent the history of action predictions in a tree structure; Stern, Andreas, and Klein (2017) proposed a variant neural network based on the CKY structure; and both of these notable models achieved the state-of-the-art parsing accuracy so far. Unlike these explicit tree-structured models, there are sequential neural models that cast constituency parsing as an instance of sequence-to-sequence learning (Vinyals et al. 2015; Liu and Zhang 2017b; Vaswani et al. 2017). Similar to these models, our parsing model is essentially a sequential model; however, our model is based on deterministic attention, as in Ma et al. (2017), rather than the probabilistic attention in their models.

Conclusion and Future Work

This paper proposed an improved sequence-to-sequence constituency parser. Its key technique is a deterministic attention mechanism, which is able to select informative context from the input side at each timestep along the top-down linearized sequence of a parse tree. Empirical results show that our parser achieves substantial improvements over many sequence-to-sequence baseline parsers, and it obtains 92.3 F-score on the Penn English Treebank section 23 dataset and 85.4 F-score on the Chinese Treebank test dataset, without reranking or semi-supervised training. In the future, as we used a simple method to represent a phrase and a semi-phrase, it is promising to investigate more sophisticated methods for their representations in our deterministic attention mechanism. In addition, since our parser is trained in a sequential manner, it would be interesting to train it on a large amount of data by tri-training, which has the potential to further boost parsing accuracy (Vinyals et al. 2015; Choe and Charniak 2016).

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.

Charniak, E., and Johnson, M. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of ACL, 173–180.

Choe, D. K., and Charniak, E. 2016. Parsing as language modeling. In Proceedings of EMNLP, 2331–2336.

Cross, J., and Huang, L. 2016. Span-based constituency parsing with a structure-label system and provably optimal dynamic oracles. In Proceedings of EMNLP, 1–11.

Durrett, G., and Klein, D. 2015. Neural CRF parsing. In Proceedings of ACL-IJCNLP, 302–312.

Dyer, C.; Ballesteros, M.; Ling, W.; Matthews, A.; and Smith, N. A. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL-IJCNLP, 334–343.

Dyer, C.; Kuncoro, A.; Ballesteros, M.; and Smith, N. A. 2016. Recurrent neural network grammars. In Proceedings of NAACL, 199–209.

Galley, M.; Hopkins, M.; Knight, K.; and Marcu, D. 2004. What's in a translation rule? In Proceedings of NAACL-HLT.

Galley, M.; Graehl, J.; Knight, K.; Marcu, D.; DeNeefe, S.; Wang, W.; and Thayer, I. 2006. Scalable inference and training of context-rich syntactic translation models. In Proceedings of COLING-ACL, 961–968.

Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.

Huang, L. 2008. Forest reranking: Discriminative parsing with non-local features. In Proceedings of ACL-HLT, 586–594.

Koehn, P., and Knowles, R. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, 28–39.

Kummerfeld, J. K.; Hall, D.; Curran, J. R.; and Klein, D. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of EMNLP-CoNLL, 1048–1059.

Kuncoro, A.; Ballesteros, M.; Kong, L.; Dyer, C.; Neubig, G.; and Smith, N. A. 2017. What do recurrent neural network grammars learn about syntax? In Proceedings of EACL, 1249–1258.

Liu, J., and Zhang, Y. 2017a. Encoder-decoder shift-reduce syntactic parsing. In Proceedings of IWPT.

Liu, J., and Zhang, Y. 2017b. Shift-reduce constituent parsing with neural lookahead features. TACL 5:45–58.

Ma, C.; Liu, L.; Tamura, A.; Zhao, T.; and Sumita, E. 2017. Deterministic attention for sequence-to-sequence constituent parsing. In Proceedings of the AAAI, 3237–3243.

Manning, C. D.; Schütze, H.; et al. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Marcus, M. P.; Marcinkiewicz, M. A.; and Santorini, B. 1993. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19(2):313–330.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.

Petrov, S., and Klein, D. 2007. Improved inference for unlexicalized parsing. In Proceedings of NAACL-HLT.

Petrov, S. 2010. Products of random latent variable grammars. In Proceedings of NAACL-HLT, 19–27.

Rim, H.-C.; Seo, J.; and Simmons, R. F. 1990. Transforming syntactic graphs into semantic graphs. In Proceedings of ACL, 47–53.

Roark, B., and Johnson, M. 1999. Efficient probabilistic top-down and left-corner parsing. In Proceedings of ACL, 421–428.

Socher, R.; Bauer, J.; Manning, C. D.; and Ng, A. Y. 2013. Parsing with compositional vector grammars. In Proceedings of ACL, 455–465.

Stern, M.; Andreas, J.; and Klein, D. 2017. A minimal span-based neural constituency parser. In Proceedings of ACL, 818–827.

Toutanova, K., and Manning, C. D. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 63–70.

Tu, Z.; Liu, Y.; Lu, Z.; Liu, X.; and Li, H. 2017a. Context gates for neural machine translation. TACL 5:87–99.

Tu, Z.; Liu, Y.; Shang, L.; Liu, X.; and Li, H. 2017b. Neural machine translation with reconstruction. In Proceedings of the AAAI.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. CoRR.

Vinyals, O.; Kaiser, Ł.; Koo, T.; Petrov, S.; Sutskever, I.; and Hinton, G. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2773–2781.

Wang, Z., and Xue, N. 2014. Joint POS tagging and transition-based constituent parsing in Chinese with non-local features. In Proceedings of ACL, 733–742.

Wang, Z.; Mi, H.; and Xue, N. 2015. Feature optimization for constituent parsing via neural networks. In Proceedings of ACL, 1138–1147.

Watanabe, T., and Sumita, E. 2015. Transition-based neural constituent parsing. In Proceedings of the ACL, 1169–1179.

Xue, N.; Xia, F.; Chiou, F.-d.; and Palmer, M. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Nat. Lang. Eng. 11(2):207–238.

Zhu, M.; Zhang, Y.; Chen, W.; Zhang, M.; and Zhu, J. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of ACL, 434–443.
