Semi-Supervised Parsing with a Variational Autoencoding Parser
Xiao Zhang and Dan Goldwasser
Department of Computer Science, Purdue University
[email protected], [email protected]

Abstract

We propose an end-to-end variational autoencoding parsing (VAP) model for semi-supervised graph-based projective dependency parsing. It encodes the input into continuous latent variables in a sequential manner, using deep neural networks (DNNs) that can exploit contextual information, and reconstructs the input with a generative model. The VAP model admits a unified structure with different loss functions for labeled and unlabeled data, with shared parameters. We conducted experiments on the WSJ data sets, showing that the proposed model can use unlabeled data to improve performance over a limited amount of labeled data, on a par with a recently proposed semi-supervised parser while offering faster inference.

[Figure 1: A dependency tree for the sentence "My dog also likes eating sausage ." (POS tags: PRP$ NN RB VBZ VBG NN PUNC). Directional arcs represent head-modifier relations between words, labeled poss, nsubj, advmod, xcomp, dobj, and pu(nct).]

1 Introduction

Dependency parsing captures bi-lexical relationships by constructing directional arcs between words, defining a head-modifier syntactic structure for sentences, as shown in Figure 1. Dependency trees are fundamental for many downstream tasks such as semantic parsing (Reddy et al., 2016; Marcheggiani and Titov, 2017), machine translation (Bastings et al., 2017; Ding and Palmer, 2007), information extraction (Culotta and Sorensen, 2004; Liu et al., 2015) and question answering (Cui et al., 2005). Recently, efficient parsers (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017; Dozat et al., 2017; Ma et al., 2018) have been developed using various neural architectures.

While supervised approaches have been very successful, they require large amounts of labeled data, particularly when neural architectures, which are usually over-parameterized, are used. Syntactic annotation is notoriously difficult and requires specialized linguistic expertise, posing a serious challenge for low-resource languages. Semi-supervised parsing aims to alleviate this problem by combining a small amount of labeled data with a large amount of unlabeled data to improve parsing performance over using labeled data alone. Traditional semi-supervised parsers use unlabeled data to generate additional features that assist the learning process (Koo et al., 2008), together with different variants of self-training (Søgaard, 2010). However, these approaches are usually pipelined, and errors may propagate between their stages.

In this paper, we propose the Variational Autoencoding Parser, or VAP, which extends the idea of the VAE, as illustrated in Figure 3. The VAP model uses unlabeled examples to learn continuous latent variables of the sentence, which can support tree inference by providing an enriched representation.

We summarize our contributions as follows:
1. We propose a Variational Autoencoding Parser (VAP) for semi-supervised dependency parsing;
2. We design a unified loss function for the proposed parser that handles both labeled and unlabeled data;
3. We show improved performance of the proposed model with unlabeled data on the WSJ data sets, on a par with a recently proposed semi-supervised parser (Corro and Titov, 2019), with faster inference.

Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task, pages 40-47, Virtual Meeting, July 9, 2020. © 2020 Association for Computational Linguistics.

2 Related Work

Most dependency parsing studies fall into two major groups: graph-based and transition-based (Kübler et al., 2009).
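To make the head-modifier structure of Figure 1 concrete, a dependency tree can be encoded as an array mapping each token to the index of its head. The sketch below is purely illustrative (the helper `is_projective_tree` is a hypothetical name, not part of the paper's model); index 0 is the artificial ROOT, and the head array follows the arcs drawn in Figure 1.

```python
# The Figure 1 tree as a head-index array: heads[i] is the position of
# token i's head; position 0 is the artificial ROOT.
tokens = ["<ROOT>", "My", "dog", "also", "likes", "eating", "sausage", "."]
heads  = [None,      2,    4,     4,      0,       4,        5,         4]

def is_projective_tree(heads):
    """Check that the head array encodes a single tree rooted at 0
    (no cycles) and that no two arcs cross (projectivity)."""
    n = len(heads) - 1
    # every token must reach ROOT by following head links
    for m in range(1, n + 1):
        seen, h = {m}, heads[m]
        while h != 0:
            if h in seen:        # cycle detected
                return False
            seen.add(h)
            h = heads[h]
    # projectivity: no pair of crossing arcs
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads)
            if m and h is not None]
    for (a, b) in arcs:
        for (c, d) in arcs:
            if a < c < b < d:    # arcs (a,b) and (c,d) cross
                return False
    return True

print(is_projective_tree(heads))  # True for the Figure 1 tree
```

Graph-based parsers search over exactly this space of head assignments, which is why a per-arc scoring function (Section 3) suffices for first-order parsing.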
Graph-based parsers (McDonald, 2006) regard parsing as a structured prediction problem of finding the most probable tree, while transition-based parsers (Nivre, 2004, 2008) treat parsing as a sequence of actions at different stages leading to a dependency tree.

While earlier works relied on manual feature engineering, in recent years hand-crafted features have been replaced by embeddings, and deep neural network architectures are used to learn representations for scoring structural decisions, leading to improved performance in both graph-based and transition-based parsing (Nivre, 2014; Pei et al., 2015; Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Andor et al., 2016; Kiperwasser and Goldberg, 2016; Wiseman and Rush, 2016). The annotation difficulty for this task has also motivated work on unsupervised (grammar induction) and semi-supervised approaches to parsing (Tu and Honavar, 2012; Jiang et al., 2016; Koo et al., 2008; Li et al., 2014; Kiperwasser and Goldberg, 2015; Cai et al., 2017; Corro and Titov, 2019), and has led to advances in using unlabeled data for constituent grammar (Shen et al., 2018b,a).

Similar to other structured prediction tasks, directly optimizing the objective is difficult when the underlying probabilistic model requires marginalizing over the dependency trees. Variational approaches are a natural way to alleviate this difficulty, as they optimize a lower bound of the original objective, and they have been applied in several recent NLP works (Stratos, 2019; Chen et al., 2018; Kim et al., 2019b,a). The Variational Autoencoder (VAE) (Kingma and Welling, 2014) is particularly useful for latent representation learning, and has been studied in the semi-supervised setting as the Conditional VAE (CVAE) (Sohn et al., 2015). Note that our work differs from the VAE, which is designed for tabular data rather than structured prediction: the input to VAP is the sequence of sentential tokens and the output is a dependency tree.

3 Graph-based Dependency Parsing

A dependency graph of a sentence can be regarded as a directed tree spanning all the words of the sentence, including a special "word", the ROOT, from which arcs originate. Assuming a sentence of length $l$, a dependency tree can be denoted as $T = (\langle h_1, m_1 \rangle, \ldots, \langle h_{l-1}, m_{l-1} \rangle)$, where $h_t$ is the index in the sequence of the head word of the dependency connecting the $t$-th word $m_t$ as a modifier.

Our graph-based VAP parser is built on the following standard structured prediction paradigm (McDonald et al., 2005; Taskar et al., 2005). At inference time, given the scoring function $S_\Lambda$ with parameters $\Lambda$, the parsing problem is formulated as finding the most probable directed spanning tree for a given sentence $x$:

$$T^* = \arg\max_{\tilde{T} \in \mathcal{T}} S_\Lambda(x, \tilde{T}),$$

where $T^*$ is the highest-scoring parse tree and $\mathcal{T}$ is the set of all valid trees for the sentence $x$.

It is common to factorize the score of the entire graph into the summation of its substructures, the individual arc scores (McDonald et al., 2005):

$$S_\Lambda(x, \tilde{T}) = \sum_{(h,m) \in \tilde{T}} s_\Lambda(h, m) = \sum_{t=1}^{l} s_\Lambda(h_t, m_t),$$

where $\tilde{T}$ represents the candidate parse tree and $s_\Lambda$ is a function scoring individual arcs: $s_\Lambda(h, m)$ describes the likelihood of an arc from the head $h$ to its modifier $m$ in the tree. Throughout this paper, the scoring is based on individual arcs, as we focus on first-order parsing.

3.1 Scoring Function Using a Neural Architecture

We use the same neural architecture as Kiperwasser and Goldberg (2016). We first use a bi-LSTM that takes as input $u_t = [p_t; e_t]$ at position $t$ to incorporate contextual information, feeding it the word embedding $e_t$ concatenated with the POS tag embedding $p_t$ of each word. The bi-LSTM then projects $u_t$ to a hidden state $o_t$.

Subsequently, a nonlinear transformation is applied to these projections. Suppose the hidden states generated by the bi-LSTM are $[o_{root}, o_1, o_2, \ldots, o_t, \ldots, o_l]$ for a sentence of length $l$. We compute the arc scores by introducing parameters $W_h$, $W_m$, $w$ and $b$, and transforming the hidden states as follows:

$$r_t^{h\text{-}arc} = W_h o_t, \qquad r_t^{m\text{-}arc} = W_m o_t,$$
$$s_\Lambda(h, m) = w^\top \tanh(r_h^{h\text{-}arc} + r_m^{m\text{-}arc} + b).$$

In this formulation, we first use two parameter matrices to extract two different representations that carry two different types of information: a head seeking its modifier (h-arc) and a modifier seeking its head (m-arc). A nonlinear function then maps them to an arc score.

[Figure 2: An illustration of the arc scoring matrix over Root, $x_1, \ldots, x_t, \ldots, x_{L-1}$: each entry represents the $(h(\text{head}) \rightarrow m(\text{modifier}))$ score, e.g., $S(t, t-1)$ is the score of the $(t, t-1)$ right arc and $S(t-1, t)$ the score of the $(t-1, t)$ left arc; diagonal entries are zero.]

The latent variable retains the contextual information from lower-level neural models to assist each token in finding its head or its modifier, and it also forces the representations of similar tokens to be closer. The latent variable group $z$ is modeled via $P(z|x)$. In addition, we model the process of reconstructing the input sentence from the latent variable through a generative story $P(x|z)$.

We adjust the original VAE setup to our semi-supervised task by considering examples with labels, similar to recent conditional variational formulations (Sohn et al., 2015; Miao and Blunsom, 2016; Zhou and Neubig, 2017). We propose a full probabilistic model for a given sentence $x$, with a unified objective to maximize for both supervised and unsupervised parsing:

$$J = \log P_\theta(x) P_\omega(T|x)^{\delta}, \qquad \delta = \begin{cases} 1, & \text{if } T \text{ exists;} \\ 0, & \text{otherwise.} \end{cases}$$

This objective can be interpreted as follows: if the training example comes with a gold tree $T$, the objective is the log joint probability $P_{\theta,\omega}(T, x)$; if the gold tree is missing, the objective is the log marginal probability $P_\theta(x)$.

[Figure 3: (a) the VAE and (b) the VAP: an encoder maps the input sentence $x$ to latent variables, and a decoder reconstructs the input $\hat{x}$.]
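The case split in this unified objective can be sketched as follows. This is a minimal numeric illustration only: the log-probabilities passed in are hypothetical placeholders for the values that the encoder/decoder and parser networks would actually produce.

```python
def unified_objective(log_p_x, log_p_tree_given_x, has_gold_tree):
    """Unified semi-supervised objective
        J = log P_theta(x) + delta * log P_omega(T | x),
    where delta = 1 if a gold tree T is available and 0 otherwise.
    With a gold tree this is the log joint probability of (T, x);
    without one it reduces to the log marginal probability of x."""
    delta = 1.0 if has_gold_tree else 0.0
    return log_p_x + delta * log_p_tree_given_x

# Labeled example: log joint probability of sentence and tree.
labeled = unified_objective(-12.3, -4.1, True)     # approx. -16.4
# Unlabeled example: log marginal probability of the sentence alone.
unlabeled = unified_objective(-12.3, -4.1, False)  # -12.3
```

Because both cases share the reconstruction term $\log P_\theta(x)$, labeled and unlabeled examples can be mixed in one training batch with shared parameters, which is the point of the unified loss.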