Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Dual-View Variational Autoencoder for Semi-Supervised Text Matching

Zhongbin Xie and Shuai Ma
SKLSDE Lab, Beihang University, Beijing, China
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing, China
{xiezb, mashuai}@buaa.edu.cn

Abstract

Semantically matching two text sequences (usually two sentences) is a fundamental problem in NLP. Most previous methods either encode each of the two sentences into a vector representation (sentence-level embedding) or leverage word-level interaction features between the two sentences. In this study, we propose to take the sentence-level embedding features and the word-level interaction features as two distinct views of a sentence pair, and unify them within a framework of Variational Autoencoders such that the sentence pair is matched in a semi-supervised manner. The proposed model is referred to as Dual-View Variational AutoEncoder (DV-VAE), where the optimization of the variational lower bound can be interpreted as an implicit Co-Training mechanism for two matching models over distinct views. Experiments on SNLI, Quora and a Community Question Answering dataset demonstrate the superiority of our DV-VAE over several strong semi-supervised and supervised text matching models.

(Figure 1 sketch: (a) a sentence encoding-based model encodes Sentence 1 "walk together down a street" and Sentence 2 "walking hand in hand" into sentence embeddings, the embedding view, from which the matching degree is predicted; (b) a sentence pair interaction model aligns the two sentences with word alignment methods and predicts the matching degree from the interaction features, the interaction view.)

Figure 1: Two categories of text matching models: They leverage sentence embeddings and word-level interaction features, respectively, which can be seen as two distinct views of the sentence pair.

1 Introduction

The need to semantically match two text sequences arises in many Natural Language Processing problems, where a central task is to compute the matching degree between two sentences and determine their semantic relationship. For instance, in Paraphrase Identification [Dolan and Brockett, 2005], whether one sentence is a paraphrase of another has to be determined; in Question Answering [Yang et al., 2015], a matching score is calculated for a question and each of its candidate answers for making decisions; and in Natural Language Inference [Bowman et al., 2015], the relationship between a premise and a hypothesis is classified as entailment, neutral or contradiction.

Most previous studies on text matching focus on developing supervised models with deep neural networks. These models can be essentially divided into two categories: (i) sentence encoding-based models, which separately encode each of the two sentences into a vector representation (sentence embedding) and then match between the two vectors [Bowman et al., 2016a; Mueller and Thyagarajan, 2016], and (ii) sentence pair interaction models, which use some sort of word alignment method, such as interaction matrices [Gong et al., 2018; Wu et al., 2018] or attention mechanisms [Rocktäschel et al., 2016; Wang and Jiang, 2017], to obtain fine-grained interaction features for predicting the matching degree. Sentence encoding-based models leverage global sentence representations with high-level semantic features, while sentence pair interaction models leverage word-by-word interaction features containing local matching patterns, as illustrated in Figure 1.

With the recent advances in deep generative models, some studies have begun to employ variational autoencoders (VAEs) [Kingma and Welling, 2014] to learn informative sentence embeddings for various downstream NLP problems, including text matching [Bowman et al., 2016b; Zhao et al., 2018; Shen et al., 2018]. They leverage a VAE to encode sentences into latent codes, which are used as sentence embeddings for a sentence encoding-based matching model. The VAE and the matching model can be jointly trained in a semi-supervised manner, leveraging large amounts of unlabeled data to improve matching performance. However, these models are limited to the global semantic features in the sentence embeddings, leaving out the word-level interaction features that have been proved important for predicting matching degrees in the supervised case [Lan and Xu, 2018].

Motivated by these observations, we propose to unify the sentence-level embedding features and the word-level interaction features within a variational autoencoder, leveraging both labeled and unlabeled sentence pairs in a semi-supervised manner for text matching. We take inspiration from Co-Training [Blum and Mitchell, 1998], where two classifiers over two distinct views of the data examples are trained to produce consistent predictions on the unlabeled data. For a sentence pair, the aforementioned two levels of features are taken as two distinct views, namely the embedding view and the interaction view. The proposed model is denoted Dual-View Variational AutoEncoder (DV-VAE) (Figure 2). In the generative process, two sentences are generated from two latent variables, respectively. The matching degree is also generated from these two latent variables, treated as the embedding view, through a sentence encoding-based model. In the inference process, the matching degree is inferred from the interaction view through a sentence pair interaction model. During the optimization of the variational lower bound, the two matching models implicitly provide pseudo labels on unlabeled data for each other, which can be interpreted as an implicit Co-Training mechanism.

Our contributions are as follows: (i) We propose Dual-View Variational AutoEncoder (DV-VAE) to unify the embedding view and the interaction view of a sentence pair for semi-supervised text matching. An implicit Co-Training mechanism is also formulated to interpret the training process. (ii) We instantiate an implementation of DV-VAE and adopt a novel sentence pair interaction matching model, where interaction matrices across words and contexts are introduced to enrich the interaction features. (iii) Using three datasets: SNLI, Quora and a Community QA dataset, we empirically demonstrate the superiority of DV-VAE over several strong semi-supervised and supervised baselines.

(Figure 2 sketch: the generative model generates x1 from z1 via p(x1|z1), x2 from z2 via p(x2|z2), and y from (z1, z2) via p(y|z1, z2), i.e., matching from the embedding view; the inference model infers z1 via q(z1|x1), z2 via q(z2|x2), and y via q(y|x1, x2), i.e., matching from the interaction view.)

Figure 2: Probabilistic graphical model of DV-VAE. Solid lines denote the generative model, and dashed lines denote the inference model. Shaded x1, x2 are observed variables; z1, z2 are latent variables; and y is a semi-observed variable.

2 Dual-View Variational Autoencoder

Suppose that we have a labeled sentence pair set Dl and an unlabeled sentence pair set Du. (x1, x2, y) ∈ Dl denotes a labeled sentence pair, where x1, x2 are two sentences and y ∈ {1, 2, ..., C} is the matching degree of x1 and x2. Here y is discretized and text matching is treated as a classification problem. Similarly, (x1, x2) ∈ Du denotes an unlabeled pair. Our goal is to develop a semi-supervised text matching model using both the labeled and unlabeled data Dl and Du, which can improve upon the performance of supervised text matching models using the labeled data Dl only.

2.1 Model Architecture

The probabilistic graphical model of DV-VAE is shown in Figure 2. It consists of a generative model matching from the embedding view and an inference model matching from the interaction view.

Generative Model. The generative process of a sentence pair and its matching degree (x1, x2, y) is defined as follows: two continuous latent codes z1, z2 ∈ R^dZ are independently sampled from a prior p(z), and are used to generate x1 and x2 through a decoder pθ(x|z). Latent codes z1, z2 are also fed into a sentence encoding-based matching model pθ(y|z1, z2) to generate the matching degree y. The joint distribution can be explained by the following factorization:

pθ(x1, x2, y, z1, z2) = p(z1) pθ(x1|z1) p(z2) pθ(x2|z2) pθ(y|z1, z2),

where p(z1) = p(z2) = p(z) = N(z; 0, I) is a Gaussian prior, and pθ(y|z1, z2) is referred to as the embedding matcher as it matches from the embedding view (latent space).

Inference Model. According to the conditional independence properties in the generative model, the variational posterior qφ(z1, z2, y|x1, x2) can be factorized as:

qφ(z1, z2, y|x1, x2) = qφ(z1|x1) qφ(z2|x2) qφ(y|z1, z2)
                     = qφ(z1|x1) qφ(z2|x2) pθ(y|z1, z2),    (1)

where we model qφ(y|z1, z2) by the embedding matcher pθ(y|z1, z2). qφ(z1, z2, y|x1, x2) can also be factorized as:

qφ(z1, z2, y|x1, x2) = qφ(y|x1, x2) qφ(z1, z2|x1, x2, y),    (2)

where we model qφ(y|x1, x2) by a sentence pair interaction matching model to match from the interaction view. Thus qφ(y|x1, x2) is referred to as the interaction matcher and is adopted to make predictions at test time. In analogy to Co-Training [Blum and Mitchell, 1998], we assume that each of the embedding view and the interaction view is sufficient to train the corresponding matcher, and that the predictions from the two matchers are consistent in the inference process: qφ(y|x1, x2) = pθ(y|z1, z2). With this consistency assumption, we obtain the following from Equ (1) and Equ (2):

qφ(z1, z2|x1, x2, y) = qφ(z1|x1) qφ(z2|x2),
qφ(z1, z2, y|x1, x2) = qφ(y|x1, x2) qφ(z1|x1) qφ(z2|x2),

which are taken as the inference model in the labeled and unlabeled cases, respectively. Here the encoders qφ(z1|x1) = N(z1; µφ(x1), diag(σφ²(x1))) and qφ(z2|x2) = N(z2; µφ(x2), diag(σφ²(x2))) are diagonal Gaussians.

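To make the four components concrete, the sketch below shows one way the Gaussian encoder qφ(z|x) and the embedding matcher pθ(y|z1, z2) could be implemented in PyTorch (the framework used in Section 3). It is a minimal illustration, not the authors' released code: the module names, the [z1, z2, |z1−z2|, z1·z2] feature combination and the MLP sizes are our assumptions; the decoder and the interaction matcher are sketched later.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): Bi-LSTM over word embeddings -> diagonal Gaussian N(mu, diag(sigma^2))."""
    def __init__(self, vocab_size, d_emb=300, d_hid=300, d_z=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.LSTM(d_emb, d_hid, bidirectional=True, batch_first=True)
        self.mu = nn.Linear(2 * d_hid, d_z)        # head for the mean
        self.logvar = nn.Linear(2 * d_hid, d_z)    # head for the (log) variance

    def forward(self, x):                          # x: (batch, seq_len) word ids
        _, (h_n, _) = self.rnn(self.emb(x))        # h_n: (2, batch, d_hid)
        last = torch.cat([h_n[0], h_n[1]], dim=-1) # concat last forward/backward states
        return self.mu(last), self.logvar(last)

class EmbeddingMatcher(nn.Module):
    """p_theta(y|z1, z2): sentence encoding-based matcher over the latent codes."""
    def __init__(self, d_z=500, n_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * d_z, 300), nn.ReLU(),
                                 nn.Linear(300, n_classes))

    def forward(self, z1, z2):
        # assumed matching features; the paper only states that (z1, z2) are matched
        feats = torch.cat([z1, z2, (z1 - z2).abs(), z1 * z2], dim=-1)
        return self.mlp(feats)                     # unnormalized log p(y|z1, z2)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, eps ~ N(0, I): the reparameterization trick."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```

At test time only the interaction matcher qφ(y|x1, x2) is used for prediction; the components above serve the generative side of Figure 2 during training.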

Objective. The variational lower bound of the data likelihood is used as the objective, for both the labeled and the unlabeled data.

For a labeled sentence pair (x1, x2, y),

log pθ(x1, x2, y)
  ≥ E_{qφ(z1,z2|x1,x2,y)}[ log ( pθ(x1, x2, y, z1, z2) / qφ(z1, z2|x1, x2, y) ) ]
  = E_{qφ(z1|x1) qφ(z2|x2)}[ log pθ(x1|z1) + log pθ(x2|z2) + log pθ(y|z1, z2) ]
    − KL(qφ(z1|x1) || p(z1)) − KL(qφ(z2|x2) || p(z2))
  ≡ −L(x1, x2, y).

We rewrite L(x1, x2, y) as

L(x1, x2, y) = −R(x1, x2) − D(x1, x2, y)
             + KL(qφ(z1|x1) || p(z1)) + KL(qφ(z2|x2) || p(z2)),    (3)

where −R(x1, x2) = E_{qφ(z1|x1) qφ(z2|x2)}[ −log pθ(x1|z1) − log pθ(x2|z2) ] is the reconstruction loss of x1 and x2; −D(x1, x2, y) = E_{qφ(z1|x1) qφ(z2|x2)}[ −log pθ(y|z1, z2) ] can be seen as an expected discriminative loss for the embedding matcher pθ(y|z1, z2); and the last two KL-divergence terms regularize the posteriors to be close to the priors.

For an unlabeled sentence pair (x1, x2),

log pθ(x1, x2)
  ≥ E_{qφ(z1,z2,y|x1,x2)}[ log ( pθ(x1, x2, y, z1, z2) / qφ(z1, z2, y|x1, x2) ) ]
  = E_{qφ(y|x1,x2)}[ E_{qφ(z1,z2|x1,x2,y)}[ log ( pθ(x1, x2, y, z1, z2) / qφ(z1, z2|x1, x2, y) ) ] − log qφ(y|x1, x2) ]
  = Σ_y qφ(y|x1, x2) (−L(x1, x2, y)) + H[qφ(y|x1, x2)]
  ≡ −U(x1, x2).    (4)

Since qφ(y|x1, x2) is not included in the expression of L(x1, x2, y), we explicitly add a discriminative loss for qφ(y|x1, x2), weighted by α:

L^α(x1, x2, y) = L(x1, x2, y) + α[ −log qφ(y|x1, x2) ].    (5)

Finally, we obtain the objective function to be minimized on the entire dataset Dl ∪ Du:

J = Σ_{(x1,x2,y)∈Dl} L^α(x1, x2, y) + Σ_{(x1,x2)∈Du} U(x1, x2).    (6)

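A direct way to compute these losses in code is to evaluate L(x1, x2, y) for every class and combine it with the interaction matcher's distribution. The sketch below (PyTorch, with our own illustrative function and argument names) mirrors Equs (3) and (5) for a labeled pair and Equ (4) for an unlabeled pair, enumerating the C classes exactly rather than sampling y.

```python
import torch
import torch.nn.functional as F

def labeled_loss(recon_nll, kl1, kl2, emb_logits, int_logits, y, alpha=20.0):
    """L^alpha(x1,x2,y) of Equ (3) and (5).
    recon_nll: -log p(x1|z1) - log p(x2|z2) summed over tokens, shape (batch,)
    kl1, kl2 : KL(q(z|x) || p(z)) for each sentence, shape (batch,)
    emb_logits, int_logits: class scores of the embedding / interaction matcher, (batch, C)
    """
    d_loss = F.cross_entropy(emb_logits, y, reduction="none")   # -log p(y|z1,z2)
    L = recon_nll + d_loss + kl1 + kl2                           # Equ (3)
    return L + alpha * F.cross_entropy(int_logits, y, reduction="none")  # Equ (5)

def unlabeled_loss(L_per_class, int_logits):
    """U(x1,x2) of Equ (4): enumerate all C classes exactly (no sampling over y).
    L_per_class: L(x1,x2,y) evaluated for every class y, shape (batch, C)
    int_logits : interaction matcher scores for q(y|x1,x2), shape (batch, C)
    """
    q_y = F.softmax(int_logits, dim=-1)
    entropy = -(q_y * F.log_softmax(int_logits, dim=-1)).sum(-1)
    return (q_y * L_per_class).sum(-1) - entropy                 # Equ (4)
```

Because y is enumerated, no sampling-based estimator is needed for U; the next subsection analyzes the gradients this expression produces.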
2.2 Implicit Co-Training

In this section, we show that the training process of DV-VAE is implicitly related to Co-Training [Blum and Mitchell, 1998], where two classifiers are iteratively trained to explicitly provide pseudo labels on unlabeled data for each other. Since in DV-VAE the embedding matcher pθ(y|z1, z2) and the interaction matcher qφ(y|x1, x2) are simultaneously trained through the optimization of J in Equ (6), we analyze their gradients to study the training process. For clarity, we specify the parameters in qφ(y|x1, x2) as φm and the parameters in pθ(y|z1, z2) as θm, respectively.

(i) For a labeled sentence pair, minimizing L^α(x1, x2, y) also minimizes the discriminative losses (−D(x1, x2, y) and −log qφ(y|x1, x2)) for the two matchers, which are independently trained just as in supervised learning.

(ii) For an unlabeled sentence pair, we analyze the gradients of U(x1, x2) in Equ (4) w.r.t. θm and φm, respectively. For the embedding matcher pθm(y|z1, z2),

∇θm U(x1, x2) = Σ_y qφm(y|x1, x2) ∇θm[ −D(x1, x2, y) ],    (7)

where the discriminative gradient ∇θm[ −D(x1, x2, y) ] is reweighted by the predicted distribution qφm(y|x1, x2) from the interaction matcher.

For the interaction matcher qφm(y|x1, x2),¹

∇φm U(x1, x2)
  = Σ_y L(x1, x2, y) ∇φm[ qφm(y|x1, x2) ] − ∇φm H[qφm(y|x1, x2)]    (8)
  = −Σ_y D(x1, x2, y) ∇φm[ qφm(y|x1, x2) ] − ∇φm H[qφm(y|x1, x2)]    (9)
  = Σ_y qφm(y|x1, x2) { D(x1, x2, y) ∇φm[ −log qφm(y|x1, x2) ] } − ∇φm H[qφm(y|x1, x2)].    (10)

The first term can be seen as an application of the REINFORCE algorithm [Williams, 1992] from Reinforcement Learning, such that D(x1, x2, y), the matching degree y, the sentence pair (x1, x2) and qφm(y|x1, x2) correspond to the reward signal, action, state and decision model, respectively [Mnih and Gregor, 2014; Xu et al., 2017]. The second term maximizes the entropy of qφm(y|x1, x2), and is treated as a regularizer.

As the training process goes on, the two matchers distinguish correct and incorrect labels better and better through the supervised loss L^α(x1, x2, y). Therefore, for unlabeled sentence pairs, the weight for the embedding matcher's discriminative gradient in Equ (7) becomes larger on correct ys, and the interaction matcher receives larger reward signals when it gives correct predictions in Equ (10). This is an alternative way of providing pseudo labels for unlabeled data, and can be treated as an implicit Co-Training mechanism.

¹ To derive Equ (8), note that φm does not appear in L(x1, x2, y); to derive Equ (9), note that Σ_y K ∇ω q(y; ω) = K ∇ω Σ_y q(y; ω) = K ∇ω 1 = 0 when K is irrelevant to y, which is the case for R(x1, x2) and the KL terms in L(x1, x2, y); to derive Equ (10), use ∇ω q(y; ω) = q(y; ω) ∇ω log q(y; ω).

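The two identities used in the footnote can also be checked numerically with autograd. The toy snippet below (our own illustration, with a random 3-class distribution q(y; ω) parameterized by logits) verifies that Σ_y K ∇ω q(y; ω) = 0 for a constant K, and that ∇ω q(y; ω) = q(y; ω) ∇ω log q(y; ω).

```python
import torch

torch.manual_seed(0)
C = 3
logits = torch.randn(C, requires_grad=True)    # omega: parameters of q(y; omega)
q = torch.softmax(logits, dim=0)

# (a) sum_y K * dq(y)/domega vanishes when K does not depend on y,
#     because sum_y q(y) is the constant 1.
K = 2.7
grad_a = torch.autograd.grad((K * q).sum(), logits, retain_graph=True)[0]
print(grad_a)                                   # ~0 up to floating-point error

# (b) dq(y0)/domega equals q(y0) * d log q(y0) / domega for any class y0.
y0 = 1
g_q = torch.autograd.grad(q[y0], logits, retain_graph=True)[0]
g_logq = torch.autograd.grad(torch.log_softmax(logits, dim=0)[y0], logits)[0]
print(torch.allclose(g_q, q[y0].detach() * g_logq, atol=1e-6))   # True
```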

2.3 Model Implementation

We present an implementation of DV-VAE (shown in Figure 3) in detail, which consists of an encoder qφ(z|x), a decoder pθ(x|z), an embedding matcher pθ(y|z1, z2) and an interaction matcher qφ(y|x1, x2).

(Figure 3 panels: encoder (Bi-LSTM producing µ, σ and contexts H1, H2 from word embeddings E1, E2); decoder (masked dilated CNN with dilation rates 1, 2, ...); interaction matcher (interaction matrices M1–M4 followed by a CNN predicting y); embedding matcher; and the bottleneck residual block used in the decoder (1×1 conv with d channels, ReLU, masked dilated 1×k conv with d channels, ReLU, 1×1 conv with 2d channels, plus a residual connection).)

Figure 3: An implementation of DV-VAE.

Encoder qφ(z|x). We adopt a bidirectional LSTM (Bi-LSTM) as the encoder. The last hidden states of the forward and backward directions are concatenated and fed into two Multi-Layer Perceptrons (MLPs) to compute the mean µ and the standard deviation σ for qφ(z|x) = N(z; µ, diag(σ²)).

Decoder pθ(x|z). To avoid the training collapse of VAE-based text generation models, we adopt a dilated CNN sequence decoder that is most similar to the one in [Yang et al., 2017]: Latent code z is concatenated to every word embedding of x to serve as the decoder input. Every feature dimension of the decoder input is treated as a channel, and masked one-dimensional convolution proceeds in the time dimension. Dilated convolution with rate r is applied so that one in every r consecutive inputs is picked to convolve with the filter. Multiple dilated convolution layers are stacked with exponentially increasing dilation rates [2^0, 2^1, 2^2, ...], and every layer is wrapped in a bottleneck residual block [He et al., 2016] to ease optimization. The outputs of the last layer are fed into a fully connected layer followed by a softmax nonlinearity to produce the probability pθ(wj | w<j, z).

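As a concrete illustration of one decoder layer, the sketch below implements a bottleneck residual block with a masked (causal) dilated 1D convolution, using the channel sizes reported in Section 3.1 (300 internal, 600 external, filter size 3); the exact layer ordering inside the block is our reading of Figure 3, not the authors' code.

```python
import torch
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """One decoder layer (assumed hyper-parameters): 1x1 conv (2d -> d) -> ReLU ->
    masked (causal) dilated 1xk conv (d -> d) -> ReLU -> 1x1 conv (d -> 2d),
    plus a residual connection."""
    def __init__(self, d_outer=600, d_inner=300, k=3, dilation=1):
        super().__init__()
        self.reduce = nn.Conv1d(d_outer, d_inner, kernel_size=1)
        self.causal_pad = (k - 1) * dilation          # pad only on the left
        self.dilated = nn.Conv1d(d_inner, d_inner, kernel_size=k, dilation=dilation)
        self.expand = nn.Conv1d(d_inner, d_outer, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, h):                              # h: (batch, 2d, seq_len)
        out = self.relu(self.reduce(h))
        out = nn.functional.pad(out, (self.causal_pad, 0))   # mask future positions
        out = self.relu(self.dilated(out))
        return h + self.expand(out)                    # residual connection

# Stack with exponentially increasing dilation rates, e.g. [1, 2, 4] as in Section 3.1.
decoder = nn.Sequential(*[BottleneckResidualBlock(dilation=2 ** i) for i in range(3)])
x = torch.randn(8, 600, 20)                            # [z; word embedding] channels
print(decoder(x).shape)                                # torch.Size([8, 600, 20])
```

Left-only padding keeps the convolution causal, so position j never sees words after wj, which is what lets the stack model pθ(wj | w<j, z).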
Interaction Matcher qφ(y|x1, x2). A Bi-LSTM (sharing parameters with the encoder) reads the word embeddings E1 = {e11, e12, ..., e1T1} of x1 and E2 = {e21, e22, ..., e2T2} of x2, and produces the contexts H1 = {h11, h12, ..., h1T1} for x1 and H2 = {h21, h22, ..., h2T2} for x2. We match every word embedding in E1 with those in E2, and match every context in H1 with those in H2. We also cross-match the contexts in H1 (or H2) with the words in E2 (or E1) to catch the matching patterns between contexts and words. Therefore, we obtain four interaction matrices M1, M2, M3, M4 ∈ R^{T1×T2}:

M1(i, j) = tanh(h1i^T h2j),
M2(i, j) = tanh((1/2)(h1i_fw^T e2j + h1i_bw^T e2j)),
M3(i, j) = tanh((1/2)(e1i^T h2j_fw + e1i^T h2j_bw)),
M4(i, j) = tanh(e1i^T e2j),

where h_fw and h_bw denote the forward and backward halves of a Bi-LSTM context, and we set dE = dH to allow for the dot products between h_fw (or h_bw) and e. Then M1, M2, M3, M4 are stacked as a four-channel input for a CNN, whose output is passed through an MLP to predict the final matching degree y.

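The four interaction matrices amount to a few matrix products. The following sketch (illustrative function name and tensor layout assumed, single sentence pair without batching) follows the definitions of M1–M4 above.

```python
import torch

def interaction_matrices(e1, h1, e2, h2):
    """Build the four T1 x T2 interaction matrices M1..M4 described above.
    e1: (T1, dE) word embeddings of x1    h1: (T1, 2*dH) Bi-LSTM contexts of x1
    e2: (T2, dE) word embeddings of x2    h2: (T2, 2*dH) Bi-LSTM contexts of x2
    Assumes dE == dH so the forward/backward context halves match the embeddings."""
    dH = h1.size(1) // 2
    h1_fw, h1_bw = h1[:, :dH], h1[:, dH:]
    h2_fw, h2_bw = h2[:, :dH], h2[:, dH:]

    m1 = torch.tanh(h1 @ h2.t())                              # context-context
    m2 = torch.tanh(0.5 * (h1_fw @ e2.t() + h1_bw @ e2.t()))  # context-word
    m3 = torch.tanh(0.5 * (e1 @ h2_fw.t() + e1 @ h2_bw.t()))  # word-context
    m4 = torch.tanh(e1 @ e2.t())                              # word-word
    return torch.stack([m1, m2, m3, m4], dim=0)               # (4, T1, T2)

# Toy usage with dE = dH = 300
T1, T2, d = 5, 4, 300
M = interaction_matrices(torch.randn(T1, d), torch.randn(T1, 2 * d),
                         torch.randn(T2, d), torch.randn(T2, 2 * d))
print(M.shape)   # torch.Size([4, 5, 4])
```

The stacked (4, T1, T2) tensor is then the four-channel input of the CNN mentioned above.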

Models                          | SNLI 28k   | SNLI 59k   | SNLI 120k  | Quora 1k   | Quora 5k   | Quora 10k  | Quora 25k
LSTM-AE [Zhao et al., 2018]     | 59.9       | 64.6       | 68.5       | 59.3       | 63.8       | 67.2       | 70.9
DeConv-AE [Shen et al., 2018]   | 62.1       | 65.5       | 68.7       | 60.2       | 65.1       | 67.7       | 71.6
LSTM-ARAE [Zhao et al., 2018]   | 62.5       | 66.8       | 70.9       | -          | -          | -          | -
LSTM-LVM [Shen et al., 2018]    | 64.7       | 67.5       | 71.1       | 62.9       | 67.6       | 69.0       | 72.4
DeConv-LVM [Shen et al., 2018]  | 67.2       | 69.3       | 72.2       | 65.1       | 69.4       | 70.5       | 73.7
Our interaction matcher*        | 70.88±0.84 | 73.70±0.70 | 77.35±0.16 | 63.68±1.44 | 71.97±0.60 | 72.39±1.64 | 75.92±0.58
DV-VAE                          | 71.19±1.10 | 74.54±0.68 | 78.79±0.27 | 68.12±1.02 | 73.07±0.37 | 74.69±0.55 | 77.04±0.22

Table 1: Matching accuracy on SNLI and Quora, in percentage. *Our interaction matcher is trained on Dl in a supervised manner, while DV-VAE and the baselines are trained on Dl and Du in a semi-supervised manner.

3 Experiments

3.1 Experimental Setup

Datasets. The statistics of the three datasets are listed in Table 2. (i) For SNLI and Quora, we experiment on five random labeled/unlabeled splits of the train set for each amount of labeled data, and report the mean and standard deviation of the matching accuracies. (ii) For the CQA dataset, the original train set is used as Dl, and we additionally adopt WikiQA [Yang et al., 2015], which has 29k QA pairs, as Du by removing all its labels.²

Datasets | #Train  | #Dev   | #Test  | #Classes
SNLI     | 549,367 | 9,842  | 9,824  | 3
Quora    | 384,348 | 10,000 | 10,000 | 2
CQA      | 16,541  | 1,645  | 1,976  | 3

Table 2: Dataset statistics.

Model Configurations. We set dE = 300 and dZ = 500. Word embeddings are initialized with GloVe [Pennington et al., 2014]. We share the parameters of the Bi-LSTMs in the encoder and the interaction matcher. The hidden state size dH is set to 300 for both directions. For the decoder, we choose a 3-layer dilated CNN, with dilation rates [1, 2, 4]. In all the bottleneck residual blocks, the filter size is set to 3 and the channel numbers are set to 300 internally and 600 externally. In the interaction matcher, we adopt a 2-layer CNN with filter sizes 5×5×8 and 3×3×16, such that a dynamic pooling is applied after the first layer to get 4×4 fixed-sized feature maps and a max pooling is applied after the second layer, followed by a 3-layer MLP with 16, 8 and C hidden units. ReLU is used as the nonlinearity and Batch Normalization is adopted in each layer.

Training Details. We use the reparameterization trick and sample one z from qφ(z|x) to estimate the variational lower bounds [Kingma and Welling, 2014]. α in Equ (5) is set to 20. We substitute the two KL-divergence terms in L(x1, x2, y) with max(γ, KL(qφ(z1|x1) || p(z1)) + KL(qφ(z2|x2) || p(z2))) to force the decoder to rely more on the latent codes [Yang et al., 2017]. We set γ = 10 for SNLI, and γ = 20 for the other experiments. SGD with momentum 0.9 and weight decay 1×10⁻³ is adopted in optimization. We use an initial learning rate of 3×10⁻³. The batch size is tuned on {32, 64, 128} for each experiment. We sample half of the minibatch from Dl and half from Du in each iteration. We adopt early stopping, where performance on the dev set is evaluated every time Dl is traversed. A dropout rate of 0.1 is used in each layer of the decoder network. Experiments are implemented in PyTorch.

² We use the train/dev/test split of [Wang et al., 2017] on Quora. Due to memory limitations, we truncate the texts in Quora, CQA and WikiQA to have no more than 100, 500 and 100 tokens, respectively.

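The sketch below shows how the two training tricks described above could look in code: the thresholded KL term max(γ, KL1 + KL2) and a minibatch drawn half from Dl and half from Du. Here model.labeled_loss and model.unlabeled_loss are assumed wrappers around the loss sketches given earlier, not an actual API.

```python
import torch
from torch.optim import SGD

def thresholded_kl(kl1, kl2, gamma=10.0):
    """max(gamma, KL1 + KL2): below the threshold the term is constant (no gradient),
    pushing the decoder to rely more on the latent codes."""
    return torch.clamp(kl1 + kl2, min=gamma)

def train_step(model, opt, labeled_batch, unlabeled_batch):
    """One step on J of Equ (6) with a half-labeled, half-unlabeled minibatch."""
    x1_l, x2_l, y_l = labeled_batch            # half of the minibatch, from Dl
    x1_u, x2_u = unlabeled_batch               # the other half, from Du
    loss = model.labeled_loss(x1_l, x2_l, y_l).mean() + \
           model.unlabeled_loss(x1_u, x2_u).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Optimizer settings from Training Details:
# opt = SGD(model.parameters(), lr=3e-3, momentum=0.9, weight_decay=1e-3)
```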

3.2 Evaluations on Text Matching

Natural Language Inference. First, we compare DV-VAE with semi-supervised baselines that combine autoencoders with sentence encoding-based matching models, and the results are reported in Table 1. The results indicate that DV-VAE consistently outperforms all the semi-supervised baselines by a large margin (3.9% ∼ 6.5%) under all the 3 labeled data sizes. These results demonstrate the importance of incorporating the interaction view in DV-VAE for semi-supervised text matching. Second, we report in Table 1 the results from our interaction matcher trained on Dl in a supervised manner. DV-VAE consistently outperforms the supervised interaction matcher, verifying its effectiveness in using unlabeled data to improve supervised learning.

Paraphrase Identification. We get similar results on Quora, as shown in Table 1. DV-VAE's accuracy gains over the semi-supervised baselines are consistently more than 3% for all the 4 labeled data sizes. DV-VAE also achieves further accuracy gains over the supervised interaction matcher, and when labeled data is scarce (|Dl| = 1k), the absolute improvement is up to 4.4%.

Community Question Answering. We compare our model with several strong supervised baselines in Table 3. These baselines and our interaction matcher are trained on Dl, and DV-VAE is trained on Dl and Du (WikiQA). The results show that DV-VAE outperforms all the baselines by leveraging the additional 29k unlabeled WikiQA sentence pairs, achieving an accuracy gain of 1.3% over the state-of-the-art method KEHNN [Wu et al., 2018]. Note that KEHNN leverages additional prior knowledge of the QA pairs while we leverage additional unlabeled QA pairs. This indicates that a sufficient amount of unlabeled data may play the role of prior knowledge in terms of the performance improvement.

Models                                    | Accuracy
Attentive-LSTM                            | 73.6
Match-LSTM                                | 74.3
ARC-II                                    | 71.5
MatchPyramid                              | 71.7
MV-LSTM                                   | 73.5
MultiGranCNN                              | 74.3
KEHNN [Wu et al., 2018] + prior knowledge | 75.3
Our interaction matcher                   | 74.4
DV-VAE + 29k unlabeled WikiQA data        | 76.6

Table 3: Matching accuracy on the CQA dataset, in percentage. Results in the first 7 rows are from [Wu et al., 2018].

3.3 Model Visualization

We first visualize the reward D(x1, x2, y) for the interaction matcher in Figure 4 to justify the implicit Co-Training mechanism. The distributions of D(x1, x2, ycorrect) and D(x1, x2, yincorrect) on Du are distinguishable, and the mean for D(x1, x2, ycorrect) is larger than that for D(x1, x2, yincorrect), which is statistically significant (p < 0.01). This demonstrates that the interaction matcher may receive larger reward signals by predicting correct ys than incorrect ones on the unlabeled data. With the embedding matcher providing useful reward signals, the interaction matcher can effectively leverage unlabeled data.

Figure 4: Distribution of the reward signal D(x1, x2, y) on Du. (a) SNLI (|Dl| = 28k): (x1, x2, ycorrect) has µ̂ = −1.18, σ̂ = 0.67 and (x1, x2, yincorrect) has µ̂ = −1.38, σ̂ = 0.75. (b) Quora (|Dl| = 25k): (x1, x2, ycorrect) has µ̂ = −0.63, σ̂ = 0.50 and (x1, x2, yincorrect) has µ̂ = −1.07, σ̂ = 0.69.

We then visualize the learned decoder and embedding matcher in DV-VAE by generating labeled sentence pairs (x1, x2, y) from latent codes z1, z2 sampled from p(z) = N(z; 0, I). Some generated examples are shown in Table 4, demonstrating the capability of DV-VAE to learn the data manifold that is useful for semi-supervised classification.

x1: a child is playing with a soccer ball in the grass .
x2: a man is getting ready to throw a bowling ball down the lane .
y: neutral

x1: the woman is sitting on the bench .
x2: a woman in a black shirt is walking down the street .
y: contradiction

x1: a group of people are riding on a roller coaster .
x2: there is a group of people playing soccer .
y: contradiction

x1: two dogs are playing with a red ball .
x2: the dog is running .
y: entailment

x1: two young men , one wearing a white shirt and the other wearing a white shirt , are sitting on a bench .
x2: there is a man in a white shirt .
y: entailment

Table 4: Sentence pairs generated by DV-VAE trained on SNLI (|Dl| = 28k).

4 Related Work

Learning to match text sequences is a long-standing problem, and most state-of-the-art methods use a compare-aggregate architecture [Wang and Jiang, 2017], such as DIIN [Gong et al., 2018], CSRAN [Tay et al., 2018], MwAN [Tan et al., 2018] and KEHNN [Wu et al., 2018]. [Lan and Xu, 2018] compared a broad range of text matching models over eight datasets. They all focus on supervised learning, while we explore semi-supervised methods leveraging unlabeled text pairs to improve the performance of supervised methods.

Close to our work are recent applications of variational autoencoders [Kingma and Welling, 2014] and NVIL [Mnih and Gregor, 2014] in NLP. (i) Some focused on modeling a single piece of text: [Mnih and Gregor, 2014] and [Miao et al., 2016] used bag-of-words methods for document modeling; [Bowman et al., 2016b; Yang et al., 2017] and others explored VAE and various improved models to generate natural language sentences; and [Xu et al., 2017] adopted the semi-supervised VAE proposed in [Kingma et al., 2014] for text classification. (ii) Others developed specific VAE structures for sequence transduction tasks such as sentence compression [Miao and Blunsom, 2016], machine translation [Zhang et al., 2016], and dialogue generation [Serban et al., 2017]. (iii) To our knowledge, there are few studies modeling a pair of texts with a VAE, except [Shen et al., 2018], which adopted deconvolutional networks in a VAE for semi-supervised text matching. However, this method matches texts from the embedding view only, while we further combine the interaction view.

Our work is also related to Multi-View Learning [Xu et al., 2013], where features can be separated into distinct subsets (views). Particularly relevant are Co-Training [Blum and Mitchell, 1998] and Co-Regularization [Sindhwani et al., 2005], where two models train each other on unlabeled data. However, instead of explicitly designing an algorithm or an objective to enable the Co-Training mechanism, we implicitly achieve it by maximizing the variational lower bound.

5 Conclusions

In this study, we have proposed Dual-View Variational AutoEncoder (DV-VAE) to unify the embedding view and the interaction view for semi-supervised text matching. Gradient analysis has also revealed an implicit Co-Training mechanism to explain the semi-supervised learning process. Finally, our experimental study has verified the effectiveness of DV-VAE. Further, our work is a step towards combining multi-view learning with neural network models, which seems a promising strategy for semi-supervised learning.

Acknowledgments

This work is supported in part by National Key R&D Program of China 2018YFB1700403, NSFC U1636210&61421003, Shenzhen Institute of Computing Sciences, and the Fundamental Research Funds for the Central Universities.


References

[Blum and Mitchell, 1998] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.

[Bowman et al., 2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, pages 632–642, 2015.

[Bowman et al., 2016a] Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. In ACL, pages 1466–1477, 2016.

[Bowman et al., 2016b] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In CoNLL, pages 10–21, 2016.

[Dolan and Brockett, 2005] Bill Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In IWP, 2005.

[Gong et al., 2018] Yichen Gong, Heng Luo, and Jian Zhang. Natural language inference over interaction space. In ICLR, 2018.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[Kingma and Welling, 2014] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

[Kingma et al., 2014] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.

[Lan and Xu, 2018] Wuwei Lan and Wei Xu. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In COLING, pages 3890–3902, 2018.

[Miao and Blunsom, 2016] Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP, 2016.

[Miao et al., 2016] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In ICML, pages 1727–1736, 2016.

[Mnih and Gregor, 2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, pages 1791–1799, 2014.

[Mueller and Thyagarajan, 2016] Jonas Mueller and Aditya Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In AAAI, pages 2786–2792, 2016.

[Nakov et al., 2015] Preslav Nakov, Lluís Màrquez, Walid Magdy, Alessandro Moschitti, Jim Glass, and Bilal Randeree. SemEval-2015 task 3: Answer selection in community question answering. In SemEval, 2015.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[Rocktäschel et al., 2016] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In ICLR, 2016.

[Serban et al., 2017] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301, 2017.

[Shen et al., 2018] Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. Deconvolutional latent-variable model for text sequence matching. In AAAI, pages 5438–5445, 2018.

[Sindhwani et al., 2005] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML Workshop, 2005.

[Tan et al., 2018] Chuanqi Tan, Furu Wei, Wenhui Wang, Weifeng Lv, and Ming Zhou. Multiway attention networks for modeling sentence pairs. In IJCAI, 2018.

[Tay et al., 2018] Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. Co-stack residual affinity networks with multi-level attention refinement for matching text sequences. In EMNLP, pages 4492–4502, 2018.

[Wang and Jiang, 2017] Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. In ICLR, 2017.

[Wang et al., 2017] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. In IJCAI, pages 4144–4150, 2017.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

[Wu et al., 2018] Yu Wu, Wei Wu, Can Xu, and Zhoujun Li. Knowledge enhanced hybrid neural network for text matching. In AAAI, pages 5586–5593, 2018.

[Xu et al., 2013] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv:1304.5634, 2013.

[Xu et al., 2017] Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In AAAI, pages 3358–3364, 2017.

[Yang et al., 2015] Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In EMNLP, pages 2013–2018, 2015.

[Yang et al., 2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In ICML, pages 3881–3890, 2017.

[Zhang et al., 2016] Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. Variational neural machine translation. In EMNLP, pages 521–530, 2016.

[Zhao et al., 2018] Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. Adversarially regularized autoencoders. In ICML, pages 5897–5906, 2018.
