Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Dual-View Variational Autoencoder for Semi-Supervised Text Matching

Zhongbin Xie and Shuai Ma
SKLSDE Lab, Beihang University, Beijing, China
Beijing Advanced Innovation Center for Big Data and Brain Computing, Beijing, China
{xiezb, mashuai}@buaa.edu.cn

Abstract

Semantically matching two text sequences (usually two sentences) is a fundamental problem in NLP. Most previous methods either encode each of the two sentences into a vector representation (sentence-level embedding) or leverage word-level interaction features between the two sentences. In this study, we propose to take the sentence-level embedding features and the word-level interaction features as two distinct views of a sentence pair, and unify them within a framework of Variational Autoencoders such that the sentence pair is matched in a semi-supervised manner. The proposed model is referred to as Dual-View Variational AutoEncoder (DV-VAE), where the optimization of the variational lower bound can be interpreted as an implicit Co-Training mechanism for two matching models over distinct views. Experiments on SNLI, Quora and a Community Question Answering dataset demonstrate the superiority of our DV-VAE over several strong semi-supervised and supervised text matching models.

(Figure 1 sketch: (a) a sentence encoding-based model encodes Sentence 1 "walk together down a street" and Sentence 2 "walking hand in hand" into sentence embeddings, the embedding view, from which the matching degree is predicted; (b) a sentence pair interaction model aligns the two sentences with word alignment methods and predicts the matching degree from the interaction features, the interaction view.)

Figure 1: Two categories of text matching models: They leverage sentence embeddings and word-level interaction features, respectively, which can be seen as two distinct views of the sentence pair.

1 Introduction

The need to semantically match two text sequences arises in many Natural Language Processing problems, where a central task is to compute the matching degree between two sentences and determine their semantic relationship. For instance, in Paraphrase Identification [Dolan and Brockett, 2005], whether one sentence is a paraphrase of another has to be determined; in Question Answering [Yang et al., 2015], a matching score is calculated for a question and each of its candidate answers for making decisions; and in Natural Language Inference [Bowman et al., 2015], the relationship between a premise and a hypothesis is classified as entailment, neutral or contradiction.

Most previous studies on text matching focus on developing supervised models with deep neural networks. These models can be essentially divided into two categories: (i) sentence encoding-based models, which separately encode each of the two sentences into a vector representation (sentence embedding) and then match between the two vectors [Bowman et al., 2016a; Mueller and Thyagarajan, 2016], and (ii) sentence pair interaction models, which use some sort of word alignment method, such as interaction matrices [Gong et al., 2018; Wu et al., 2018] or attention mechanisms [Rocktäschel et al., 2016; Wang and Jiang, 2017], to obtain fine-grained interaction features for predicting the matching degree. Sentence encoding-based models leverage global sentence representations with high-level semantic features, while sentence pair interaction models leverage word-by-word interaction features containing local matching patterns, as illustrated in Figure 1.

With the recent advances in deep generative models, some studies have begun to employ variational autoencoders (VAEs) [Kingma and Welling, 2014] to learn informative sentence embeddings for various downstream NLP problems, including text matching [Bowman et al., 2016b; Zhao et al., 2018; Shen et al., 2018]. They leverage a VAE to encode sentences into latent codes, which are used as sentence embeddings for a sentence encoding-based matching model. The VAE and the matching model can be jointly trained in a semi-supervised manner, leveraging large amounts of unlabeled data to improve matching performance. However, these models are limited to the global semantic features in the sentence embeddings, leaving out the word-level interaction features that have been proved important for predicting matching degrees in the supervised case [Lan and Xu, 2018].

Motivated by these observations, we propose to unify the sentence-level embedding features and the word-level interaction features within a variational autoencoder, leveraging both labeled and unlabeled sentence pairs in a semi-supervised manner for text matching. We take inspiration from Co-Training [Blum and Mitchell, 1998], where two classifiers over two distinct views of the data examples are trained to produce consistent predictions on the unlabeled data. For a sentence pair, the aforementioned two levels of features are taken as two distinct views, namely the embedding view and the interaction view. The proposed model is denoted Dual-View Variational AutoEncoder (DV-VAE) (Figure 2). In the generative process, two sentences are generated from two latent variables, respectively. The matching degree is also generated from these two latent variables, treated as the embedding view, through a sentence encoding-based model. In the inference process, the matching degree is inferred from the interaction view through a sentence pair interaction model. During the optimization of the variational lower bound, the two matching models implicitly provide pseudo labels on unlabeled data for each other, which can be interpreted as an implicit Co-Training mechanism.

Our contributions are as follows: (i) We propose Dual-View Variational AutoEncoder (DV-VAE) to unify the embedding view and the interaction view of a sentence pair for semi-supervised text matching. An implicit Co-Training mechanism is also formulated to interpret the training process. (ii) We instantiate an implementation of DV-VAE and adopt a novel sentence pair interaction matching model, where interaction matrices across words and contexts are introduced to enrich the interaction features. (iii) Using three datasets: SNLI, Quora and a Community QA dataset, we empirically demonstrate the superiority of DV-VAE over several strong semi-supervised and supervised baselines.

(Figure 2 sketch: the generative model generates x1 from z1 via p(x1|z1), x2 from z2 via p(x2|z2), and y from (z1, z2) via p(y|z1, z2), i.e., matching from the embedding view; the inference model infers z1 via q(z1|x1), z2 via q(z2|x2), and y via q(y|x1, x2), i.e., matching from the interaction view.)

Figure 2: Probabilistic graphical model of DV-VAE. Solid lines denote the generative model, and dashed lines denote the inference model. Shaded x1, x2 are observed variables; z1, z2 are latent variables; and y is a semi-observed variable.

2 Dual-View Variational Autoencoder

Suppose that we have a labeled sentence pair set Dl and an unlabeled sentence pair set Du. (x1, x2, y) ∈ Dl denotes a labeled sentence pair, where x1, x2 are two sentences and y ∈ {1, 2, ..., C} is the matching degree of x1 and x2. Here y is discretized and text matching is treated as a classification problem. Similarly, (x1, x2) ∈ Du denotes an unlabeled pair. Our goal is to develop a semi-supervised text matching model using both the labeled and unlabeled data Dl and Du, which can improve upon the performance of supervised text matching models using the labeled data Dl only.

2.1 Model Architecture

The probabilistic graphical model of DV-VAE is shown in Figure 2. It consists of a generative model matching from the embedding view and an inference model matching from the interaction view.

Generative Model. The generative process of a sentence pair and its matching degree (x1, x2, y) is defined as follows: two continuous latent codes z1, z2 ∈ R^dZ are independently sampled from a prior p(z), and are used to generate x1 and x2 through a decoder pθ(x|z). Latent codes z1, z2 are also fed into a sentence encoding-based matching model pθ(y|z1, z2) to generate the matching degree y. The joint distribution can be explained by the following factorization:

pθ(x1, x2, y, z1, z2) = p(z1) pθ(x1|z1) p(z2) pθ(x2|z2) pθ(y|z1, z2),

where p(z1) = p(z2) = p(z) = N(z; 0, I) is a Gaussian prior, and pθ(y|z1, z2) is referred to as the embedding matcher as it matches from the embedding view (latent space).

Inference Model. According to the conditional independence properties in the generative model, the variational posterior qφ(z1, z2, y|x1, x2) can be factorized as:

qφ(z1, z2, y|x1, x2) = qφ(z1|x1) qφ(z2|x2) qφ(y|z1, z2)
                     = qφ(z1|x1) qφ(z2|x2) pθ(y|z1, z2),    (1)

where we model qφ(y|z1, z2) by the embedding matcher pθ(y|z1, z2). qφ(z1, z2, y|x1, x2) can also be factorized as:

qφ(z1, z2, y|x1, x2) = qφ(y|x1, x2) qφ(z1, z2|x1, x2, y),    (2)

where we model qφ(y|x1, x2) by a sentence pair interaction matching model to match from the interaction view. Thus qφ(y|x1, x2) is referred to as the interaction matcher and is adopted to make predictions at test time. In analogy to Co-Training [Blum and Mitchell, 1998], we assume that each of the embedding view and the interaction view is sufficient to train the corresponding matcher, and that the predictions from the two matchers are consistent in the inference process: qφ(y|x1, x2) = pθ(y|z1, z2). With this consistency assumption, we obtain the following from Equ (1) and Equ (2):

qφ(z1, z2|x1, x2, y) = qφ(z1|x1) qφ(z2|x2),
qφ(z1, z2, y|x1, x2) = qφ(y|x1, x2) qφ(z1|x1) qφ(z2|x2),

which are taken as the inference model in the labeled and unlabeled cases, respectively. Here the encoders qφ(z1|x1) = N(z1; µφ(x1), diag(σφ²(x1))) and qφ(z2|x2) = N(z2; µφ(x2), diag(σφ²(x2))) are diagonal Gaussians.

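To make the four components concrete, the sketch below shows one way the Gaussian encoder qφ(z|x) and the embedding matcher pθ(y|z1, z2) could be implemented in PyTorch (the framework used in Section 3). It is a minimal illustration, not the authors' released code: the module names, the [z1, z2, |z1−z2|, z1·z2] feature combination and the MLP sizes are our assumptions; the decoder and the interaction matcher are sketched later.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_phi(z|x): Bi-LSTM over word embeddings -> diagonal Gaussian N(mu, diag(sigma^2))."""
    def __init__(self, vocab_size, d_emb=300, d_hid=300, d_z=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.rnn = nn.LSTM(d_emb, d_hid, bidirectional=True, batch_first=True)
        self.mu = nn.Linear(2 * d_hid, d_z)        # head for the mean
        self.logvar = nn.Linear(2 * d_hid, d_z)    # head for the (log) variance

    def forward(self, x):                          # x: (batch, seq_len) word ids
        _, (h_n, _) = self.rnn(self.emb(x))        # h_n: (2, batch, d_hid)
        last = torch.cat([h_n[0], h_n[1]], dim=-1) # concat last forward/backward states
        return self.mu(last), self.logvar(last)

class EmbeddingMatcher(nn.Module):
    """p_theta(y|z1, z2): sentence encoding-based matcher over the latent codes."""
    def __init__(self, d_z=500, n_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4 * d_z, 300), nn.ReLU(),
                                 nn.Linear(300, n_classes))

    def forward(self, z1, z2):
        # assumed matching features; the paper only states that (z1, z2) are matched
        feats = torch.cat([z1, z2, (z1 - z2).abs(), z1 * z2], dim=-1)
        return self.mlp(feats)                     # unnormalized log p(y|z1, z2)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, eps ~ N(0, I): the reparameterization trick."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```

At test time only the interaction matcher qφ(y|x1, x2) is used for prediction; the components above serve the generative side of Figure 2 during training.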

Objective. The variational lower bound of the data likelihood is used as the objective, for both the labeled and the unlabeled data.

For a labeled sentence pair (x1, x2, y),

log pθ(x1, x2, y)
  ≥ E_{qφ(z1,z2|x1,x2,y)}[ log ( pθ(x1, x2, y, z1, z2) / qφ(z1, z2|x1, x2, y) ) ]
  = E_{qφ(z1|x1) qφ(z2|x2)}[ log pθ(x1|z1) + log pθ(x2|z2) + log pθ(y|z1, z2) ]
    − KL(qφ(z1|x1) || p(z1)) − KL(qφ(z2|x2) || p(z2))
  ≡ −L(x1, x2, y).

We rewrite L(x1, x2, y) as

L(x1, x2, y) = −R(x1, x2) − D(x1, x2, y)
             + KL(qφ(z1|x1) || p(z1)) + KL(qφ(z2|x2) || p(z2)),    (3)

where −R(x1, x2) = E_{qφ(z1|x1) qφ(z2|x2)}[ −log pθ(x1|z1) − log pθ(x2|z2) ] is the reconstruction loss of x1 and x2; −D(x1, x2, y) = E_{qφ(z1|x1) qφ(z2|x2)}[ −log pθ(y|z1, z2) ] can be seen as an expected discriminative loss for the embedding matcher pθ(y|z1, z2); and the last two KL-divergence terms regularize the posteriors to be close to the priors.

For an unlabeled sentence pair (x1, x2),

log pθ(x1, x2)
  ≥ E_{qφ(z1,z2,y|x1,x2)}[ log ( pθ(x1, x2, y, z1, z2) / qφ(z1, z2, y|x1, x2) ) ]
  = E_{qφ(y|x1,x2)}[ E_{qφ(z1,z2|x1,x2,y)}[ log ( pθ(x1, x2, y, z1, z2) / qφ(z1, z2|x1, x2, y) ) ] − log qφ(y|x1, x2) ]
  = Σ_y qφ(y|x1, x2) (−L(x1, x2, y)) + H[qφ(y|x1, x2)]
  ≡ −U(x1, x2).    (4)

Since qφ(y|x1, x2) is not included in the expression of L(x1, x2, y), we explicitly add a discriminative loss for qφ(y|x1, x2), weighted by α:

L^α(x1, x2, y) = L(x1, x2, y) + α[ −log qφ(y|x1, x2) ].    (5)

Finally, we obtain the objective function to be minimized on the entire dataset Dl ∪ Du:

J = Σ_{(x1,x2,y)∈Dl} L^α(x1, x2, y) + Σ_{(x1,x2)∈Du} U(x1, x2).    (6)

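A direct way to compute these losses in code is to evaluate L(x1, x2, y) for every class and combine it with the interaction matcher's distribution. The sketch below (PyTorch, with our own illustrative function and argument names) mirrors Equs (3) and (5) for a labeled pair and Equ (4) for an unlabeled pair, enumerating the C classes exactly rather than sampling y.

```python
import torch
import torch.nn.functional as F

def labeled_loss(recon_nll, kl1, kl2, emb_logits, int_logits, y, alpha=20.0):
    """L^alpha(x1,x2,y) of Equ (3) and (5).
    recon_nll: -log p(x1|z1) - log p(x2|z2) summed over tokens, shape (batch,)
    kl1, kl2 : KL(q(z|x) || p(z)) for each sentence, shape (batch,)
    emb_logits, int_logits: class scores of the embedding / interaction matcher, (batch, C)
    """
    d_loss = F.cross_entropy(emb_logits, y, reduction="none")   # -log p(y|z1,z2)
    L = recon_nll + d_loss + kl1 + kl2                           # Equ (3)
    return L + alpha * F.cross_entropy(int_logits, y, reduction="none")  # Equ (5)

def unlabeled_loss(L_per_class, int_logits):
    """U(x1,x2) of Equ (4): enumerate all C classes exactly (no sampling over y).
    L_per_class: L(x1,x2,y) evaluated for every class y, shape (batch, C)
    int_logits : interaction matcher scores for q(y|x1,x2), shape (batch, C)
    """
    q_y = F.softmax(int_logits, dim=-1)
    entropy = -(q_y * F.log_softmax(int_logits, dim=-1)).sum(-1)
    return (q_y * L_per_class).sum(-1) - entropy                 # Equ (4)
```

Because y is enumerated, no sampling-based estimator is needed for U; the next subsection analyzes the gradients this expression produces.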
2.2 Implicit Co-Training

In this section, we show that the training process of DV-VAE is implicitly related to Co-Training [Blum and Mitchell, 1998], where two classifiers are iteratively trained to explicitly provide pseudo labels on unlabeled data for each other. Since in DV-VAE the embedding matcher pθ(y|z1, z2) and the interaction matcher qφ(y|x1, x2) are simultaneously trained through the optimization of J in Equ (6), we analyze their gradients to study the training process. For clarity, we specify the parameters in qφ(y|x1, x2) as φm and the parameters in pθ(y|z1, z2) as θm, respectively.

(i) For a labeled sentence pair, minimizing L^α(x1, x2, y) also minimizes the discriminative losses (−D(x1, x2, y) and −log qφ(y|x1, x2)) for the two matchers, which are independently trained just as in supervised learning.

(ii) For an unlabeled sentence pair, we analyze the gradients of U(x1, x2) in Equ (4) w.r.t. θm and φm, respectively. For the embedding matcher pθm(y|z1, z2),

∇θm U(x1, x2) = Σ_y qφm(y|x1, x2) ∇θm[ −D(x1, x2, y) ],    (7)

where the discriminative gradient ∇θm[ −D(x1, x2, y) ] is reweighted by the predicted distribution qφm(y|x1, x2) from the interaction matcher.

For the interaction matcher qφm(y|x1, x2),¹

∇φm U(x1, x2)
  = Σ_y L(x1, x2, y) ∇φm[ qφm(y|x1, x2) ] − ∇φm H[qφm(y|x1, x2)]    (8)
  = −Σ_y D(x1, x2, y) ∇φm[ qφm(y|x1, x2) ] − ∇φm H[qφm(y|x1, x2)]    (9)
  = Σ_y qφm(y|x1, x2) { D(x1, x2, y) ∇φm[ −log qφm(y|x1, x2) ] } − ∇φm H[qφm(y|x1, x2)].    (10)

The first term can be seen as an application of the REINFORCE algorithm [Williams, 1992] from Reinforcement Learning, such that D(x1, x2, y), the matching degree y, the sentence pair (x1, x2) and qφm(y|x1, x2) correspond to the reward signal, action, state and decision model, respectively [Mnih and Gregor, 2014; Xu et al., 2017]. The second term maximizes the entropy of qφm(y|x1, x2), and is treated as a regularizer.

As the training process goes on, the two matchers distinguish correct and incorrect labels better and better through the supervised loss L^α(x1, x2, y). Therefore, for unlabeled sentence pairs, the weight for the embedding matcher's discriminative gradient in Equ (7) becomes larger on correct ys, and the interaction matcher receives larger reward signals when it gives correct predictions in Equ (10). This is an alternative way of providing pseudo labels for unlabeled data, and can be treated as an implicit Co-Training mechanism.

¹ To derive Equ (8), note that φm does not appear in L(x1, x2, y); to derive Equ (9), note that Σ_y K ∇ω q(y; ω) = K ∇ω Σ_y q(y; ω) = K ∇ω 1 = 0 when K is irrelevant to y, which is the case for R(x1, x2) and the KL terms in L(x1, x2, y); to derive Equ (10), use ∇ω q(y; ω) = q(y; ω) ∇ω log q(y; ω).

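The two identities used in the footnote can also be checked numerically with autograd. The toy snippet below (our own illustration, with a random 3-class distribution q(y; ω) parameterized by logits) verifies that Σ_y K ∇ω q(y; ω) = 0 for a constant K, and that ∇ω q(y; ω) = q(y; ω) ∇ω log q(y; ω).

```python
import torch

torch.manual_seed(0)
C = 3
logits = torch.randn(C, requires_grad=True)    # omega: parameters of q(y; omega)
q = torch.softmax(logits, dim=0)

# (a) sum_y K * dq(y)/domega vanishes when K does not depend on y,
#     because sum_y q(y) is the constant 1.
K = 2.7
grad_a = torch.autograd.grad((K * q).sum(), logits, retain_graph=True)[0]
print(grad_a)                                   # ~0 up to floating-point error

# (b) dq(y0)/domega equals q(y0) * d log q(y0) / domega for any class y0.
y0 = 1
g_q = torch.autograd.grad(q[y0], logits, retain_graph=True)[0]
g_logq = torch.autograd.grad(torch.log_softmax(logits, dim=0)[y0], logits)[0]
print(torch.allclose(g_q, q[y0].detach() * g_logq, atol=1e-6))   # True
```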

2.3 Model Implementation

We present an implementation of DV-VAE (shown in Figure 3) in detail, which consists of an encoder qφ(z|x), a decoder pθ(x|z), an embedding matcher pθ(y|z1, z2) and an interaction matcher qφ(y|x1, x2).

(Figure 3 panels: encoder (Bi-LSTM producing µ, σ and contexts H1, H2 from word embeddings E1, E2); decoder (masked dilated CNN with dilation rates 1, 2, ...); interaction matcher (interaction matrices M1–M4 followed by a CNN predicting y); embedding matcher; and the bottleneck residual block used in the decoder (1×1 conv with d channels, ReLU, masked dilated 1×k conv with d channels, ReLU, 1×1 conv with 2d channels, plus a residual connection).)

Figure 3: An implementation of DV-VAE.

Encoder qφ(z|x). We adopt a bidirectional LSTM (Bi-LSTM) as the encoder. The last hidden states of the forward and backward directions are concatenated and fed into two Multi-Layer Perceptrons (MLPs) to compute the mean µ and the standard deviation σ for qφ(z|x) = N(z; µ, diag(σ²)).

Decoder pθ(x|z). To avoid the training collapse of VAE-based text generation models, we adopt a dilated CNN sequence decoder that is most similar to the one in [Yang et al., 2017]: Latent code z is concatenated to every word embedding of x to serve as the decoder input. Every feature dimension of the decoder input is treated as a channel, and masked one-dimensional convolution proceeds in the time dimension. Dilated convolution with rate r is applied so that one in every r consecutive inputs is picked to convolve with the filter. Multiple dilated convolution layers are stacked with exponentially increasing dilation rates [2^0, 2^1, 2^2, ...], and every layer is wrapped in a bottleneck residual block [He et al., 2016] to ease optimization. The outputs of the last layer are fed into a fully connected layer followed by a softmax nonlinearity to produce the probability pθ(wj | w<j, z).

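As a concrete illustration of one decoder layer, the sketch below implements a bottleneck residual block with a masked (causal) dilated 1D convolution, using the channel sizes reported in Section 3.1 (300 internal, 600 external, filter size 3); the exact layer ordering inside the block is our reading of Figure 3, not the authors' code.

```python
import torch
import torch.nn as nn

class BottleneckResidualBlock(nn.Module):
    """One decoder layer (assumed hyper-parameters): 1x1 conv (2d -> d) -> ReLU ->
    masked (causal) dilated 1xk conv (d -> d) -> ReLU -> 1x1 conv (d -> 2d),
    plus a residual connection."""
    def __init__(self, d_outer=600, d_inner=300, k=3, dilation=1):
        super().__init__()
        self.reduce = nn.Conv1d(d_outer, d_inner, kernel_size=1)
        self.causal_pad = (k - 1) * dilation          # pad only on the left
        self.dilated = nn.Conv1d(d_inner, d_inner, kernel_size=k, dilation=dilation)
        self.expand = nn.Conv1d(d_inner, d_outer, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, h):                              # h: (batch, 2d, seq_len)
        out = self.relu(self.reduce(h))
        out = nn.functional.pad(out, (self.causal_pad, 0))   # mask future positions
        out = self.relu(self.dilated(out))
        return h + self.expand(out)                    # residual connection

# Stack with exponentially increasing dilation rates, e.g. [1, 2, 4] as in Section 3.1.
decoder = nn.Sequential(*[BottleneckResidualBlock(dilation=2 ** i) for i in range(3)])
x = torch.randn(8, 600, 20)                            # [z; word embedding] channels
print(decoder(x).shape)                                # torch.Size([8, 600, 20])
```

Left-only padding keeps the convolution causal, so position j never sees words after wj, which is what lets the stack model pθ(wj | w<j, z).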
Interaction Matcher qφ(y|x1, x2). A Bi-LSTM (sharing parameters with the encoder) reads the word embeddings E1 = {e11, e12, ..., e1T1} of x1 and E2 = {e21, e22, ..., e2T2} of x2, and produces the contexts H1 = {h11, h12, ..., h1T1} for x1 and H2 = {h21, h22, ..., h2T2} for x2. We match every word embedding in E1 with those in E2, and match every context in H1 with those in H2. We also cross-match the contexts in H1 (or H2) with the words in E2 (or E1) to catch the matching patterns between contexts and words. Therefore, we obtain four interaction matrices M1, M2, M3, M4 ∈ R^{T1×T2}:

M1(i, j) = tanh(h1i^T h2j),
M2(i, j) = tanh((1/2)(h1i_fw^T e2j + h1i_bw^T e2j)),
M3(i, j) = tanh((1/2)(e1i^T h2j_fw + e1i^T h2j_bw)),
M4(i, j) = tanh(e1i^T e2j),

where h_fw and h_bw denote the forward and backward halves of a Bi-LSTM context, and we set dE = dH to allow for the dot products between h_fw (or h_bw) and e. Then M1, M2, M3, M4 are stacked as a four-channel input for a CNN, whose output is passed through an MLP to predict the final matching degree y.

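The four interaction matrices amount to a few matrix products. The following sketch (illustrative function name and tensor layout assumed, single sentence pair without batching) follows the definitions of M1–M4 above.

```python
import torch

def interaction_matrices(e1, h1, e2, h2):
    """Build the four T1 x T2 interaction matrices M1..M4 described above.
    e1: (T1, dE) word embeddings of x1    h1: (T1, 2*dH) Bi-LSTM contexts of x1
    e2: (T2, dE) word embeddings of x2    h2: (T2, 2*dH) Bi-LSTM contexts of x2
    Assumes dE == dH so the forward/backward context halves match the embeddings."""
    dH = h1.size(1) // 2
    h1_fw, h1_bw = h1[:, :dH], h1[:, dH:]
    h2_fw, h2_bw = h2[:, :dH], h2[:, dH:]

    m1 = torch.tanh(h1 @ h2.t())                              # context-context
    m2 = torch.tanh(0.5 * (h1_fw @ e2.t() + h1_bw @ e2.t()))  # context-word
    m3 = torch.tanh(0.5 * (e1 @ h2_fw.t() + e1 @ h2_bw.t()))  # word-context
    m4 = torch.tanh(e1 @ e2.t())                              # word-word
    return torch.stack([m1, m2, m3, m4], dim=0)               # (4, T1, T2)

# Toy usage with dE = dH = 300
T1, T2, d = 5, 4, 300
M = interaction_matrices(torch.randn(T1, d), torch.randn(T1, 2 * d),
                         torch.randn(T2, d), torch.randn(T2, 2 * d))
print(M.shape)   # torch.Size([4, 5, 4])
```

The stacked (4, T1, T2) tensor is then the four-channel input of the CNN mentioned above.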

Models                          | SNLI 28k   | SNLI 59k   | SNLI 120k  | Quora 1k   | Quora 5k   | Quora 10k  | Quora 25k
LSTM-AE [Zhao et al., 2018]     | 59.9       | 64.6       | 68.5       | 59.3       | 63.8       | 67.2       | 70.9
DeConv-AE [Shen et al., 2018]   | 62.1       | 65.5       | 68.7       | 60.2       | 65.1       | 67.7       | 71.6
LSTM-ARAE [Zhao et al., 2018]   | 62.5       | 66.8       | 70.9       | -          | -          | -          | -
LSTM-LVM [Shen et al., 2018]    | 64.7       | 67.5       | 71.1       | 62.9       | 67.6       | 69.0       | 72.4
DeConv-LVM [Shen et al., 2018]  | 67.2       | 69.3       | 72.2       | 65.1       | 69.4       | 70.5       | 73.7
Our interaction matcher*        | 70.88±0.84 | 73.70±0.70 | 77.35±0.16 | 63.68±1.44 | 71.97±0.60 | 72.39±1.64 | 75.92±0.58
DV-VAE                          | 71.19±1.10 | 74.54±0.68 | 78.79±0.27 | 68.12±1.02 | 73.07±0.37 | 74.69±0.55 | 77.04±0.22

Table 1: Matching accuracy on SNLI and Quora, in percentage. *Our interaction matcher is trained on Dl in a supervised manner, while DV-VAE and the baselines are trained on Dl and Du in a semi-supervised manner.

3 Experiments

3.1 Experimental Setup

Datasets. The statistics of the three datasets are listed in Table 2. (i) For SNLI and Quora, we experiment on five random labeled/unlabeled splits of the train set for each amount of labeled data, and report the mean and standard deviation of the matching accuracies. (ii) For the CQA dataset, the original train set is used as Dl, and we additionally adopt WikiQA [Yang et al., 2015], which has 29k QA pairs, as Du by removing all its labels.²

Datasets | #Train  | #Dev   | #Test  | #Classes
SNLI     | 549,367 | 9,842  | 9,824  | 3
Quora    | 384,348 | 10,000 | 10,000 | 2
CQA      | 16,541  | 1,645  | 1,976  | 3

Table 2: Dataset statistics.

Model Configurations. We set dE = 300 and dZ = 500. Word embeddings are initialized with GloVe [Pennington et al., 2014]. We share the parameters of the Bi-LSTMs in the encoder and the interaction matcher. The hidden state size dH is set to 300 for both directions. For the decoder, we choose a 3-layer dilated CNN, with dilation rates [1, 2, 4]. In all the bottleneck residual blocks, the filter size is set to 3 and the channel numbers are set to 300 internally and 600 externally. In the interaction matcher, we adopt a 2-layer CNN with filter sizes 5×5×8 and 3×3×16, such that a dynamic pooling is applied after the first layer to get 4×4 fixed-sized feature maps and a max pooling is applied after the second layer, followed by a 3-layer MLP with 16, 8 and C hidden units. ReLU is used as the nonlinearity and Batch Normalization is adopted in each layer.

Training Details. We use the reparameterization trick and sample one z from qφ(z|x) to estimate the variational lower bounds [Kingma and Welling, 2014]. α in Equ (5) is set to 20. We substitute the two KL-divergence terms in L(x1, x2, y) with max(γ, KL(qφ(z1|x1) || p(z1)) + KL(qφ(z2|x2) || p(z2))) to force the decoder to rely more on the latent codes [Yang et al., 2017]. We set γ = 10 for SNLI, and γ = 20 for the other experiments. SGD with momentum 0.9 and weight decay 1×10⁻³ is adopted in optimization. We use an initial learning rate of 3×10⁻³. The batch size is tuned on {32, 64, 128} for each experiment. We sample half of the minibatch from Dl and half from Du in each iteration. We adopt early stopping, where performance on the dev set is evaluated every time Dl is traversed. A dropout rate of 0.1 is used in each layer of the decoder network. Experiments are implemented in PyTorch.

² We use the train/dev/test split of [Wang et al., 2017] on Quora. Due to memory limitations, we truncate the texts in Quora, CQA and WikiQA to have no more than 100, 500 and 100 tokens, respectively.

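The sketch below shows how the two training tricks described above could look in code: the thresholded KL term max(γ, KL1 + KL2) and a minibatch drawn half from Dl and half from Du. Here model.labeled_loss and model.unlabeled_loss are assumed wrappers around the loss sketches given earlier, not an actual API.

```python
import torch
from torch.optim import SGD

def thresholded_kl(kl1, kl2, gamma=10.0):
    """max(gamma, KL1 + KL2): below the threshold the term is constant (no gradient),
    pushing the decoder to rely more on the latent codes."""
    return torch.clamp(kl1 + kl2, min=gamma)

def train_step(model, opt, labeled_batch, unlabeled_batch):
    """One step on J of Equ (6) with a half-labeled, half-unlabeled minibatch."""
    x1_l, x2_l, y_l = labeled_batch            # half of the minibatch, from Dl
    x1_u, x2_u = unlabeled_batch               # the other half, from Du
    loss = model.labeled_loss(x1_l, x2_l, y_l).mean() + \
           model.unlabeled_loss(x1_u, x2_u).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Optimizer settings from Training Details:
# opt = SGD(model.parameters(), lr=3e-3, momentum=0.9, weight_decay=1e-3)
```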

3.2 Evaluations on Text Matching

Natural Language Inference. First, we compare DV-VAE with semi-supervised baselines that combine autoencoders with sentence encoding-based matching models, and the results are reported in Table 1. The results indicate that DV-VAE consistently outperforms all the semi-supervised baselines by a large margin (3.9% ∼ 6.5%) under all the 3 labeled data sizes. These results demonstrate the importance of incorporating the interaction view in DV-VAE for semi-supervised text matching. Second, we report in Table 1 the results from our interaction matcher trained on Dl in a supervised manner. DV-VAE consistently outperforms the supervised interaction matcher, verifying its effectiveness in using unlabeled data to improve supervised learning.

Paraphrase Identification. We get similar results on Quora, as shown in Table 1. DV-VAE's accuracy gains over the semi-supervised baselines are consistently more than 3% for all the 4 labeled data sizes. DV-VAE also achieves further accuracy gains over the supervised interaction matcher, and when labeled data is scarce (|Dl| = 1k), the absolute improvement is up to 4.4%.

Community Question Answering. We compare our model with several strong supervised baselines in Table 3. These baselines and our interaction matcher are trained on Dl, and DV-VAE is trained on Dl and Du (WikiQA). The results show that DV-VAE outperforms all the baselines by leveraging the additional 29k unlabeled WikiQA sentence pairs, achieving an accuracy gain of 1.3% over the state-of-the-art method KEHNN [Wu et al., 2018]. Note that KEHNN leverages additional prior knowledge of the QA pairs while we leverage additional unlabeled QA pairs. This indicates that a sufficient amount of unlabeled data may play the role of prior knowledge in terms of the performance improvement.

Models                                    | Accuracy
Attentive-LSTM                            | 73.6
Match-LSTM                                | 74.3
ARC-II                                    | 71.5
MatchPyramid                              | 71.7
MV-LSTM                                   | 73.5
MultiGranCNN                              | 74.3
KEHNN [Wu et al., 2018] + prior knowledge | 75.3
Our interaction matcher                   | 74.4
DV-VAE + 29k unlabeled WikiQA data        | 76.6

Table 3: Matching accuracy on the CQA dataset, in percentage. Results in the first 7 rows are from [Wu et al., 2018].

3.3 Model Visualization

We first visualize the reward D(x1, x2, y) for the interaction matcher in Figure 4 to justify the implicit Co-Training mechanism. The distributions of D(x1, x2, ycorrect) and D(x1, x2, yincorrect) on Du are distinguishable, and the mean for D(x1, x2, ycorrect) is larger than that for D(x1, x2, yincorrect), which is statistically significant (p < 0.01). This demonstrates that the interaction matcher may receive larger reward signals by predicting correct ys than incorrect ones on the unlabeled data. With the embedding matcher providing useful reward signals, the interaction matcher can effectively leverage unlabeled data.

Figure 4: Distribution of the reward signal D(x1, x2, y) on Du. (a) SNLI (|Dl| = 28k): (x1, x2, ycorrect) has µ̂ = −1.18, σ̂ = 0.67 and (x1, x2, yincorrect) has µ̂ = −1.38, σ̂ = 0.75. (b) Quora (|Dl| = 25k): (x1, x2, ycorrect) has µ̂ = −0.63, σ̂ = 0.50 and (x1, x2, yincorrect) has µ̂ = −1.07, σ̂ = 0.69.

We then visualize the learned decoder and embedding matcher in DV-VAE by generating labeled sentence pairs (x1, x2, y) from latent codes z1, z2 sampled from p(z) = N(z; 0, I). Some generated examples are shown in Table 4, demonstrating the capability of DV-VAE to learn the data manifold that is useful for semi-supervised classification.

x1: a child is playing with a soccer ball in the grass .
x2: a man is getting ready to throw a bowling ball down the lane .
y: neutral

x1: the woman is sitting on the bench .
x2: a woman in a black shirt is walking down the street .
y: contradiction

x1: a group of people are riding on a roller coaster .
x2: there is a group of people playing soccer .
y: contradiction

x1: two dogs are playing with a red ball .
x2: the dog is running .
y: entailment

x1: two young men , one wearing a white shirt and the other wearing a white shirt , are sitting on a bench .
x2: there is a man in a white shirt .
y: entailment

Table 4: Sentence pairs generated by DV-VAE trained on SNLI (|Dl| = 28k).

4 Related Work

Learning to match text sequences is a long-standing problem, and most state-of-the-art methods use a compare-aggregate architecture [Wang and Jiang, 2017], such as DIIN [Gong et al., 2018], CSRAN [Tay et al., 2018], MwAN [Tan et al., 2018] and KEHNN [Wu et al., 2018]. [Lan and Xu, 2018] compared a broad range of text matching models over eight datasets. They all focus on supervised learning, while we explore semi-supervised methods leveraging unlabeled text pairs to improve the performance of supervised methods.

Close to our work are recent applications of variational autoencoders [Kingma and Welling, 2014] and NVIL [Mnih and Gregor, 2014] in NLP. (i) Some focused on modeling a single piece of text: [Mnih and Gregor, 2014] and [Miao et al., 2016] used bag-of-words methods for document modeling; [Bowman et al., 2016b; Yang et al., 2017] and others explored VAE and various improved models to generate natural language sentences; and [Xu et al., 2017] adopted the semi-supervised VAE proposed in [Kingma et al., 2014] for text classification. (ii) Others developed specific VAE structures for sequence transduction tasks such as sentence compression [Miao and Blunsom, 2016], machine translation [Zhang et al., 2016], and dialogue generation [Serban et al., 2017]. (iii) To our knowledge, there are few studies modeling a pair of texts with a VAE, except [Shen et al., 2018], which adopted deconvolutional networks in a VAE for semi-supervised text matching. However, this method matches texts from the embedding view only, while we further combine the interaction view.

Our work is also related to Multi-View Learning [Xu et al., 2013], where features can be separated into distinct subsets (views). Particularly relevant are Co-Training [Blum and Mitchell, 1998] and Co-Regularization [Sindhwani et al., 2005], where two models train each other on unlabeled data. However, instead of explicitly designing an algorithm or an objective to enable the Co-Training mechanism, we implicitly achieve it by maximizing the variational lower bound.

5 Conclusions

In this study, we have proposed Dual-View Variational AutoEncoder (DV-VAE) to unify the embedding view and the interaction view for semi-supervised text matching. Gradient analysis has also revealed an implicit Co-Training mechanism to explain the semi-supervised learning process. Finally, our experimental study has verified the effectiveness of DV-VAE. Further, our work is a step towards combining multi-view learning with neural network models, which seems a promising strategy for semi-supervised learning.

Acknowledgments

This work is supported in part by National Key R&D Program of China 2018YFB1700403, NSFC U1636210&61421003, Shenzhen Institute of Computing Sciences, and the Fundamental Research Funds for the Central Universities.


References

[Blum and Mitchell, 1998] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.

[Bowman et al., 2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, pages 632–642, 2015.

[Bowman et al., 2016a] Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. A fast unified model for parsing and sentence understanding. In ACL, pages 1466–1477, 2016.

[Bowman et al., 2016b] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In CoNLL, pages 10–21, 2016.

[Dolan and Brockett, 2005] Bill Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In IWP, 2005.

[Gong et al., 2018] Yichen Gong, Heng Luo, and Jian Zhang. Natural language inference over interaction space. In ICLR, 2018.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[Kingma and Welling, 2014] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

[Kingma et al., 2014] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, pages 3581–3589, 2014.

[Lan and Xu, 2018] Wuwei Lan and Wei Xu. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In COLING, pages 3890–3902, 2018.

[Miao and Blunsom, 2016] Yishu Miao and Phil Blunsom. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP, 2016.

[Miao et al., 2016] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In ICML, pages 1727–1736, 2016.

[Mnih and Gregor, 2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, pages 1791–1799, 2014.

[Mueller and Thyagarajan, 2016] Jonas Mueller and Aditya Thyagarajan. Siamese recurrent architectures for learning sentence similarity. In AAAI, pages 2786–2792, 2016.

[Nakov et al., 2015] Preslav Nakov, Lluís Màrquez, Walid Magdy, Alessandro Moschitti, Jim Glass, and Bilal Randeree. SemEval-2015 task 3: Answer selection in community question answering. In SemEval, 2015.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[Rocktäschel et al., 2016] Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In ICLR, 2016.

[Serban et al., 2017] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pages 3295–3301, 2017.

[Shen et al., 2018] Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, and Lawrence Carin. Deconvolutional latent-variable model for text sequence matching. In AAAI, pages 5438–5445, 2018.

[Sindhwani et al., 2005] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML Workshop, 2005.

[Tan et al., 2018] Chuanqi Tan, Furu Wei, Wenhui Wang, Weifeng Lv, and Ming Zhou. Multiway attention networks for modeling sentence pairs. In IJCAI, 2018.

[Tay et al., 2018] Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. Co-stack residual affinity networks with multi-level attention refinement for matching text sequences. In EMNLP, pages 4492–4502, 2018.

[Wang and Jiang, 2017] Shuohang Wang and Jing Jiang. A compare-aggregate model for matching text sequences. In ICLR, 2017.

[Wang et al., 2017] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. In IJCAI, pages 4144–4150, 2017.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

[Wu et al., 2018] Yu Wu, Wei Wu, Can Xu, and Zhoujun Li. Knowledge enhanced hybrid neural network for text matching. In AAAI, pages 5586–5593, 2018.

[Xu et al., 2013] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv:1304.5634, 2013.

[Xu et al., 2017] Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In AAAI, pages 3358–3364, 2017.

[Yang et al., 2015] Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A challenge dataset for open-domain question answering. In EMNLP, pages 2013–2018, 2015.

[Yang et al., 2017] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In ICML, pages 3881–3890, 2017.

[Zhang et al., 2016] Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. Variational neural machine translation. In EMNLP, pages 521–530, 2016.

[Zhao et al., 2018] Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. Adversarially regularized autoencoders. In ICML, pages 5897–5906, 2018.
