
Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets

Guanhua Zhang1,2*, Bing Bai1*, Jian Liang1, Kun Bai1, Shiyu Chang3, Mo Yu3, Conghui Zhu2, Tiejun Zhao2
1Cloud and Smart Industries Group, Tencent, China
2Harbin Institute of Technology, China
3MIT-IBM Watson AI Lab, IBM Research, USA
{guanhzhang,icebai,joshualiang,kunbai}@tencent.com, [email protected], [email protected], {chzhu,tjzhao}@hit-mtlab.net

Abstract

Natural Language Sentence Matching (NLSM) has gained substantial attention from both academics and the industry, and rich public datasets contribute a lot to this process. However, biased datasets can also hurt the generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this procedure can easily bring unintended patterns, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naïve features are unreasonably predictive. Such features are the reflection of the selection bias and are termed the "leakage features." In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four out of them are significantly biased. We further propose a training and evaluation framework to alleviate the bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of trained models and give more trustworthy evaluation results for real-world adoptions.

Figure 1: Visualization of the distributions of normalized features (negative WMD, S1_freq, S2_freq, S1S2_inter) versus the label in QuoraQP. The right part (in red) represents the distributions of duplicated pairs, and the left part (in blue) represents the distributions of not duplicated pairs. Best viewed in color.

1 Introduction

Natural Language Sentence Matching (NLSM) aims at comparing two sentences and identifying their relationships (Wang et al., 2017), and serves as the core of many NLP tasks such as question answering and information retrieval (Wang et al., 2016b). Natural Language Inference (NLI) (Bowman et al., 2015) and Semantic Textual Similarity (STS) (Wang et al., 2016b) are both typical NLSM problems. A large number of publicly available datasets have benefited the research to a great extent (Kim et al., 2018; Wang et al., 2017; Tien et al., 2018), including QuoraQP¹, SNLI (Bowman et al., 2015), SICK (Marelli et al., 2014), etc. These datasets provide resources for both training and evaluation of different algorithms (Torralba and Efros, 2011).

However, most of the datasets are prepared by conducting procedures involving a sampling process, which can easily introduce a selection bias (Heckman, 1977; Zadrozny, 2004). It gets even worse when the bias can reveal the label information, resulting in "leakage features," which are irrelevant to the content/semantics of the sentences but are predictive of the label. One example is QuoraQP, a dataset on classifying whether two sentences are duplicated (labeled as 1) or not (labeled as 0), which has been widely used to evaluate STS models (Gong et al., 2017; Kim et al., 2018; Wang et al., 2017; Devlin et al., 2018). In QuoraQP, three leakage features have been identified: S1_freq, the number of occurrences of the first sentence in the dataset; S2_freq, the number of occurrences of the second sentence; and S1S2_inter, the number of sentences that are paired with both the first and the second sentences in the dataset for comparison.
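These three features depend only on the sentence ids, never on the text itself. The following is a minimal sketch of how they can be computed, assuming the dataset is available as a list of sentence-id pairs; the function and variable names are ours, not part of any released QuoraQP tooling.

```python
# Sketch: computing the three leakage features from sentence-id pairs alone.
from collections import Counter, defaultdict

def leakage_features(pairs):
    """pairs: list of (s1_id, s2_id) tuples covering the whole dataset."""
    freq = Counter()              # occurrences of each sentence id
    neighbors = defaultdict(set)  # ids each sentence is compared with
    for s1, s2 in pairs:
        freq[s1] += 1
        freq[s2] += 1
        neighbors[s1].add(s2)
        neighbors[s2].add(s1)
    feats = []
    for s1, s2 in pairs:
        # sentences paired with BOTH s1 and s2 in the dataset
        inter = len(neighbors[s1] & neighbors[s2])
        feats.append((freq[s1], freq[s2], inter))  # S1_freq, S2_freq, S1S2_inter
    return feats
```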

* Equal contributions from both authors. This work was done when Guanhua Zhang was an intern at Tencent.
¹ https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

Figure 1 shows the distributions of normalized (negative) Word Mover's Distance (WMD) (Kusner et al., 2015) and normalized leakage features versus the labels in QuoraQP. The features are all normalized to their quantiles. As illustrated, the leakage features are more predictive than the WMD, as the differences between the distributions of positive and negative pairs are more significant. Moreover, combining S1_freq and S2_freq can make even more accurate predictions, as illustrated in Figure 2, where we calculate the averages of the labels under different S1_freq and S2_freq. We find that when both features' values are large, the pairs tend to be duplicated (marked in red), while when one is large and the other is small, the pairs tend to be not duplicated (marked in blue).

Figure 2: The averages of the labels under different S1_freq and S2_freq. Red squares indicate that the averages are close to 1, and blue squares indicate that the averages are close to 0. Best viewed in color.

These leakage features play a critical role in the QuoraQP competition². As the evaluations are conducted with the same biased datasets, models that fit the bias pattern can take additional advantages over unbiased models, making the benchmark results untrustworthy. On the other hand, the bias pattern doesn't exist in the real world, so if a model fits the bias pattern (intentionally or unintentionally), the generalization performance will be hurt, limiting the value of these datasets for further applications (Torralba and Efros, 2011).

In this paper, we study this problem and demonstrate the impact of the selection bias through a series of experiments. We focus on the selection bias embodied in the comparing relationships of sentences, and the main contributions of this paper are the answers to the following questions:

• Does selection bias exist in other NLSM datasets? We identify four out of six publicly available datasets that suffer from the selection bias.

• Would Deep Neural Network (DNN)-based methods learn from the bias pattern unintentionally? We find that Siamese-LSTM models trained on QuoraQP do capture the bias pattern.

• Can we help the model learn the useful semantic pattern from the content without fitting the bias pattern? We propose an easy-adopting method to mitigate the bias. Experiments show that this method can improve the generalization performance of the trained models.

• Can we build an evaluation framework that gives us more reliable results for real-world adoption? We propose a more trustworthy evaluation method that demonstrates consistent results with unbiased cross-dataset evaluations.

The rest of the paper is organized as follows. Section 2 gives an empirical look at the selection bias on a variety of NLSM datasets and analyzes why the leakage features are effective. Section 3 examines whether DNN-based methods fit the bias pattern unintentionally. Section 4 introduces the training and evaluation framework to alleviate the biasedness. Taking QuoraQP as an example, we report the experimental results in Section 5. Section 6 summarizes related work, and Section 7 draws the conclusion.

2 Empirical Study of the Selection Bias

In this section, we investigate the problem of selection bias on six NLSM datasets and then analyze why the leakage features are effective.

2.1 Quantifying the Biasedness in Datasets

To quantify the severity of the leakage from the selection bias, we formulate a toy problem for NLSM: we predict the semantic relationship of two sentences based only on the comparing relationships between sentences. We refer to the semantic relationship of two sentences as their label, for example, duplicated for STS and entailment for NLI, and to the comparing relationship as whether they are paired for comparison in the dataset.

² https://www.kaggle.com/c/quora-question-pairs/discussion/34355 and https://www.kaggle.com/c/quora-question-pairs/discussion/33168

Method | SNLI | MultiNLI Matched | MultiNLI Mismatched | QuoraQP | MSRP | SICK NLI | SICK STS | ByteDance
Majority | 33.7 | 35.6 | 36.5 | 50.00 | 66.5 | 56.7 | 50.3 | 68.59
Unlexicalized | 47.7 | 44.9 | 45.5 | 68.20 | 73.9 | 70.1 | 70.2 | 75.23
LSTM | 77.6* | 66.9† | 66.9† | 82.58‡ | 70.6⋄ | 71.3▷ | 70.2 | 86.45
Leakage | 36.6 | 32.1 | 31.1 | 79.63 | 66.7 | 56.7 | 55.5 | 78.24
Advanced | 39.1 | 32.7 | 33.8 | 80.47 | 67.9 | 57.5 | 56.3 | 85.73
Leakage vs Majority | +8.61 | -9.83 | -14.79 | +59.26 | +0.30 | 0.00 | +10.34 | +14.07
Advanced vs Majority | +16.02 | -8.15 | -7.40 | +60.94 | +2.11 | +1.41 | +11.93 | +24.99

Table 1: The accuracy scores of predicting the label with unlexicalized features, leakage features, and advanced graph-based features, together with the relative improvements. The result with * is from Bowman et al. (2015). Results with † are from Williams et al. (2018). The result with ‡ is from Wang et al. (2017). The result with ⋄ is from Shen et al. (2018). The result with ▷ is from Baudiš et al. (2016). Other results are based on our implementations. "%" is omitted.

Here we only consider the index of each sentence; the actual content is not used. The formal problem definition is as follows:

Problem 1 (Leveraging the Leakage for NLSM). Given a set of sentence ids S, and a set of comparing relationships of the sentences C = {⟨s_i, s_j⟩}, s_i, s_j ∈ S, the goal is to infer the semantic relationship between given pairs of sentence ids from S.

Sentence1 ID | Sentence2 ID | Label
1 | 2 | ?
1 | 3 | ?
1 | 5 | ?
2 | 3 | ?
2 | 4 | ?
2 | 6 | ?

Figure 3: Illustration of the graph built for Problem 1. We only use the comparing relationships to build the graph.

This toy problem is indeed an edge classification problem (Aggarwal et al., 2016), as we can construct a graph using the comparing relationships, as illustrated in Figure 3. In addition, from the graph perspective, S1_freq and S2_freq are the degrees of the two nodes, and S1S2_inter is the number of 2-hop paths connecting them. Learning on the graph for this toy problem follows a transductive setting (Ji et al., 2010), where the graph is built with the comparing relationships of all the examples.

Based on the new problem definition, we investigate six NLSM datasets, including SNLI, MultiNLI (Williams et al., 2018), QuoraQP, MSRP (Dolan et al., 2004), SICK and ByteDance³. We apply two different methods to classify the edges on the graph: Leakage, which uses the three leakage features introduced in Section 1, and Advanced, which uses some more advanced graph-based features (Perozzi et al., 2014; Zhou et al., 2009; Liben-Nowell and Kleinberg, 2007) together with the three leakage features⁴. We also report the results of three baselines: Majority, which predicts the most frequent label; Unlexicalized, which uses 15 handcrafted features from the content of sentences (Bowman et al., 2015), e.g., the BLEU score (Papineni et al., 2002) of both sentences, the length difference between the two sentences, the percentage of overlapping words, and so on; and LSTM, which is a DNN-based method using sequences of word embeddings. All classifiers are Random Forests if no specific configuration is mentioned. The classifiers are trained with the training set, and we report the results on the testing set. More detailed settings are introduced in Appendix A. The results are reported in Table 1.

Predicting semantic relationships without using sentence contents seems impossible. However, we find that the graph-based features (Leakage and Advanced) make the problem feasible on a wide range of datasets. Specifically, on datasets like QuoraQP and ByteDance, the leakage features are even more effective than the unlexicalized features. One exception is that on MultiNLI, Majority outperforms Leakage and Advanced significantly. Another interesting finding is that on SNLI and ByteDance, the advanced graph-based features improve a lot over the leakage features, while on QuoraQP, the difference is very small.

³ https://www.kaggle.com/c/fake-news-pair-classification-challenge
⁴ The features are selected carefully to describe the local structure between two nodes and to prevent the model from remembering the exact ID of sentences to make inferences.
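Under this graph view, both the leakage features and several of the Advanced features can be read directly off a standard graph library. The sketch below uses networkx as an illustrative choice (the paper does not specify an implementation) on the toy edges of Figure 3:

```python
# Sketch: the graph of Problem 1 and a few of the edge features used by
# Leakage/Advanced; networkx is our illustrative choice, not the authors'.
import networkx as nx

pairs = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 6)]  # edges of Figure 3
G = nx.Graph(pairs)

def edge_features(G, u, v):
    common = list(nx.common_neighbors(G, u, v))
    return {
        "S1_freq": G.degree(u),        # degree of the first node
        "S2_freq": G.degree(v),        # degree of the second node
        "S1S2_inter": len(common),     # number of 2-hop paths between u and v
        # link-prediction scores of the kind used by Advanced
        "jaccard": next(nx.jaccard_coefficient(G, [(u, v)]))[2],
        "adamic_adar": next(nx.adamic_adar_index(G, [(u, v)]))[2],
        "pref_attach": next(nx.preferential_attachment(G, [(u, v)]))[2],
        "resource_alloc": next(nx.resource_allocation_index(G, [(u, v)]))[2],
    }

print(edge_features(G, 1, 2))
```

A Random Forest trained on such per-edge feature vectors is all the Leakage and Advanced methods require.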

Features | SNLI | QuoraQP | SICK STS | ByteDance
S1_freq | 33.7 | 65.90 | 54.5 | 68.61
S2_freq | 36.6 | 69.84 | 52.5 | 73.03
S1S2_inter | 33.7 | 79.66 | 50.8 | 76.63
¬ S1_freq | 36.6 | 79.62 | 53.5 | 77.17
¬ S2_freq | 33.7 | 79.66 | 53.0 | 77.44
¬ S1S2_inter | 36.6 | 74.75 | 54.2 | 74.39
all | 36.6 | 79.63 | 55.5 | 78.24
Majority | 33.7 | 50.00 | 50.3 | 68.59

Table 2: Ablation experiments of the three leakage features on the datasets. "¬" means without the feature. We report the accuracy scores and "%" is omitted.

Figure 4: The percentage of each label (entailment, neutral and contradiction) versus S2_freq in SNLI.

Among all the tested datasets, only MSRP and SICK NLI are almost neutral to the leakage features. Note that their sizes are relatively small, with less than 10k samples each. The results in Table 1 raise concerns about the impact of selection bias on the models and evaluation results.

2.2 Why are the Leakage Features Effective?

As discussed in Section 1, the leakage features are the reflection of selection bias. Intuitively, if we constructed a dataset for NLSM by randomly sampling pairs of sentences, the resulting dataset would be extremely imbalanced, where most of the pairs are neutral for NLI or not duplicated for STS. Thus, to make the dataset relatively balanced, a sampling strategy is often required. If the strategy is not well-designed, it will introduce a bias pattern into the dataset, which can be revealed by leakage features. Here we try to figure out why the leakage features are effective in the aforementioned datasets. Since we do not have every detail about how they were constructed, we only analyze based on SNLI and QuoraQP.

During the preparation of SNLI, as introduced in Bowman et al. (2015), human workers are presented with "premise scene descriptions," and asked to supply "hypotheses" for each of the three labels (i.e., entailment, neutral and contradiction). However, it is found that some workers are "reusing the same sentence for many different prompts," which might cause SNLI to suffer from selection bias. To validate, we calculate the percentage of each label versus S2_freq, and the results are shown in Figure 4. We see that the percentages of the three labels are similar when S2_freq is small, but as S2_freq increases, the label is more likely to be entailment.

For the QuoraQP dataset, the providers state that "Our original sampling method returned an imbalanced dataset with many more true examples of duplicate pairs than non-duplicates. Therefore, we supplemented the dataset with negative examples. One source of negative examples were pairs of 'related questions' which, although pertaining to similar topics, are not truly semantically equivalent." Our hypothesis is that the way in which negative samples were supplemented is the reason why QuoraQP is so biased. For example, the newly added sentences of "related questions" may appear in the dataset a limited number of times; thus we get the phenomenon in Figure 2, i.e., if two sentences both appear many times, the pair is likely to be duplicated, while if one of them appears only a few times, the pair is likely to be not duplicated.

We conduct ablation experiments on the datasets where the leakage features are effective, i.e., SNLI, QuoraQP, SICK STS and ByteDance. The results are reported in Table 2. We can see that S2_freq is more effective in SNLI, and S1_freq plays a more critical role in SICK STS, while in QuoraQP and ByteDance, S1S2_inter is the most predictive.

Based on the experiments and observations, we conclude that existing datasets are inclined to be biased due to various reasons. More information about dataset preparations and further study are required to understand the problem and prevent bias from being introduced into future datasets.
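The Figure 4 diagnostic generalizes to any NLSM dataset: group the pairs by S2_freq and measure the label proportions within each group. A sketch with pandas, assuming a DataFrame with our own column names "S2_freq" and "label":

```python
# Sketch: percentage of each label conditioned on S2_freq (cf. Figure 4).
import pandas as pd

def label_percentage_by_s2_freq(df: pd.DataFrame) -> pd.DataFrame:
    counts = df.groupby(["S2_freq", "label"]).size().unstack(fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)  # each row sums to 1.0
```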

Figure 5: Visualization of predicted scores versus the leakage feature. The boxes represent the upper quartiles to the lower quartiles of the predicted scores, and the lowest datum is the 1.5 IQR of the lower quartile.

3 Do NN Models Fit the Bias Pattern Unintentionally?

In this section, we investigate whether DNN models are unintentionally fitting the bias pattern in addition to the semantic pattern. We train a classical Siamese-LSTM model⁵ with the training set of QuoraQP, and make predictions on a synthetic dataset. Interestingly, we find that the results are significantly influenced by the bias pattern.

The synthetic dataset is built in the following way. We extract the distinct sentences from the training set of QuoraQP, then compare the sentences with themselves; finally, we obtain 517,970 pairs in total. Since the two sentences in each pair are identical, the labels are all duplicated, and all three leakage features are the same, i.e., the number of occurrences of the sentence in the dataset. If the model can perfectly learn the semantic relationships between sentences, the predictions should be substantially the same for all the pairs.

To illustrate the predicted scores of duplication, we visualize them versus the leakage features in Figure 5; the boxplot follows the Tukey boxplot style (Frigge et al., 1989). Intriguingly, we find that even though the sentences in the pairs are all identical, the model still tends to give lower scores of duplication to the pairs with leakage features equal to 1. This result is consistent with the bias pattern shown in Figure 2, i.e., the data points in the bottom left corner tend to be not duplicated, compared with the data points in the top right corner, which represent larger values of S1_freq and S2_freq.

The results indicate that the model is unintentionally capturing the undesired bias pattern that only exists in the particular dataset. This will have an adverse effect on the generalization performance of the trained models (to be illustrated in Section 5.4).

⁵ The detailed setting for the model is introduced in Section 5.2.
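The synthetic probe described above is straightforward to reproduce; below is a sketch assuming the training pairs are available as sentence-id tuples (helper names are ours):

```python
# Sketch: the Section 3 probe. Pair every distinct training sentence with
# itself, so all gold labels are "duplicated" and the only varying signal
# left is the leakage feature (the sentence's frequency in the dataset).
from collections import Counter

def build_synthetic(train_pairs):
    freq = Counter()
    for s1, s2 in train_pairs:
        freq[s1] += 1
        freq[s2] += 1
    # (s1, s2, label, shared leakage feature) for each distinct sentence
    return [(s, s, 1, freq[s]) for s in freq]
```

An unbiased model should score all of these identical pairs (near-)identically, regardless of the leakage value.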

4 Leakage-Neutral Learning and Evaluation Method

Given a biased dataset, can we eliminate the bias to train completely unbiased models? Unfortunately, this is very difficult because the bias is related to the labels, and we cannot have access to the labels of unselected samples (Zadrozny, 2004). In this paper, we propose to take a step back and define a leakage-neutral distribution, which is closer to the real world than the biased one. We make a few reasonable assumptions about it and about how the biased dataset is generated from it. We demonstrate that we can train and evaluate models unbiased to the leakage-neutral distribution, with only the biased dataset.

Generation of the biased dataset from the leakage-neutral distribution. Assume that there is a leakage-neutral distribution D with domain X × Y × L × S, where X is the semantic feature space, Y is the (binary) semantic label space, L is the sampling strategy feature space, and S is the (binary) sampling intention space. The sampling intentions represent whether the dataset providers want to select a positive or a negative sample. For example, S = 1 means that the providers want to select a positive sample here.

We assume that samples (x, y, l, s) are drawn independently from D; then if s = y (the label matches the sampling intention), the samples are selected into the dataset, otherwise the samples are discarded. This operation results in the biased distribution D_b that is observed from the dataset. In this section, we use uppercase letters, such as Y and S, to represent random variables, and lowercase letters, such as y and s, to represent specific values for samples. We use P_{D_b}(·) to represent the probability on D_b and omit the subscript for D.

Assumptions about the leakage-neutral distribution. We make the following assumptions about D. The first one is the leakage-neutral assumption, defined as

P(Y | L) = P(Y),

which means that the sampling strategy is independent of the labels, making the leakage-neutral distribution closer to the real world. The second one is that, given L, S is independent of X and Y, defined as

P(S | X, Y, L) = P(S | L),

which means that the sampling strategy features can completely control the sampling intentions.

Leakage-neutral learning and evaluation method. Based on the assumptions above, given a biased dataset, the proposed method works in the following way.

Firstly, we estimate P_{D_b}(Y = 1 | l) from the dataset for all samples. In practice, this can be achieved by training classifiers and making cross-predictions. Since we don't have access to the true sampling strategy features, we use the leakage features from the graph instead, as they are the reflection of the biased sampling strategy.

Then we can get P(S = 1 | l), the conditional probability of the sampling intention S on D given l, using the following equation with P(Y = 1) given:

P(S = 1 | l) = P(Y = 0) P_{D_b}(Y = 1 | l) / [P(Y = 0) P_{D_b}(Y = 1 | l) + P(Y = 1) P_{D_b}(Y = 0 | l)].   (1)

The derivation of Equation (1) is presented in Appendix B.1.

Afterwards, we use w = 1 / P(S = y | l) as the weights for the samples (note that the labels y are needed here). Training and evaluating with the weights gives us results unbiased to the leakage-neutral distribution.

The step-by-step procedure for leakage-neutral learning and evaluation is presented in Algorithm 1. Note that our analyses and the proposed method are general enough for a variety of biases, as long as a sampling strategy feature is given, and can be easily extended to multi-class classification problems.

Algorithm 1: Leakage-neutral Training and Evaluation
Input: The dataset {x, y}, the number of folds K for cross-prediction, and the prior probability P(Y = 1).
Procedure:
01 Extract the leakage features l from the dataset.
02 Estimate P_{D_b}(Y = 1 | l) for all samples by training classifiers and using a K-fold cross-predicting strategy.
03 Calculate P(S = 1 | l) for all samples according to Equation (1).
04 Obtain the weights w = 1 / P(S = y | l) for all samples and normalize the mean of the weights.
05 Train and validate models with the training set and validation set respectively, using w as the sample weights.
06 Evaluate the models with the testing set, using w as the sample weights.

Theoretical guarantee of unbiasedness. Assuming that we know P(S = y | l), and that it is greater than zero for any l, the following theorem shows that we can obtain the loss unbiased to the leakage-neutral distribution after using the sample weights.

Theorem 1 (Unbiased Expectation). For any classifier f = f(x, l), and for any loss function Δ = Δ(f(x, l), y), if we use w = P(S = Y) / P(S = y | l) as weights, then

E_{(x,y,l)~D_b}[w Δ(f(x, l), y)] = E_{(x,y,l)~D}[Δ(f(x, l), y)].

The proof is presented in Appendix B.2. Since P(S = Y) is only a number which does not affect the models, we can concentrate on the denominator, i.e., P(S = y | l), and use w = 1 / P(S = y | l) as the weights instead. The loss can be used for both training and evaluation unbiased to the leakage-neutral distribution.

5 Experimental Results for the Leakage-neutral Method on QuoraQP

In this section, we present the experimental results for leakage-neutral learning on QuoraQP. We demonstrate that the proposed learning framework can mitigate the bias and improve the generalization performance of trained models. Besides, the corresponding evaluation method can serve as a more reliable in-domain benchmark compared with the biased one.

5.1 Dataset Information and Weight Generation

We use QuoraQP as our experimental dataset, with the same dataset partition as Wang et al. (2017).

We use the three leakage features for generating the weights. We use Random Forest classifiers to estimate P_{D_b}(Y = 1 | l), and take the 100-fold cross-predictions as the estimated values. P(Y = 1) is chosen to keep the proportion of the weights of positive and negative samples unchanged, in order to prevent the influence of prior probabilities, and the mean of the weights is normalized to 1.
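Steps 1-4 of Algorithm 1 map directly onto standard tooling. The following sketch uses scikit-learn with the Random Forest estimator and 100-fold cross-prediction described in this subsection; the clipping guard against near-zero probabilities is our own addition, not part of the paper:

```python
# Sketch of Algorithm 1, steps 2-4, with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def leakage_neutral_weights(leak, y, p_y1, n_folds=100):
    """leak: (n, 3) leakage features; y: binary labels; p_y1: prior P(Y=1)."""
    # Step 2: K-fold cross-predicted estimates of P_Db(Y = 1 | l).
    clf = RandomForestClassifier(n_estimators=100)
    p_db = cross_val_predict(clf, leak, y, cv=n_folds,
                             method="predict_proba")[:, 1]
    # Step 3: Equation (1) turns P_Db(Y = 1 | l) into P(S = 1 | l).
    p_y0 = 1.0 - p_y1
    p_s1 = p_y0 * p_db / (p_y0 * p_db + p_y1 * (1.0 - p_db))
    # Step 4: w = 1 / P(S = y | l), normalized to mean 1.
    p_s_eq_y = np.where(y == 1, p_s1, 1.0 - p_s1)
    w = 1.0 / np.clip(p_s_eq_y, 1e-6, None)  # guard against tiny probabilities
    return w / w.mean()
```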

The minimum weight over all the samples is 0.51, and the maximum weight is 4953.17.

5.2 Settings

We implement a classical Siamese-LSTM model with Keras and a TensorFlow (Abadi et al., 2016) backend. Sequences of the embeddings of word tokens are fed into the LSTM layer with a hidden size of 128. Then the representations of both sentences, as well as the dot-product of the representations, go through a two-layer MLP, where Batch Normalization (Ioffe and Szegedy, 2015) is applied after every hidden layer. Dropout (Srivastava et al., 2014) with rate 0.5 is applied after the last hidden layer. We use the RMSProp (Tieleman and Hinton, 2012) optimizer to train all the parameters. The learning rate starts at 1e-3, and decays at a fixed rate of 0.2 when performance does not improve on the validation set. We also use gradient clipping of 5.0. The batch size is set to 256. All the results reported in this section are the averages of ten runs using the same hyper-parameters with different random initializations. Our implementation achieves slightly better performance compared with the results of the original Siamese-LSTM from Wang et al. (2017).

We initialize our word embeddings with pre-trained GloVe 840B 300D vectors (Pennington et al., 2014), and the embeddings are kept fixed during training. All the sentences are cut off to a maximum of 35 word tokens.

Note that the scale of the weights varies greatly across samples. To prevent the model from jiggling during mini-batch training, we use a sampling strategy for model training, i.e., we sample examples with probabilities proportional to the weights to get the data for every mini-batch⁶.

5.3 Evaluation Scheme

To evaluate the effectiveness of leakage-neutral learning, we use the following strategy in our experiments. Firstly, we train and validate a model using the data from QuoraQP without any weights; this model is referred to as the Biased Model. Then we train and validate a model using the data from QuoraQP with the weights; this model is referred to as the Debiased Model. These two models are evaluated with the following methods.

• Testing set evaluation. We evaluate the models with the testing set of QuoraQP. Evaluation without the weights is named Biased Eva, and evaluation with the weights is named Debiased Eva. This shows how the leakage-neutral evaluation proposed in Section 4 affects the evaluation results.

• Synthetic dataset evaluation. We evaluate the performance of the models with the synthetic dataset introduced in Section 3. With the prior probabilities of the positive/negative classes fixed, a better model is supposed to give higher accuracy and tends to be less impacted by the bias pattern.

• Cross-dataset evaluation. We evaluate how the models perform on other STS datasets, i.e., MSRP and SICK STS. We use the entire datasets for evaluation. As the preparation strategies of different datasets are different, cross-dataset evaluations will not give additional rewards for the selection bias of QuoraQP. Although different datasets may have different contexts, a better model trained with QuoraQP is still supposed to perform better.

Method | Biased Eva | Debiased Eva
Majority | 50.00 | 51.62
Leakage | 79.63 | 54.40
Biased Model | 83.97 | 78.76
Debiased Model | 82.90 | 80.11

Table 3: Evaluation results on the testing set of QuoraQP. We report the accuracy scores and "%" is omitted.

Method | Synthetic | MSRP | SICK STS
Biased Model | 89.46 | 51.94 | 64.95
Debiased Model | 92.62 | 56.77 | 66.05

Table 4: Evaluation results on the synthetic dataset, MSRP and SICK STS. We report the accuracy scores and "%" is omitted.

Among all the evaluation methods, using the testing set for evaluation without weights (Biased Eva) is biased, and we will show that the Debiased Eva is more consistent with the unbiased synthetic dataset evaluation and the cross-dataset evaluations.

⁶ Code and weights are published at https://github.com/arthua196/Leakage-Neutral-Learning-for-QuoraQP
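Both the weight-proportional mini-batch sampling described above and the weighted Debiased Eva reduce to small utilities in practice; here is a sketch with numpy and scikit-learn (helper names are ours):

```python
# Sketch: weight-proportional mini-batch sampling (Section 5.2) and the
# sample-weighted accuracy used for Debiased Eva (Section 5.3).
import numpy as np
from sklearn.metrics import accuracy_score

def sample_minibatch(weights, batch_size=256, rng=None):
    rng = rng or np.random.default_rng()
    p = weights / weights.sum()
    # indices drawn with probability proportional to the sample weights
    return rng.choice(len(weights), size=batch_size, p=p)

def debiased_accuracy(y_true, y_pred, weights):
    # Weighted evaluation, unbiased w.r.t. the leakage-neutral
    # distribution by Theorem 1.
    return accuracy_score(y_true, y_pred, sample_weight=weights)
```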

Figure 6: Visualization of the scores predicted by the Biased and Debiased Models versus the leakage feature. Red boxes represent the results of the Biased Model, and blue boxes represent the results of the Debiased Model. Best viewed in color.

5.4 Experimental Results

The evaluation results on the testing set of QuoraQP are reported in Table 3. From the accuracy of the Leakage method, we can see that although the influence isn't completely eliminated, the evaluation result of Debiased Eva is much less impacted by the bias pattern in the original distribution. This makes the results more reliable for evaluations. The reason why the bias of the Leakage method could not be completely eliminated is that we cannot estimate P(S = y | l) perfectly. A minor error in P(S = y | l) may result in a significant difference in the weight, especially when the probability is close to zero, since the multiplicative inverse is used.

As for the Biased Model and the Debiased Model, we find that the Biased Model performs significantly better under the Biased Eva. This is the effect of fitting the bias pattern in addition to the semantic pattern, thus taking some extra advantage that cannot be generalized to real-life cases. On the other hand, under the Debiased Eva, we find that the Debiased Model performs the best.

Table 4 reports the results on the datasets that are not biased toward the leakage pattern of QuoraQP. We find that the Debiased Model significantly outperforms the Biased Model on all three datasets. This indicates that the Debiased Model better captures the true semantic similarities of the input sentences. We further visualize the predictions on the synthetic dataset in Figure 6. As illustrated, the predictions are more neutral to the leakage feature.

From the experimental results, we can see that the proposed leakage-neutral training method is effective, as the Debiased Model performs significantly better on the synthetic dataset, MSRP and SICK, showing better generalization strength. Moreover, the Debiased Eva gives results that are more consistent with the results on unbiased datasets, so it can serve as a more reliable in-domain way to evaluate models trained with QuoraQP. As a conclusion, our constructed leakage-neutral distribution is closer to the real-world one than the biased distribution that is directly observed from the given datasets.

6 Related Work

In this section, we summarize the related work and distinguish it from our contributions.

Inverse propensity score for debiasing. Usually, the Inverse Propensity Score (IPS) is used to reduce the selection bias (Schonlau et al., 2009; d'Agostino, 1998), where the propensity score (Rosenbaum and Rubin, 1983) is the probability that a sample will be selected into the dataset. Zadrozny (2004) studies the learning and evaluating of classifiers under sample selection bias, whose focus was the "missing-at-random" (MAR) (Little and Rubin, 2014) problem, where the biasedness only depends on the feature vector x.

For NLSM datasets, the selection bias is "not-missing-at-random" (NMAR) (Little and Rubin, 2014); thus we cannot hope to estimate the true propensity scores directly, as this requires the labels of unselected samples (Zadrozny, 2004). In this paper, we propose to fit a constructed leakage-neutral distribution, which can be achieved with only the selected samples that we can access.

Biasedness of datasets. Although dataset bias is often mentioned, the research community is not paying sufficient attention to it compared with models and algorithms. Torralba and Efros (2011) studied the dataset bias of image recognition datasets, and categorized the bias into Selection Bias, Capture Bias and Negative Set Bias. Selection bias is widely studied in the search ranking field as position bias (Wang et al., 2016a, 2018; Joachims et al., 2017). Usually the propensity scores are estimated through online Result Randomization (Joachims et al., 2017). Liang et al. (2019) studied the biasedness in authentication, and proposed an additive adversarial learning framework for unbiased learning.

In the NLP field, Minka and Robertson (2008) studied the selection bias in the LETOR datasets,

and found that Reverse BM25 performs unreasonably well due to the selection procedure. Dixon et al. (2018) studied the potential unfairness in toxic comment classification due to unintended bias, and proposed methods to mitigate it by balancing the training dataset with additional data. Gururangan et al. (2018) and Poliak et al. (2018) found that some NLI datasets carry biases of specific linguistic phenomena, which make it possible to classify the relationship of a pair of sentences by only looking at one of them. Sugawara et al. (2018) investigated what makes questions easier across 12 recent Machine Reading Comprehension (MRC) datasets, and the results suggest that one might overestimate recent advances in MRC.

In this paper, we study the selection bias embodied in the comparing relationships in NLSM datasets. To the best of our knowledge, this is the first study of this kind of selection bias.

7 Conclusion

In this paper, we take a close look at the selection bias of NLSM datasets and focus on the selection bias embodied in the comparing relationships of sentences. To mitigate the bias, we propose an easy-adopting method for leakage-neutral learning and evaluation.

However, there is still much to do to form a clearer scope of this problem. For example, we still do not know the details of the dataset preparations of many other NLSM datasets, and we cannot say to what extent the assumptions in Section 4 hold in QuoraQP, or what the relationship is between the leakage-neutral distribution and the real-world distribution. We suggest that the providers of future NLSM datasets pay more attention to this problem. Furthermore, they could reveal the detailed strategy of sample selection, and might publish some official weights to eliminate the bias.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.

Charu Aggarwal, Gewen He, and Peixiang Zhao. 2016. Edge classification in networks. In Proceedings of the 32nd IEEE International Conference on Data Engineering, pages 1038–1049.

Petr Baudiš, Jan Pichl, Tomáš Vyskočil, and Jan Šedivý. 2016. Sentence pair scoring: Towards unified framework for text comprehension. arXiv preprint arXiv:1603.06127.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Ralph B d'Agostino. 1998. Propensity score methods for bias reduction in the comparison of a treatment to a non-randomized control group. Statistics in Medicine, 17(19):2265–2281.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73. ACM.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, page 350. Association for Computational Linguistics.

Michael Frigge, David C Hoaglin, and Boris Iglewicz. 1989. Some implementations of the boxplot. The American Statistician, 43(1):50–54.

Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv preprint arXiv:1709.04348.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112.

James J Heckman. 1977. Sample selection bias as a specification error (with an application to the estimation of labor supply functions).

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.

Ming Ji, Yizhou Sun, Marina Danilevsky, Jiawei Han, and Jing Gao. 2010. Graph regularized transductive classification on heterogeneous information networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 570–586. Springer.

Thorsten Joachims, Adith Swaminathan, and Tobias Schnabel. 2017. Unbiased learning-to-rank with biased feedback. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 781–789. ACM.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Jian Liang, Yuren Cao, Chenbin Zhang, Shiyu Chang, Kun Bai, and Zenglin Xu. 2019. Additive adversarial learning for unbiased authentication. arXiv preprint arXiv:1905.06517.

David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031.

Roderick JA Little and Donald B Rubin. 2014. Statistical Analysis with Missing Data, volume 333. John Wiley & Sons.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223.

Tom Minka and Stephen Robertson. 2008. Selection bias in the LETOR datasets. In SIGIR Workshop on Learning to Rank for Information Retrieval, pages 48–51. Citeseer.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191.

Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Matthias Schonlau, Arthur Van Soest, Arie Kapteyn, and Mick Couper. 2009. Selection bias in web surveys and the use of propensity scores. Sociological Methods & Research, 37(3):291–318.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Saku Sugawara, Kentaro Inui, Satoshi Sekine, and Akiko Aizawa. 2018. What makes reading comprehension questions easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4208–4219.

Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Huy Nguyen Tien, Minh Nguyen Le, Yamasaki Tomohiro, and Izuha Tatsuya. 2018. Sentence modeling via multiple word embeddings and multi-level comparison for semantic textual similarity. arXiv preprint arXiv:1805.07882.

Antonio Torralba and Alexei A Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE.

Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016a. Learning to rank with selection bias in personal search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115–124. ACM.

Xuanhui Wang, Nadav Golbandi, Michael Bendersky, Donald Metzler, and Marc Najork. 2018. Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 610–618. ACM.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4144–4150. AAAI Press.

Zhiguo Wang, Haitao Mi, and Abraham Ittycheriah. 2016b. Sentence similarity learning by lexical decomposition and composition. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1340–1349.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, page 114. ACM.

Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. 2009. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630.

A Detailed Settings for the Experiments in Section 2.1

A.1 Dataset Description

We summarize the statistics of the datasets used in Section 2 in Table 5.

Dataset | Training | Testing | # classes
SNLI | 549k | 10k | 3
MultiNLI | 393k | 10k | 3
QuoraQP | 384k | 10k | 2
MSRP | 4k | 2k | 2
SICK | 5k | 5k | 2/3
ByteDance | 256k | 32k | 3

Table 5: Information about the datasets.

For SICK, both the entailment label and the relatedness score are provided. We use the sentence pairs with relatedness score greater than 3.6 as duplicated, and otherwise not duplicated. This threshold gives roughly 50% positive pairs and 50% negative pairs. For ByteDance, since no existing dataset partition is available, we randomly divide the dataset into a training set, a validation set, and a testing set in a ratio of 8:1:1. We use the sentences in English during our experiments.

A.2 Features Used in Unlexicalized

We list the 15 features we used in the method Unlexicalized in Section 2.1. We use 3 types of unlexicalized features (Bowman et al., 2015):

• The BLEU score of both sentences, using n-gram lengths from 1 to 4, which gives 4 features in total.

• The length difference between the two sentences, as one real-valued feature.

• The number and percentage of overlapping words between both sentences, over all words and over just nouns, verbs, adjectives and adverbs, which gives 10 features in total.

A.3 Features Used in Advanced

We list the features we used in the method Advanced in Section 2.1. As mentioned above, if we use a node to represent a sentence and add an undirected edge if two sentences are compared in the dataset, the whole dataset can be viewed as a graph, as illustrated in Figure 3. To classify the edges in the graph, we use 3 types of graph-based features:

• The original and extended leakage features: the degrees of both nodes, the number of 2-hop and 3-hop paths between the two nodes, and the number of 2-hop and 3-hop neighbors of both nodes, which gives 8 features in total.

• The element-wise product and dot product of the Deepwalk (Perozzi et al., 2014) embeddings of the two nodes, 65 features all together.

• The resource allocation index, Jaccard coefficient, preferential attachment score and Adamic-Adar index (Zhou et al., 2009; Liben-Nowell and Kleinberg, 2007) of the two nodes, which gives 4 features in total.

B Proof for the Theorems

B.1 Derivation of Equation (1)

Here we present the derivation of Equation (1).

Proof.

P_{D_b}(Y = 1 | l)
  = P(Y = 1 | S = Y, l)
  = P(Y = 1, S = 1 | l) / [P(Y = 1, S = 1 | l) + P(Y = 0, S = 0 | l)]
  = P(Y = 1 | l) P(S = 1 | l) / [P(Y = 1 | l) P(S = 1 | l) + P(Y = 0 | l) P(S = 0 | l)]
  = P(Y = 1) P(S = 1 | l) / [P(Y = 1) P(S = 1 | l) + P(Y = 0) P(S = 0 | l)].

Solving the above equation for P(S = 1 | l) gives the result in Equation (1).

B.2 Proof of Theorem 1

Here we present the proof for Theorem 1, i.e., the unbiased expectation theorem.

Proof.

E_{(x,y,l)~D_b}[w Δ(f(x, l), y)]
  = ∫ [P(S = Y) / P(S = y | l)] Δ(f(x, l), y) dP_{D_b}(x, y, l)
  = ∫ [P(S = Y) / P(S = y | l)] Δ(f(x, l), y) dP(x, y, l | S = Y)
  = ∫ [P(S = Y) / P(S = y | l)] Δ(f(x, l), y) [P(S = y | x, y, l) dP(x, y, l)] / P(S = Y)
  = ∫ Δ(f(x, l), y) dP(x, y, l)
  = E_{(x,y,l)~D}[Δ(f(x, l), y)].

The last cancellation uses the assumption P(S | X, Y, L) = P(S | L), so that P(S = y | x, y, l) = P(S = y | l) cancels against the denominator of the weight.

As illustrated above, by adding the specific weights to the samples, we obtain a loss unbiased with respect to the leakage-neutral distribution D. The unbiased loss can be used for both training and evaluation.
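Theorem 1 can also be sanity-checked numerically: simulate a toy leakage-neutral distribution that satisfies both assumptions of Section 4, apply the s = y selection rule, and compare the weighted average loss over the selected samples with the unweighted average over the full population. The numbers below are entirely synthetic and for illustration only:

```python
# Monte-Carlo sanity check of Theorem 1 on a toy leakage-neutral distribution.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
l = rng.integers(0, 2, n)   # sampling strategy feature
y = rng.integers(0, 2, n)   # label, independent of l (first assumption)
# S depends only on l (second assumption): P(S=1|l=1)=0.8, P(S=1|l=0)=0.3
s = np.where(l == 1, rng.random(n) < 0.8, rng.random(n) < 0.3).astype(int)

loss = (y != l).astype(float)     # any loss of the form Delta(f(x, l), y)

selected = s == y                 # the biased dataset D_b
p_s1_given_l = np.where(l == 1, 0.8, 0.3)
p_s_eq_y = np.where(y == 1, p_s1_given_l, 1.0 - p_s1_given_l)
w = (s == y).mean() / p_s_eq_y    # w = P(S = Y) / P(S = y | l)

print(loss.mean())                            # expectation under D
print((w[selected] * loss[selected]).mean())  # weighted expectation under D_b
```

Both printed values agree up to Monte-Carlo noise, as Theorem 1 predicts.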
