Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets

Guanhua Zhang1,2∗, Bing Bai1∗, Jian Liang1, Kun Bai1, Shiyu Chang3, Mo Yu3, Conghui Zhu2, Tiejun Zhao2
1Cloud and Smart Industries Group, Tencent, China
2Harbin Institute of Technology, China
3MIT-IBM Watson AI Lab, IBM Research, USA
{guanhzhang,icebai,joshualiang,kunbai}@tencent.com, [email protected], [email protected], {chzhu,tjzhao}@hit.edu.cn

Abstract

Natural Language Sentence Matching (NLSM) has gained substantial attention from both academia and industry, and rich public datasets have contributed a lot to this progress. However, biased datasets can also hurt the generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this sampling procedure can easily bring an unintended pattern, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naïve features are unreasonably predictive. Such features are the reflection of the selection bias and are termed "leakage features." In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four of them are significantly biased. We further propose a training and evaluation framework to alleviate the bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of trained models and give more trustworthy evaluation results for real-world adoption.

1 Introduction

Natural Language Sentence Matching (NLSM) aims at comparing two sentences and identifying their relationship (Wang et al., 2017), and serves as the core of many NLP tasks such as question answering and information retrieval (Wang et al., 2016b). Natural Language Inference (NLI) (Bowman et al., 2015) and Semantic Textual Similarity (STS) (Wang et al., 2016b) are both typical NLSM problems. A large number of publicly available datasets have benefited the research to a great extent (Kim et al., 2018; Wang et al., 2017; Tien et al., 2018), including QuoraQP1, SNLI (Bowman et al., 2015), SICK (Marelli et al., 2014), etc. These datasets provide resources for both training and evaluation of different algorithms (Torralba and Efros, 2011).

However, most of the datasets are prepared by conducting procedures that involve a sampling process, which can easily introduce a selection bias (Heckman, 1977; Zadrozny, 2004). It gets even worse when the bias can reveal the label information, resulting in "leakage features," which are irrelevant to the content/semantics of the sentences but are predictive of the label. One example is QuoraQP, a dataset for classifying whether two sentences are duplicated (labeled as 1) or not (labeled as 0), which has been widely used to evaluate STS models (Gong et al., 2017; Kim et al., 2018; Wang et al., 2017; Devlin et al., 2018). In QuoraQP, three leakage features have been identified: S1_freq, the number of occurrences of the first sentence in the dataset; S2_freq, the number of occurrences of the second sentence; and S1S2_inter, the number of sentences that are paired with both the first and the second sentences in the dataset for comparison.

Figure 1: Visualization of the distributions of normalized features versus the label in QuoraQP. The right part (in red) represents the distributions of duplicated pairs, and the left part (in blue) represents the distributions of not duplicated pairs. Best viewed in color.

∗ Equal contributions from both authors. This work was done when Guanhua Zhang was an intern at Tencent.
1 https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4418–4429, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics
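The three leakage features can be reproduced from the pair list alone, without ever reading sentence content. The following is a minimal sketch, not the authors' implementation; the function name is ours, and counting a sentence's occurrences over both positions of the pair list is our assumption:

```python
from collections import Counter, defaultdict

def leakage_features(pairs):
    """Compute (S1_freq, S2_freq, S1S2_inter) for every sentence pair.

    `pairs` is a list of (s1, s2) tuples of hashable sentence IDs.
    Only the comparing relationships are used, never the content.
    Assumption: occurrences are counted regardless of position
    (first or second) in the pair.
    """
    freq = Counter()              # occurrences of each sentence
    neighbors = defaultdict(set)  # sentences each sentence is paired with
    for s1, s2 in pairs:
        freq[s1] += 1
        freq[s2] += 1
        neighbors[s1].add(s2)
        neighbors[s2].add(s1)

    # S1S2_inter counts sentences compared against both s1 and s2.
    return [(freq[s1], freq[s2], len(neighbors[s1] & neighbors[s2]))
            for s1, s2 in pairs]
```

For example, with the pairs (1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 6), the pair (1, 2) gets S1_freq = 3, S2_freq = 4, and S1S2_inter = 1, since only sentence 3 is compared against both.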
Figure 1 shows the distributions of the normalized (negative) Word Mover's Distance (WMD) (Kusner et al., 2015) and the normalized leakage features versus the labels in QuoraQP. The features are all normalized to their quantiles. As illustrated, the leakage features are more predictive than the WMD, as the differences between the distributions of positive and negative pairs are more significant. Moreover, combining S1_freq and S2_freq can make even more accurate predictions, as illustrated in Figure 2, where we calculate the averages of the labels under different S1_freq and S2_freq. We find that when both features' values are large, the pairs tend to be duplicated (marked in red), while when one is large and the other is small, the pairs tend to be not duplicated (marked in blue).

Figure 2: The averages of the labels under different S1_freq and S2_freq. Red squares indicate that the averages are close to 1, and blue squares indicate that the averages are close to 0. Best viewed in color.

These leakage features play a critical role in the QuoraQP competition2. As the evaluations are conducted on the same biased datasets, models that fit the bias pattern can take additional advantage over unbiased models, making the benchmark results untrustworthy. On the other hand, the bias pattern does not exist in the real world, so if a model fits the bias pattern (intentionally or unintentionally), its generalization performance will be hurt, limiting the value of these datasets for further applications (Torralba and Efros, 2011).

In this paper, we study this problem and demonstrate the impact of the selection bias through a series of experiments. We focus on the selection bias embodied in the comparing relationships of sentences, and the main contributions of this paper are the answers to the following questions:

• Does selection bias exist in other NLSM datasets? We identify four out of six publicly available datasets that suffer from the selection bias.
• Would Deep Neural Network (DNN)-based methods learn from the bias pattern unintentionally? We find that Siamese-LSTM models trained on QuoraQP do capture the bias pattern.
• Can we help the model learn the useful semantic pattern from the content without fitting the bias pattern? We propose an easy-to-adopt method to mitigate the bias. Experiments show that this method can improve the generalization performance of the trained models.
• Can we build an evaluation framework that gives more reliable results for real-world adoption? We propose a more trustworthy evaluation method that demonstrates consistent results with unbiased cross-dataset evaluations.

The rest of the paper is organized as follows. Section 2 gives an empirical look at the selection bias on a variety of NLSM datasets and analyzes why the leakage features are effective. Section 3 examines whether DNN-based methods fit the bias pattern unintentionally. Section 4 introduces the training and evaluation framework to alleviate the biasedness. Taking QuoraQP as an example, we report the experimental results in Section 5. Section 6 summarizes related work, and Section 7 draws the conclusion.

2 Empirical Study of the Selection Bias

In this section, we investigate the problem of selection bias on six NLSM datasets and then analyze why the leakage features are effective.

2.1 Quantifying the Biasedness in Datasets

To quantify the severity of the leakage from the selection bias, we formulate a toy problem for NLSM: we predict the semantic relationship of two sentences based only on the comparing relationships between sentences.
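As an aside on the Figure 1 setup above: "normalized to their quantiles" can be read as a rank transform of each feature onto [0, 1]. The sketch below implements that reading; the interpretation and the tie-handling choice are our assumptions, not the paper's stated recipe:

```python
def quantile_normalize(values):
    """Map each raw value to its empirical quantile in [0, 1].

    Ties receive the average rank of their block, so equal raw values
    get equal quantiles. This rank-transform reading of "normalized
    to their quantiles" is an assumption, not the paper's recipe.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    quantiles = [0.0] * n
    i = 0
    while i < n:
        # find the block [i, j] of equal values and average its ranks
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            quantiles[order[k]] = avg_rank / (n - 1) if n > 1 else 0.0
        i = j + 1
    return quantiles
```

Under this transform, a feature's raw scale disappears entirely, so the leakage features and the WMD become directly comparable on one axis, as in Figure 1.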
We refer to the semantic relationship of two sentences as their labels (for example, duplicated for STS and entailment for NLI), and to the comparing relationship as whether they are paired for comparison in the dataset.

2 https://www.kaggle.com/c/quora-question-pairs/discussion/34355 and https://www.kaggle.com/c/quora-question-pairs/discussion/33168

Method                  SNLI    MultiNLI           QuoraQP   MSRP    SICK            ByteDance
                                Matched  Mismatched                  NLI     STS
Majority                33.7    35.6     36.5      50.00     66.5    56.7    50.3    68.59
Unlexicalized           47.7    44.9     45.5      68.20     73.9    70.1    70.2    75.23
LSTM                    77.6∗   66.9†    66.9†     82.58‡    70.6§   71.3▷   70.2    86.45
Leakage                 36.6    32.1     31.1      79.63     66.7    56.7    55.5    78.24
Advanced                39.1    32.7     33.8      80.47     67.9    57.5    56.3    85.73
Leakage vs Majority     +8.61   -9.83    -14.79    +59.26    +0.30   0.00    +10.34  +14.07
Advanced vs Majority    +16.02  -8.15    -7.40     +60.94    +2.11   +1.41   +11.93  +24.99

Table 1: The accuracy scores of predicting the label with unlexicalized features, leakage features, and advanced graph-based features, and the relative improvements. The result with ∗ is from Bowman et al. (2015). Results with † are from Williams et al. (2018). The result with ‡ is from Wang et al. (2017). The result with § is from Shen et al. (2018). The result with ▷ is from Baudiš et al. (2016). Other results are based on our implementations. "%" is omitted.

Sentence1 ID   Sentence2 ID   Label
1              2              ?
1              3              ?
1              5              ?
2              3              ?
2              4              ?
2              6              ?

Figure 3: Illustration of the graph built for Problem 1. We only use the comparing relationships to build the graph.

… ByteDance3. We apply two different methods to classify the edges on the graph: Leakage, which uses the three leakage features introduced in Section 1, and Advanced, which uses some more advanced graph-based features (Perozzi et al., 2014; Zhou et al., 2009; Liben-Nowell and Kleinberg, 2007) together with the three leakage features4. We also report the results of three baselines, including Majority, which predicts the most frequent label, and Unlexicalized, which uses 15 handcrafted features from the content of the sentences (Bowman et al., 2015) (e.g.,
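The excerpt cites the link-prediction literature for the Advanced method's graph-based features without listing them. As an illustration only, classic scores from Liben-Nowell and Kleinberg (2007), such as common neighbors, the Jaccard coefficient, and the Adamic–Adar index, can be computed on the comparison graph of Figure 3:

```python
import math
from collections import defaultdict

def graph_edge_features(pairs):
    """Classic link-prediction scores for each edge of the comparison
    graph: (common neighbors, Jaccard coefficient, Adamic-Adar index).

    Illustrative only: the paper cites Liben-Nowell and Kleinberg
    (2007) but does not spell out its exact `Advanced` feature set
    in this excerpt, so this selection is an assumption.
    """
    nbrs = defaultdict(set)
    for u, v in pairs:
        nbrs[u].add(v)
        nbrs[v].add(u)

    feats = {}
    for u, v in pairs:
        common = nbrs[u] & nbrs[v]
        union = nbrs[u] | nbrs[v]
        jaccard = len(common) / len(union) if union else 0.0
        # Adamic-Adar weights rare shared neighbors more heavily;
        # degree-1 neighbors are skipped to avoid log(1) = 0.
        adamic_adar = sum(1.0 / math.log(len(nbrs[w]))
                          for w in common if len(nbrs[w]) > 1)
        feats[(u, v)] = (len(common), jaccard, adamic_adar)
    return feats
```

Fed into an edge classifier alongside the three leakage features, scores of this kind would play the role of the Advanced method's inputs; note that, like the leakage features, they use only the comparing relationships and never the sentence content.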