arXiv:2108.13854v1 [cs.CL] 31 Aug 2021

Contrastive Domain Adaptation for Question Answering using Limited Text Corpora

Zhenrui Yue, Bernhard Kratzwald, Stefan Feuerriegel
Technical University of Munich, ETH AI Center, ETH Zurich
[email protected] [email protected] [email protected]

Abstract

Question generation has recently shown impressive results in customizing question answering (QA) systems to new domains. These approaches circumvent the need for manually annotated training data from the new domain and, instead, generate synthetic question-answer pairs that are used for training. However, existing methods for question generation rely on large amounts of synthetically generated datasets and costly computational resources, which render these techniques widely inaccessible when the text corpora are of limited size. This is problematic as many niche domains rely on small text corpora, which naturally restricts the amount of synthetic data that can be generated. In this paper, we propose a novel framework for domain adaptation called contrastive domain adaptation for QA (CAQA). Specifically, CAQA combines techniques from question generation and domain-invariant learning to answer out-of-domain questions in settings with limited text corpora. Here, we train a QA system on both source data and generated data from the target domain with a contrastive adaptation loss that is incorporated in the training objective. By combining techniques from question generation and domain-invariant learning, our model achieved considerable improvements compared to state-of-the-art baselines.

1 Introduction

Question answering (QA) systems generate answers to questions over text. Formally, such systems are nowadays trained end-to-end to predict answers conditional on an input question and a context paragraph (e.g., Seo et al., 2016; Chen et al., 2017a; Devlin et al., 2019). Therein, every QA sample is a 3-tuple consisting of a question, a context, and an answer. In this paper, we consider the subproblem of extractive QA, where the task is to extract answer spans from an unstructured context for a given question as input. In extractive QA, both question and context are represented by running text, while the answer is defined by a start position and an end position in the context.

An existing challenge for extractive QA systems is the distributional change between training data (source domain) and test data (target domain). If there is such a distribution change, the performance on test data is likely to be impaired. In practice, this issue occurs because users, for instance, formulate text in highly diverse language or use QA for previously unseen domains (Hazen et al., 2019; Miller et al., 2020). As a result, out-of-domain (OOD) samples occur that diverge from the training corpora of QA systems (which can be traced back to the invariance of the training data) and, upon deployment, lead to a drastic drop in the accuracy of QA systems.

One solution to the above-mentioned challenge of a domain shift is to generate synthetic data from the corpora of the target domain using models for question generation and then use the synthetic data during training (e.g., Lee et al., 2020; Shakeri et al., 2020). For this purpose, generative models have been adopted to produce synthetic data as surrogates from the target domain, so that the QA system can be trained with both data from the source domain and synthetic data, which helps to achieve better results on the out-of-domain data distribution (Puri et al., 2020; Lee et al., 2020; Shakeri et al., 2020); see Figure 1 for an overview of such an approach. Nevertheless, large quantities of synthetic data require intensive computational resources. Moreover, many niche domains rely upon limited text corpora. Their limited size restricts the amount of synthetic data that can be generated and, as we shall see later, renders the aforementioned approach largely ineffective for limited text corpora.

Figure 1: Overview of a common framework for QA domain adaptation. A question generation model is used to generate synthetic target data, which can be used for training the QA model with source data. The resulting QA model can answer target questions upon deployment.

In computer vision, some works draw upon another approach for domain adaptation, namely discrepancy reduction of representations (Long et al., 2013; Tzeng et al., 2014; Long et al., 2015, 2017; Kang et al., 2019). Here, an adaptation loss or adversarial training approaches are often designed to learn domain-invariant features, so that the model can transfer learnt knowledge from the source domain to the target domain. However, the aforementioned approach for domain adaptation was designed for computer vision tasks and, to the best of our knowledge, has not yet been tailored for QA.

In this paper, we develop a framework for answering out-of-domain questions in QA settings with limited text corpora. We refer to our proposed framework as contrastive domain adaptation for question answering (CAQA). CAQA combines question generation and contrastive domain adaptation to learn domain-invariant features, so that it can capture both domains and thus transfer knowledge to the target distribution. This is in contrast to existing question generation, where synthetic data is solely used for joint training with the source data but without explicitly accounting for domain shifts, thus explaining why CAQA improves the performance in answering out-of-domain questions. For this, we propose a novel contrastive adaptation loss that is tailored to QA. The contrastive adaptation loss uses maximum mean discrepancy (MMD) to measure the discrepancy in the representation between source and target features, which is reduced while it simultaneously separates answer tokens for answer extraction.¹

¹ The code from our CAQA framework is publicly available via https://github.com/Yueeeeeeee/CAQA

The main contributions of our work are:
1. We propose a novel framework for domain adaptation in QA called CAQA. To the best of our knowledge, this is the first use of contrastive approaches for learning domain-invariant features in QA systems.
2. Our CAQA framework is particularly effective for limited text corpora. In such settings, we show that CAQA can transfer knowledge to the target domain without additional training cost.
3. We demonstrate that CAQA can effectively answer out-of-domain questions. CAQA outperforms the current state-of-the-art baselines for domain adaptation by a significant margin.

2 Related Work

The performance of extractive question answering systems (e.g., Chen et al., 2017b; Kratzwald et al., 2019; Zhang et al., 2020) is known to deteriorate when the training data (source domain) differs from the data used during testing (target domain) (Talmor and Berant, 2019). Approaches to adapt QA systems to a certain domain can be divided into (1) supervised approaches, where one has access to labeled data from the target domain (i.e., transfer learning; Kratzwald and Feuerriegel, 2019), and (2) unsupervised approaches, where no labeled information is accessible. The latter is our focus. Unsupervised approaches are primarily based on question generation techniques where one generates synthetic training data for the target domain.

Question generation (QG): Question generation (Rus et al., 2010) is the task of generating synthetic QA pairs from raw text data. Several approaches have been developed to generate synthetic questions in QA. Du et al. (2017) propose an end-to-end seq2seq encoder-decoder for the generation. Recently, question generation and answer generation have been viewed as dual tasks and combined in various ways. Tang et al. (2017) train both simultaneously; Golub et al. (2017) split the process into two consecutive stages; and Tang et al. (2018) use policy gradient to improve between-task learning.

Question generation is a common technique for domain adaptation in QA. Here, the generated questions are used to fine-tune QA systems to the new target domain (Dhingra et al., 2018). Oftentimes, only a subset of generated questions is selected to increase the quality of the generated data. Common approaches are based on curriculum learning (Sachan and Xing, 2018); roundtrip consistency, where samples are selected when the predicted answers match the generated answer (Alberti et al., 2019); iterative refinement (Li et al., 2020); and conditional priors (Lee et al., 2020).

Unsupervised domain adaptation: A large body of work on unsupervised domain adaptation has been done in the area of computer vision, where the representation discrepancy between a labeled source dataset and an unlabeled target dataset is reduced (e.g., Tzeng et al., 2014; Saito et al., 2018; Long et al., 2015). Recent approaches are often based on adversarial learning, where one minimizes the distance between feature distributions in both the source and target domain, while simultaneously minimizing the error in the labeled source domain (e.g., Long et al., 2017; Tzeng et al., 2017).

input is given by:

• Training data from source domain: We are given labeled data from the source domain $X_s$. Each sample $x_s^{(i)} \in X_s$ from the source domain $D_s$ comprises a 3-tuple with a question $x_{s,q}^{(i)}$, a context $x_{s,c}^{(i)}$, and an answer $x_{s,a}^{(i)}$.

• Target contexts: We have access to target domain data. Yet, of note, the data is unlabeled. That is, we only have access to the contexts. We further assume that the amount of target contexts is limited. Let $X_t'$ denote the unlabeled target data, where each sample $x_t^{(i)} \in X_t'$ from the target domain $D_t$ consists of only a context $x_{t,c}^{(i)}$.

Objective: Upon deployment, we aim at maximizing the performance of the QA system when answering questions from the target domain $D_t$, that is, minimizing the cross-entropy loss of the
