PERL: Pivot-Based Domain Adaptation for Pre-Trained Deep Contextualized Embedding Models

PERL: Pivot-based Domain Adaptation for Pre-trained Deep Contextualized Embedding Models Eyal Ben-David ∗ Carmel Rabinovitz ∗ Roi Reichart Technion, Israel Institute of Technology {[email protected]@campus.jroiri@}technion.ac.il Abstract et al., 2010; Jiang and Zhai, 2007; McClosky et al., 2010; Rush et al., 2012; Schnabel and Schütze, Pivot-based neural representation models 2014). Our focus in this paper is on unsuper- have lead to significant progress in do- vised DA, the setup we consider most realistic. In main adaptation for NLP. However, previous this setup labeled data is available only from the works that follow this approach utilize only labeled data from the source domain and un- source domain while unlabeled data is available labeled data from the source and target do- from both the source and the target domains. mains, but neglect to incorporate massive While various approaches for DA have been unlabeled corpora that are not necessarily proposed (§2), with the prominence of deep neu- drawn from these domains. To alleviate this, ral network (DNN) modeling, attention has been we propose PERL: A representation learn- recently focused on representation learning ap- ing model that extends contextualized word embedding models such as BERT (Devlin proaches. Within representation learning for un- et al., 2019) with pivot-based fine-tuning. supervised DA, two approaches have been shown PERL outperforms strong baselines across particularly useful. In one line of work, DNN- 22 sentiment classification domain adap- based methods which employ compress-based tation setups, improves in-domain model noise reduction to learn cross-domain features performance, yields effective reduced-size have been developed (Glorot et al., 2011; Chen 12 models and increases model stability. et al., 2012). In another line of work, methods based on the distinction between pivot and non- 1 Introduction pivot features (Blitzer et al., 2006, 2007) learn a joint feature representation for the source and Natural Language Processing (NLP) algorithms the target domains. Later on, Ziser and Reichart are constantly improving, gradually approaching (2017, 2018), and Li et al.(2018) married the human level performance (Dozat and Manning, two approaches and achieved substantial improve- 2017; Edunov et al., 2018; Radford et al., 2018). ments on a variety of DA setups. However, those algorithms often depend on the Despite their success, pivot-based DNN mod- availability of large amounts of manually anno- els still only utilize labeled data from the source tated data from the domain where the task is per- domain and unlabeled data from both the source formed. Unfortunately, collecting such annotated and the target domains, but neglect to incorporate data is often costly and laborious, which substan- arXiv:2006.09075v1 [cs.CL] 16 Jun 2020 massive unlabeled corpora that are not necessarily tially limits the applicability of NLP technology. drawn from these domains. With the recent game- Domain Adaptation (DA), training an algorithm changing success of contextualized word embed- on annotated data from a source domain so that ding models trained on such massive corpora (De- it can be effectively applied to other target do- vlin et al., 2019; Peters et al., 2018), it is natu- mains, is one of the ways to solve the above bot- ral to ask whether information from such corpora tleneck. Indeed, over the years substantial efforts can enhance these DA methods, particularly that have been devoted to the DA challenge (Roark and background knowledge from non-contextualized Bacchiani, 2003; III and Marcu, 2006; Ben-David embeddings has shown useful for DA (Plank and ∗ * Both authors equally contributed to this work. Moschitti, 2013; Nguyen et al., 2015). 1Our code is at https://github.com/eyalbd2/ PERL. In this paper we hence propose an unsupervised 2This paper was accepted to TACL in June 2020 DA approach that extends leading approaches based on DNNs and pivot-based ideas, so that 2 Background and Previous Work they can incorporate information encoded in massive corpora (§3). Our model, named PERL: There are several approaches to DA, including in- Pivot-based Encoder Representation of Language, stance re-weighting (Sugiyama et al., 2007; Huang builds on massively pre-trained contextualized et al., 2006; Mansour et al., 2008), sub-sampling word embedding models such as BERT (Devlin from the participating domains (Chen et al., 2011) et al., 2019). To adjust the representations learned and DA through representation learning, where a by these models so that they close the gap be- joint representation is learned based on texts from tween the source and target domains, we fine- the source and target domains (Blitzer et al., 2007; tune their parameters using a pivot-based variant Xue et al., 2008; Ziser and Reichart, 2017, 2018). of the Masked Language Modeling (MLM) objec- We first describe the unsupervised DA pipeline, tive, optimized on unlabeled data from both the continue with representation learning methods for source and the target domains. We further present DA with a focus on pivot-based methods, and, fi- R-PERL (regularized PERL) which facilitates pa- nally, describe contextualized embedding models. rameter sharing for pivots with similar meaning. Unsupervised Domain Adaptation through Representation Learning As said in §1 our fo- We perform extensive experimentation in vari- cus in this work is on unsupervised DA through ous unsupervised DA setups of the task of binary representation learning. A common pipeline for sentiment classification (§4,5). First, for com- this setup consists of two steps: (A) Learning a patibility with previous work, we experiment with representation model (often referred to as the en- the legacy product review domains of Blitzer et al. coder) using the source and target unlabeled data; (2007) (12 setups). We then experiment with more and (B) Training a supervised classifier on the challenging setups, adapting between the above source domain labeled data. To facilitate domain domains and the airline review domain (Nguyen, adaptation, every text fed to the classifier in the 2015) used in Ziser and Reichart(2018) (4 se- second step is first represented by the pre-trained tups), as well as the IMDB movie review domain encoder. This is performed both when the classi- (Maas et al., 2011) (6 setups). We compare PERL fier is trained in the source domain and when it is to the best performing pivot-based methods (Ziser applied to new text from the target domain. and Reichart, 2018; Li et al., 2018) and to DA Exceptions to this pipeline are end-to-end mod- approaches that fine-tune a massively pre-trained els that jointly learn to perform the cross-domain BERT model by optimizing its standard MLM ob- text representation and the classification task. This jective using target-domain unlabeled data (Lee is achieved by training a unified objective on the et al., 2020; Han and Eisenstein, 2019). PERL source domain labeled data and the unlabeled data and R-PERL substantially outperform these base- from both the source and the target. Among these lines, emphasizing the additive effect of massive models are domain adversarial networks (Ganin pre-training and pivot-based fine-tuning. et al., 2016), which were strongly outperformed by Ziser and Reichart(2018) to which we compare our methods, and the hierarchical attention As an additional contribution, we show that transfer network (HATN, (Li et al., 2018)), which pivot-based learning is effective beyond improv- is one of our baselines (see below). ing domain adaptation accuracy. Particularly, we Unsupervised DA through representation learn- show that an in-domain variant of PERL substan- ing has followed two main avenues. The first av- tially improves the in-domain performance of a enue consists of works that aim to explicitly build BERT-based sentiment classifier, for varying train- a feature representation that bridges the gap be- ing set sizes (from 100 to 20K labeled examples). tween the domains. A seminal framework in this We also show that PERL facilitates the generation line is structural correspondence learning (SCL, of effective reduced-size DA models. Finally, we (Blitzer et al., 2006, 2007)), that splits the fea- perform an extensive ablation study (§6) that un- ture space into pivot and non-pivot features. A covers PERL’s crucial design choices and demon- large number of works have followed this idea strates the stability of PERL to hyper-parameter (e.g. (Pan et al., 2010; Gouws et al., 2012; Bol- selection compared to other DA methods. legala et al., 2015; Yu and Jiang, 2016; Li et al., 2017, 2018; Tu and Wang, 2019; Ziser and Re- plied to the target domain, PBLM is employed as a ichart, 2017, 2018)) and we discuss it below. contextualized word embedding layer. Notice that Works in the second avenue learn cross-domain PBLM is not pre-trained on massive out of (source representations by training autoencoders (AEs) on and target) domain corpora, and its single-layer, the unlabeled data from the source and target do- unidirectional LSTM architecture is probably not mains. This way they hope to get a more robust ideal for knowledge encoding from such corpora. representation, which is hopefully better suited for Another work in this line is HATN (Li et al., DA. Examples for such models include the stacked 2018). This model automatically learns the denoising AE (SDA, (Vincent et al., 2008; Glorot pivot/non-pivot distinction, rather than following et al., 2011), the marginalized SDA and its variants the SCL definition as Ziser and Reichart(2017, (MSDA, (Chen et al., 2012; Yang and Eisenstein, 2018) did. HATN consists of two hierarchical at- 2014; Clinchant et al., 2016)) and variational AE tention networks, P-net and NP-net. First, it trains based models (Louizos et al., 2016). the P-net on the source labeled data. Then, it de- Recently, Ziser and Reichart(2017, 2018) and codes the most prominent tokens of P-net (i.e.

Load more