Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification

Xin Dong¹, Yaxin Zhu¹, Yupeng Zhang¹, Zuohui Fu¹, Dongkuan Xu², Sen Yang³, Gerard de Melo¹
¹ Rutgers University   ² The Pennsylvania State University   ³ LinkedIn
{xd48,yz956,yupeng.zhang,zuohui.fu}@rutgers.edu, [email protected], [email protected], [email protected]

ABSTRACT

In cross-lingual text classification, one seeks to exploit labeled data from one language to train a text classification model that can then be applied to a completely different language. Recent multilingual representation models have made it much easier to achieve this. Still, there may be subtle differences between languages that are neglected when doing so. To address this, we present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations. The resulting model then serves as a teacher to induce labels for unlabeled target language samples that can be used during further adversarial training, allowing us to gradually adapt our model to the target language. Compared with a number of strong baselines, we observe significant gains in effectiveness on document and intent classification for a diverse set of languages.

CCS CONCEPTS

• Computing methodologies → Natural language processing; • Information systems → Document representation.

KEYWORDS

Text Classification, Multilingual, Cross-Lingual, Semantics

ACM Reference Format:
Xin Dong, Yaxin Zhu, Yupeng Zhang, Zuohui Fu, Dongkuan Xu, Sen Yang, Gerard de Melo. 2020. Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20), July 25–30, 2020, Virtual Event, China. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3397271.3401209

1 INTRODUCTION

Background. Text classification has become a fundamental building block in modern information systems, and there is an increasing need to be able to classify texts in a wide range of languages. However, as organizations target an increasing number of markets, it can be challenging to collect new task-specific training data for each new language that is to be supported.

To overcome this, cross-lingual systems rely on training data from a source language to train a model that can be applied to entirely different target languages [3], alleviating the training bottleneck for low-resource languages. Traditional cross-lingual text classification approaches have often relied on dictionaries, lexical knowledge graphs, or parallel corpora to find connections between words and phrases in different languages [3]. Recently, based on deep neural approaches such as BERT [4], there have been important advances in learning joint multilingual representations with self-supervised objectives [1, 4, 11]. These have enabled substantial progress for cross-lingual training, by mapping textual inputs from different languages into a common vector representation space [16]. With models such as Multilingual BERT [4], the obtained vector representations for English and Thai language documents, for instance, will be similar if they discuss similar matters.

Still, recent empirical studies [12, 19] show that these representations do not bridge all differences between languages. While it is possible to invoke multilingual encoders to train a model on English training data and then apply it to documents in a language such as Thai, the model may not work as well when applied to Thai document representations, since the latter are likely to diverge from the English representation distribution in subtle ways.

In this work, we propose a semi-supervised adversarial perturbation framework that encourages the model to be more robust towards such divergence and to better adapt to the target language. Adversarial training is a method to learn to resist small adversarial perturbations that are added to the input so as to maximize the loss incurred by neural networks [9, 20].
Nevertheless, the gains observed from adversarial training in previous work have been limited, because it is merely invoked as a form of monolingual regularization. Our results show that adversarial training is particularly fruitful in a cross-lingual framework that also exploits unlabeled data via self-learning.

Overview and Contributions. Our model begins by learning just from the available source language samples, drawing on a multilingual encoder with added adversarial perturbation. Without loss of generality, in the following, we assume English to be the source language. After training on English, we subsequently use the same model to make predictions on unlabeled non-English samples, and the samples with the highest-confidence prediction scores are repurposed to serve as labeled examples for the next iteration of adversarial training, until the model converges.

The adversarial perturbation improves robustness and generalization by regularizing our model. At the same time, because adversarial training makes tiny perturbations that barely affect the prediction result, the perturbations on words during self-learning can be viewed as inducing a form of code-switching, which replaces some original source language words with nearby non-English representations.

Based on this combination of adversarial training and semi-supervised self-learning techniques, the model evolves to become more robust with regard to differences between languages. We demonstrate the superiority of our framework on the Multilingual Document Classification corpus (MLDoc) [18] in comparison with state-of-the-art baselines.
Our study then proceeds to show that our method outperforms other methods on cross-lingual dialogue intent classification from English to Spanish and Thai [17]. This shows that our semi-supervised adversarial framework is more effective than previous approaches at cross-lingual transfer for domain-specific tasks, based on a mix of labeled and unlabeled data via adversarial training on multilingual representations.

2 METHOD

Overview of the Method. Our proposed method consists of two parts, as illustrated in Figure 1. The backbone is a multilingual classifier, which includes a pretrained multilingual encoder f_n(·; θ_n) and a task-specific classification module f_cl(·; θ_cl). By adopting an encoder that (to a certain degree) shares representations across languages, we obtain a universal text representation h ∈ R^d, where d is the dimensionality of the text representation. The classification module f_cl(·; θ_cl) is applied for fine-tuning on top of the pretrained model f_n(·; θ_n). It applies a linear function mapping h ∈ R^d to R^|Y|, followed by a softmax, where Y is the set of target classes.

Figure 1: Illustration of the self-learning process with adversarial training for cross-lingual classification. (The figure depicts a loop: the training dataset is used to train the multilingual classifier under adversarial perturbations; the classifier labels the unlabeled dataset; a selection mechanism picks high-confidence predictions; and the selected examples are merged back into the training dataset for further adversarial training.)
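As a concrete illustration of this backbone, the following is a minimal sketch that pairs a Hugging Face Transformers-style Multilingual BERT encoder with a linear classification head. The class name, the use of the first ([CLS]) token vector as the text representation h, and the 4-class output size (matching MLDoc's topic labels) are illustrative assumptions rather than the authors' released code.

```python
import torch.nn as nn
from transformers import AutoModel  # assumes the Hugging Face Transformers library

class MultilingualClassifier(nn.Module):
    """Pretrained multilingual encoder f_n(.; theta_n) plus a linear head f_cl(.; theta_cl)."""

    def __init__(self, encoder_name="bert-base-multilingual-cased", num_classes=4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)      # f_n, shared across languages
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)  # maps h to R^|Y|

    def forward(self, input_ids=None, attention_mask=None, inputs_embeds=None):
        # inputs_embeds is exposed so that perturbations can later be added
        # directly to the embedding vectors v_1 .. v_T.
        outputs = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask,
                               inputs_embeds=inputs_embeds)
        h = outputs.last_hidden_state[:, 0]   # text representation h in R^d ([CLS] token, an assumption)
        return self.classifier(h)             # softmax is applied inside the cross-entropy loss
```

Exposing inputs_embeds is what allows the adversarial perturbation described next to be applied to the embedding vectors rather than to discrete tokens.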
Adversarial Training. Our adversarial self-learning process proceeds as follows. First, we train the entire network f(·; θ) for K epochs using a set of labeled data L = {(x_i, y_i) | i = 1, ..., n} from the source language, where n is the number of labeled instances, x_i ∈ X consists of the embedding vectors [v_1, v_2, ..., v_T] of one instance (T is the length of the sequence), and y_i ∈ Y is the corresponding ground-truth label.

Adversarial training is motivated by the idea of making the model robust against adversarial examples. It is well known that deep neural networks are susceptible to perturbed inputs that have been deliberately constructed to fool the network into making a misclassification [20]. Adversarial training is based on the notion of making the model robust against such perturbation, i.e., against an imagined adversary that seeks out minuscule changes to an input that lead to a misclassification, assuming that the class label should not actually be affected by such minuscule changes. To perform adversarial training, the loss function becomes

\[
L_{\mathrm{adv}}(x_i, y_i) = L\big(f(x_i + r_{\mathrm{adv}}; \theta),\, y_i\big), \qquad
r_{\mathrm{adv}} = \operatorname*{arg\,max}_{r,\ \lVert r \rVert \le \epsilon} L\big(f(x_i + r; \tilde{\theta}),\, y_i\big).
\tag{1}
\]

Here r is a perturbation of the input and θ̃ is a set of parameters set to match the current parameters of the entire network, but ensuring that gradient propagation only proceeds through the adversarial example construction process. At each step of training, the worst-case perturbation r_adv is calculated against the current model f(x_i; θ̃) in Equation 1, and we train the model to be robust to such perturbations by minimizing Equation 1 with respect to θ. We later empirically confirm that adding random noise instead of seeking such adversarial worst-case perturbations does not bring about similar gains (Section 3.2).

Generally, we cannot obtain a closed form for the exact perturbation r_adv, but Goodfellow et al. [9] proposed to approximate this worst-case perturbation by linearizing the loss around x_i. With a linear approximation and an L2 norm constraint, the adversarial perturbation in Equation 2 is

\[
r_{\mathrm{adv}} \approx \epsilon\, \frac{g}{\lVert g \rVert_2}, \qquad
g = \nabla_{x_i} L\big(f(x_i; \tilde{\theta}),\, y_i\big).
\tag{2}
\]

During the actual training, we optimize the adversarial loss in Equation 1, based on the adversarial perturbation defined by Equation 2, in each step.
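The following is a minimal sketch of one such adversarial training step in the spirit of Equations 1 and 2, assuming the PyTorch classifier sketched above. The per-example L2 normalization, the function name, and leaving the optimizer calls to the caller are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, input_ids, attention_mask, labels, epsilon=1.0):
    """One training step in the spirit of Eq. (1) and Eq. (2): construct r_adv from the
    gradient w.r.t. the input embeddings, then minimize the loss on the perturbed input."""
    # Embedding vectors v_1 .. v_T for the batch.
    embeds = model.encoder.get_input_embeddings()(input_ids)

    # Forward pass under the current parameters (theta~ in Eq. 1); we only need
    # the gradient with respect to the embeddings to build the perturbation.
    logits = model(attention_mask=attention_mask, inputs_embeds=embeds)
    loss = F.cross_entropy(logits, labels)
    (g,) = torch.autograd.grad(loss, embeds, retain_graph=True)

    # Eq. (2): r_adv ~= epsilon * g / ||g||_2, here normalized per example; detached so
    # that no gradient flows back through the perturbation construction itself.
    g_norm = g.flatten(start_dim=1).norm(p=2, dim=1).clamp_min(1e-12).view(-1, 1, 1)
    r_adv = (epsilon * g / g_norm).detach()

    # Eq. (1): adversarial loss, minimized with respect to the parameters theta.
    adv_logits = model(attention_mask=attention_mask, inputs_embeds=embeds + r_adv)
    adv_loss = F.cross_entropy(adv_logits, labels)
    adv_loss.backward()
    return adv_loss.item()
```

A full training run would wrap this step with optimizer.zero_grad() and optimizer.step(); the default epsilon of 1.0 corresponds to the value in Table 2 (with ε = 10 for the Japanese experiment).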

Self-Learning. Subsequently, in order to encourage the model to adapt specifically to the target language, the next step is to make predictions for the unlabeled instances in U = {x_u | u = 1, ..., m}. We can then incorporate unlabeled target language data with high classification confidence scores into the training set. To ensure robustness, we adopt a balanced selection mechanism: for each class, we select a separate subset {x_s | s = 1, ..., K_t} of the unlabeled data, consisting of the top K_t highest-confidence items according to the currently trained model. The union U_s of the selected items is merged into the training set L, and we then retrain the model, again with adversarial perturbation. This process is repeated iteratively until some termination criterion is met.
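A sketch of the balanced selection mechanism and the outer self-learning loop is given below, again assuming the classifier sketched above. The data-loader format, the callables train_fn and dev_eval, and the patience-based stopping rule (two non-improving self-learning iterations, cf. Section 3.1) are illustrative assumptions about how the pieces fit together.

```python
import torch
import torch.nn.functional as F

def select_balanced(model, unlabeled_loader, k_t, num_classes, device="cpu"):
    """Balanced selection: per class, keep the K_t unlabeled target-language examples
    with the highest softmax confidence under the current model, pseudo-labeled
    with the model's prediction."""
    model.eval()
    scored = []                                            # (confidence, predicted class, example)
    with torch.no_grad():
        for batch in unlabeled_loader:                     # assumed to yield dicts of tensors
            logits = model(input_ids=batch["input_ids"].to(device),
                           attention_mask=batch["attention_mask"].to(device))
            conf, pred = F.softmax(logits, dim=-1).max(dim=-1)
            for c, p, ids, mask in zip(conf, pred, batch["input_ids"], batch["attention_mask"]):
                scored.append((c.item(), p.item(),
                               {"input_ids": ids, "attention_mask": mask, "label": p.item()}))
    selected = []
    for cls in range(num_classes):
        top = sorted((s for s in scored if s[1] == cls), key=lambda s: -s[0])[:k_t]
        selected.extend(example for _, _, example in top)
    return selected                                        # the selected set U_s

def adversarial_self_learning(model, train_fn, dev_eval, labeled_set,
                              unlabeled_loader, k_t, num_classes, patience=2):
    """Outer loop: adversarial training on L, balanced selection, merge, repeat until the
    target-language validation accuracy has not improved for `patience` iterations."""
    best_acc, stale = 0.0, 0
    while stale < patience:
        train_fn(model, labeled_set)                       # adversarial training epochs on L
        labeled_set = labeled_set + select_balanced(model, unlabeled_loader, k_t, num_classes)
        acc = dev_eval(model)
        best_acc, stale = (acc, 0) if acc > best_acc else (best_acc, stale + 1)
    return model
```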
3 EXPERIMENTS

We evaluate our semi-supervised framework on two cross-lingual tasks, document classification and intent classification, to show the effectiveness of the combination of adversarial training and self-learning for Multilingual BERT-based cross-lingual transfer.

3.1 Experimental Setup

Datasets. For evaluation, we first rely on MLDoc [18], a balanced subset of the Reuters corpus covering 8 languages for document classification, with 1,000 training and validation documents and 4,000 test documents for each language. The 4-way topic labeling consists of Corporate, Economics, Government, and Market. For cross-lingual classification, 1,000 target language training documents serve as unlabeled data for self-learning.

We further evaluate our framework on cross-lingual intent classification from English to Spanish and Thai [17]. This dataset is built for multilingual task-oriented dialogue. It contains 57k annotated utterances in English (43k), Spanish (8.6k), and Thai (5k) with 12 different intents across the domains weather, alarms, and reminders. 3k Spanish or 2k Thai training utterances are used as unlabeled data for self-learning. All classification tasks are evaluated in terms of classification accuracy (ACC).

Model Details. We tune the hyper-parameters of our neural network architecture based on each non-English validation set. For the multilingual encoder, we invoke the Multilingual BERT model [4], which supports 104 languages (2018-11-23 version from https://github.com/google-research/bert/blob/master/multilingual.md). Most hyper-parameters are shown in Table 2, with the exception that lower-casing is omitted for Thai and ε is 10 in the Japanese experiment. We rely on early stopping as a termination criterion, specifically, when the performance on the validation set stops improving within 2 self-learning iterations.

Table 2: Hyper-parameters for our framework.
Parameter               MLDoc   CLIC
max. sequence length    96      32
batch size              64      128
learning rate           2e-5    2e-6
K_t                     50      30
# of training epochs    5       6
ε                       1.0     1.0

3.2 Results and Analysis

Cross-lingual Document Classification. Our MLDoc experiments compare our approach against several strong baselines. Schwenk et al. [18] propose MultiCCA, consisting of multilingual word embeddings and convolutional neural networks. Artetxe et al. [1] pretrain a massively multilingual sequence-to-sequence neural MT model, invoking its encoder as a multilingual text representation used for fine-tuning on downstream tasks. Keung et al. [10] apply language-adversarial learning to Multilingual BERT during fine-tuning with unlabeled data. We also consider Multilingual BERT itself, with self-learning and with adversarial training respectively, as baselines. Additionally, we compare against Multilingual BERT with self-learning using unlabeled data as investigated by Dong et al. [6].

As shown in Table 1, BERT with adversarial training outperforms Multilingual BERT across a range of languages, which establishes its merits for cross-lingual classification. Beyond this, our full framework further outperforms all baselines across the 7 target languages, including phylogenetically unrelated languages.

Table 1: Accuracy (in %) on MLDoc experiments. Bold denotes the best on cross-lingual transfer.
Approach                               en     de     zh     es     fr     it     ja     ru
In-language supervised learning
  Schwenk et al. (2018) [18]           92.2   93.7   87.3   94.5   92.1   85.6   85.4   85.7
  BERT (2018)                          94.2   93.3   89.3   95.7   93.4   88.0   88.4   87.5
Cross-lingual transfer, without unlabeled data
  Schwenk et al. (2018) [18]           92.2   81.2   74.7   72.5   72.4   69.4   67.6   60.8
  Artetxe et al. (2018) [1]            89.9   84.8   71.9   77.3   78.0   69.4   60.3   67.8
  BERT                                 93.0   75.0   71.3   78.3   77.8   68.5   71.8   76.6
  BERT + Random Perturbation           –      79.3   73.8   73.0   81.3   67.1   73.1   66.9
  BERT + Adv. Perturbation             –      82.2   82.0   81.5   83.7   72.9   75.6   78.8
Cross-lingual transfer, with unlabeled data
  Keung et al. (2019) [10]             –      88.1   84.7   80.8   85.7   72.3   76.8   77.4
  Dong et al. (2019) [6]               –      89.9   84.5   84.8   88.5   75.8   76.4   79.3
  Our Approach w/ Random Perturbation  –      90.5   83.7   86.8   88.3   76.1   78.1   80.9
  Our Approach w/ Adv. Perturbation    –      91.8   86.7   90.0   89.9   78.9   78.7   83.3

Cross-lingual Intent Classification. To evaluate the generalization of our framework to cross-lingual intent classification, we consider a diverse set of baselines, as listed in Table 3. Schuster et al. [17] propose to combine Multilingual CoVe [21] with an autoencoder objective and then use the encoder with a CRF model. We also run experiments on Multilingual BERT and observe that it does not outperform the method of Liu et al. [13], because the latter takes advantage of additional information by selecting 11 domain-related words as alignment seeds. However, our approach still achieves a new state-of-the-art result, which suggests that our adversarial framework for cross-lingual transfer is effective across different kinds of classification tasks.

Table 3: Accuracy (in %) on cross-lingual intent classification without using labeled non-English data.
Approach                        en      es      th
Schuster et al. (2018) [17]     99.11   53.89   70.70
Liu et al. (2019) [13]          –       90.20   73.43
BERT                            99.20   82.42   62.77
BERT + Adversarial Training     99.23   87.87   67.20
BERT + Self-Learning            –       88.33   71.51
Our Approach                    –       92.41   75.95

Influence of Adversarial Perturbations. To further evaluate the effect of adversarial perturbation and to show directly that it yields robustness with respect to divergence in the test set, we conduct an additional experiment on code-switching data. We create challenge datasets (publicly available at http://crosslingual.nlproc.org/) that adopt the original English training data, while the test data consists of English documents in which we attempt to replace all vocabulary words with non-English translations based on the bilingual English-to-non-English dictionaries from MUSE (https://github.com/facebookresearch/MUSE; we use a random but consistent choice in case there are multiple translations). As a result, the test set documents consist of a form of code-switched language, in which many words are non-English but the word order remains unchanged. The replacement rates are listed in Table 4, along with the experimental results. We observe that the baselines have a low accuracy when faced with such code-switching in the test set. This applies to Multilingual BERT without perturbation as well as Multilingual BERT with random perturbation. In contrast, our adversarial perturbation is significantly more effective than no or random perturbation when dealing with this data, and it does not impede the accuracy on English compared with random noise, thus improving both generalization and robustness.

Table 4: Accuracy (in %) on MLDoc English code-switching data. The respective ratios of replaced vocabulary words and of replaced word token occurrences in the English test set are given in parentheses.
                          en (0%/0%)    de (16%/52%)   zh (9%/32%)   es (16%/52%)
BERT                      93.0          83.5           87.5          87.4
+ Random Perturbation     92.4          86.7           87.9          88.4
+ Adv. Perturbation       94.4          88.8           91.3          90.4
                          fr (15%/50%)  it (14%/49%)   ja (8%/35%)   ru (12%/44%)
BERT                      86.4          87.8           85.5          84.0
+ Random Perturbation     87.8          89.1           86.3          85.5
+ Adv. Perturbation       90.6          91.9           88.6          89.2
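To illustrate how such a code-switched challenge set could be derived from a MUSE bilingual dictionary, a possible sketch is shown below. The file name, whitespace tokenization, and lower-case lookup are assumptions of this sketch; the random-but-consistent choice among multiple translations follows the description above.

```python
import random

def load_muse_dictionary(path, seed=0):
    """Read a MUSE bilingual dictionary (one 'english_word translation' pair per line).
    If a word has several translations, pick one at random but keep the choice fixed."""
    candidates = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                candidates.setdefault(parts[0], []).append(parts[1])
    rng = random.Random(seed)                 # fixed seed -> random but consistent choice
    return {src: rng.choice(tgts) for src, tgts in candidates.items()}

def code_switch(document, dictionary):
    """Replace every word that has a dictionary entry by its translation; word order is kept."""
    return " ".join(dictionary.get(tok.lower(), tok) for tok in document.split())

# Hypothetical usage on one English test document:
# en_de = load_muse_dictionary("en-de.txt")
# print(code_switch("the government announced new economic measures", en_de))
```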
4 RELATED WORK

Adversarial Training. There is substantial research on learning to resist adversarial perturbations with the goal of improving the robustness of a machine learning system [9, 14, 20]. In natural language processing, adversarial perturbation has proven effective for improving a model's generalization [2, 7, 15, 22]. Miyato et al. [15] adopt adversarial and virtual adversarial training for improved semi-supervised text classification in monolingual settings. Fu et al. [8] propose an Adversarial Bi-directional Sentence Embedding Mapping framework to learn cross-lingual mappings of sentence representations. Cheng et al. [2] improve a translation model with adversarial source examples. In our experiments, we show that our approach outperforms adversarial perturbation applied to Multilingual BERT.

Cross-lingual Representation Learning. While cross-lingual text classification has a long history [3, 5], recent work in this area has been inspired by the success of deep neural models such as BERT [4]. Multilingual extensions of such pretrained models include the multilingual version of BERT trained on the union of 104 different language editions of Wikipedia [4]. Artetxe et al. [1] show that cross-lingual sentence embeddings can be obtained based on the encoder from a pretrained sequence-to-sequence model. Lample et al. [11] incorporate parallel text into BERT's architecture by training on a new supervised learning objective. Our framework can flexibly be used in conjunction with any of these methods, since they provide a multilingual representation space shared across languages. Our experiments show that our approach yields substantially better results than relying on such models directly.

5 CONCLUSION

While multilingual encoders have enabled better cross-lingual learning, the obtained models often are not attuned to the subtle differences that a model may encounter when fed with documents in an entirely new language. To address this, this paper proposes an adversarial perturbation framework that makes the model more robust and enables an iterative self-learning process that allows the model to gradually adapt to the target language. We achieve new state-of-the-art results on cross-lingual document and intent classification and demonstrate that adversarial perturbation is an effective method for improving classification accuracy without any labeled training data in the target language.
REFERENCES

[1] Mikel Artetxe and Holger Schwenk. 2018. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. arXiv preprint arXiv:1812.10464.
[2] Yong Cheng, Lu Jiang, and Wolfgang Macherey. 2019. Robust Neural Machine Translation with Doubly Adversarial Inputs. arXiv preprint arXiv:1906.02443.
[3] Gerard de Melo and Stefan Siersdorfer. 2007. Multilingual Text Classification Using Ontologies. In Proceedings of ECIR 2007 (Roma, Italy). Springer.
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
[5] Xin Dong and Gerard de Melo. 2018. Cross-Lingual Propagation for Deep Sentiment Analysis. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018). AAAI Press, 5771–5778.
[6] Xin Dong and Gerard de Melo. 2019. A Robust Self-Learning Framework for Cross-Lingual Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6307–6311.
[7] Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. HotFlip: White-Box Adversarial Examples for Text Classification. arXiv preprint arXiv:1712.06751.
[8] Zuohui Fu, Yikun Xian, Shijie Geng, Yingqiang Ge, Yuting Wang, Xin Dong, Guang Wang, and Gerard de Melo. 2020. ABSent: Cross-Lingual Sentence Representation Mapping with Bidirectional GANs. arXiv preprint arXiv:2001.11121.
[9] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and Harnessing Adversarial Examples. arXiv preprint arXiv:1412.6572.
[10] Phillip Keung, Yichao Lu, and Vikas Bhardwaj. 2019. Adversarial Learning with Contextual Embeddings for Zero-Resource Cross-Lingual Classification and NER. arXiv preprint arXiv:1909.00153.
[11] Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. arXiv preprint arXiv:1901.07291.
[12] Jindřich Libovický, Rudolf Rosa, and Alexander Fraser. 2019. How Language-Neutral is Multilingual BERT? arXiv preprint arXiv:1911.03310.
[13] Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, Andrea Madotto, and Pascale Fung. 2019. Zero-shot Cross-lingual Dialogue Systems with Transferable Latent Variables. arXiv preprint arXiv:1911.04081.
[14] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv preprint arXiv:1706.06083.
[15] Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial Training Methods for Semi-Supervised Text Classification. arXiv preprint arXiv:1605.07725.
[16] Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How Multilingual is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4996–5001. https://doi.org/10.18653/v1/P19-1493
[17] Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2018. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. arXiv preprint arXiv:1810.13327.
[18] Holger Schwenk and Xian Li. 2018. A Corpus for Multilingual Document Classification in Eight Languages. arXiv preprint arXiv:1805.09821.
[19] Jasdeep Singh, Bryan McCann, Richard Socher, and Caiming Xiong. 2019. BERT is Not an Interlingua and the Bias of Tokenization. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP. 47–55.
[20] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing Properties of Neural Networks. arXiv preprint arXiv:1312.6199.
[21] Katherine Yu, Haoran Li, and Barlas Oguz. 2018. Multilingual seq2seq Training with Similarity Loss for Cross-Lingual Document Classification. In Proceedings of The Third Workshop on Representation Learning for NLP. 175–179.
[22] Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Thomas Goldstein, and Jingjing Liu. 2019. FreeLB: Enhanced Adversarial Training for Language Understanding. arXiv preprint arXiv:1909.11764.