Arxiv:2104.07908V1 [Cs.CL] 16 Apr 2021

MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning Mengzhou Xiax∗ Guoqing Zhengz Subhabrata Mukherjeez Milad Shokouhiz Graham Neubigy Ahmed Hassan Awadallahz xPrincenton University yCarnegie Mellon University zMicrosoft Research [email protected], {zheng, submukhe, milads}@microsoft.com [email protected], [email protected] Abstract Joint Training MetaXL en en tel tel The combination of multilingual pre-trained representations and cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low- resource languages. However, for extremely low-resource languages without large-scale monolingual corpora for pre-training or (a) (b) sufficient annotated data for fine-tuning, transfer learning remains an under-studied Figure 1: First two principal components of sequence and challenging task. Moreover, recent work representations (corresponding to [CLS] tokens) of shows that multilingual representations are Telugu and English examples from a jointly fine-tuned surprisingly disjoint across languages (Singh mBERT and a MetaXL model for the task of sentiment et al., 2019), bringing additional challenges analysis. MetaXL pushes the source (EN) and target for transfer onto extremely low-resource (TEL) representations closer to realize a more effective languages. In this paper, we propose MetaXL, transfer. The Hausdorff distance between the source a meta-learning based framework that learns and target representations drops from 0:57 to 0:20 with to transform representations judiciously from F1 score improvement from 74:07 to 78:15. auxiliary languages to a target one and brings their representation spaces closer for effective transfer. Extensive experiments on real-world Wikipedia and XLM-R (Conneau et al., 2020) is low-resource languages – without access pre-trained on 100 languages with CommonCrawl to large-scale monolingual corpora or large Corpora. However, these models still leave behind amounts of labeled data – for tasks like cross- more than 200 languages with few articles avail- lingual sentiment analysis and named entity able in Wikipedia, not to mention the 6; 700 or so recognition show the effectiveness of our ap- languages with no Wikipedia text at all (Artetxe proach. Code for MetaXL is publicly available at github.com/microsoft/MetaXL. et al., 2020). Cross-lingual transfer learning for these extremely low-resource languages is essen- 1 Introduction tial for better information access but under-studied in practice (Hirschberg and Manning, 2015). Re- Recent advances in multilingual pre-trained repre- cent work on cross-lingual transfer learning using arXiv:2104.07908v1 [cs.CL] 16 Apr 2021 sentations have enabled success on a wide range of pre-trained representations mainly focuses on trans- natural language processing (NLP) tasks for many ferring across languages that are already covered languages. However, these techniques may not by existing representations (Wu and Dredze, 2019). readily transfer onto extremely low-resource lan- In contrast, existing work on transferring to languages, where: (1) large-scale monolingual cor- guages without significant monolingual resources pora are not available for pre-training and (2) suf- tends to be more sparse and typically focuses on ficient labeled data is lacking for effective fine- specific tasks such as language modeling (Adams tuning for downstream tasks. For example, mul- et al., 2017) or entity linking (Zhou et al., 2019). tilingual BERT (mBERT) (Devlin et al., 2018) is Building NLP systems in these settings is chal- pre-trained on 104 languages with many articles on lenging for several reasons. First, a lack of suf- ∗Most of the work was done while the first author was an ficient annotated data in the target language pre- intern at Microsoft Research. vents effective fine-tuning. Second, multilingual pre-trained representations are not directly trans- languages 1 and then apply it to the target language. ferable due to language disparities. Though recent This is widely adopted in the zero-shot transfer work on cross-lingual transfer mitigates this chal- setup where no annotated data is available in the lenge, it still requires a sizeable monolingual cor- target language. The practice is still applicable in pus to train token embeddings (Artetxe et al., 2019). the few-shot setting, in which case a small amount As noted, these corpora are difficult to obtain for of annotated data in the target language is available. many languages (Artetxe et al., 2020). In this work, we focus on cross-lingual trans- Additionally, recent work (Singh et al., 2019) fer for extremely low-resource languages where shows that contextualized representations of dif- only a small amount of unlabeled data and task- ferent languages do not always reside in the same specific annotated data are available. That includes space but are rather partitioned into clusters in mul- languages that are not covered by multilingual lan- tilingual models. This representation gap between guage models like XLM-R (e.g., Maori or Turk- languages suggests that joint training with com- men), or low-resource languages that are covered bined multilingual data may lead to sub-optimal but with many orders of magnitude less data for transfer across languages. This problem is further pre-training (e.g., Telegu or Persian). We assume exacerbated by the, often large, lexical and syn- the only target-language resource we have access tactic differences between languages with existing to is a small amount of task-specific labeled data. pre-trained representations and the extremely low- More formally, given: (1) a limited amount of resource ones. Figure1(a) provides a visualization annotated task data in the target language, denoted (i) (i) of one such example of the disjoint representations as Dt = f(xt ; yt ); i 2 [1;N]g, (2) a larger of a resource-rich auxiliary language (English) and amount of annotated data from one or more source (j) (j) resource-scarce target language (Telugu). language(s), denoted as Ds = f(xs ; ys ); j 2 We propose a meta-learning based method, [1;M]g where M N and (3) a pre-trained MetaXL, to bridge this representation gap and al- model fθ, which is not necessarily trained on any low for effective cross-lingual transfer to extremely monolingual data from the target language – our low-resource languages. MetaXL learns to trans- goal is to adapt the model to maximize the perfor- form representations from auxiliary languages in a mance on the target language. way that maximally facilitates transfer to the target When some target language labeled data is avail- language. Concretely, our meta-learning objective able for fine-tuning, a standard practice is to jointly encourages transformations that increase the align- fine-tune (JT) the multilingual language model us- ment between the gradients of the source-language ing a concatenation of the labeled data from both set with those of a target-language set. Figure1(b) the source and target languages Ds and Dt. The shows that MetaXL successfully brings representa- representation gap (Singh et al., 2019) between the tions from seemingly distant languages closer for source language and target language in a jointly more effective transfer. trained model brings additional challenges, which We evaluate our method on two tasks: named motivates our proposed method. entity recognition (NER) and sentiment analysis 2.2 Representation Transformation (SA). Extensive experiments on 8 low-resource languages for NER and 2 low-resource languages for The key idea of our approach is to explicitly learn SA show that MetaXL significantly improves over to transform source language representations, such strong baselines by an average of 2.1 and 1.3 F1 that when training with these transformed repre- score with XLM-R as the multilingual encoder. sentations, the parameter updates benefit perfor- mance on the target language the most. On top of an existing multilingual pre-trained model, we 2 Meta Representation Transformation introduce an additional network, which we call the representation transformation network to model 2.1 Background and Problem Definition this transformation explicitly. The standard practice in cross-lingual transfer learn- The representation transformation network mod- g : d ! d d ing is to fine-tune a pre-trained multilingual lan- els a function φ R R , where is the di- guage model fθ parameterized by θ, (e.g. XLM-R 1We also refer to auxiliary languages as source languages and mBERT) with data from one or more auxiliary as opposed to target languages. training loss meta loss Repeat following steps until convergence: Transformer Transformer Layer Layer Forward pass for training loss Representatio ① nTransformatio n Network Backward pass from training loss (RTN) and XLM-R update Transformer ② Transformer (gradients dependency on RTN kept) Layer Layer Forward pass for meta loss Updated XLM-R ③ Embeddings (a function of RTN) Embeddings Backward pass from meta loss ④ and RTN update source data target data Figure 2: Overview of MetaXL. For illustration, only two Transformer layers are shown for XLM-R, and the representation transformation network is placed after the first Transformer layer. 1 source language data passes through the first Transformer layer, through the current representation transformation network, and finally through the remaining layers to compute a training loss with the corresponding source labels. 2 The training loss is back- propagated onto all parameters, but only parameters of

Load more