From English to Code-Switching: Transfer Learning with Strong Morphological Clues
Gustavo Aguilar and Thamar Solorio
Department of Computer Science, University of Houston
Houston, TX 77204-3010

Abstract

Linguistic code-switching (CS) is still an understudied phenomenon in natural language processing. The NLP community has mostly focused on monolingual and multilingual scenarios, but little attention has been given to CS in particular. This is partly because of the lack of resources and annotated data, despite its increasing occurrence in social media platforms. In this paper, we aim at adapting monolingual models to code-switched text in various tasks. Specifically, we transfer English knowledge from a pre-trained ELMo model to different code-switched language pairs (i.e., Nepali-English, Spanish-English, and Hindi-English) using the task of language identification. Our method, CS-ELMo, is an extension of ELMo with a simple yet effective position-aware attention mechanism inside its character convolutions. We show the effectiveness of this transfer learning step by outperforming multilingual BERT and homologous CS-unaware ELMo models, establishing a new state of the art in CS tasks such as NER and POS tagging. Our technique can be expanded to more English-paired code-switched languages, providing more resources to the CS community.

Hindi-English tweet
Original: Keep calm and keep kaam se kaam !!![other] #office #tgif #nametag #buddhane #SouvenirFromManali #keepcalm
English: Keep calm and mind your own business !!!

Nepali-English tweet
Original: Youtube[ne] ma live re ,[other] chalcha ki vanni aash garam ![other] Optimistic .[other]
English: They said Youtube live, let's hope it works! Optimistic.

Spanish-English tweet
Original: @MROlvera06 @T11gRe go too cavenders y tambien ve a @ElToroBoots
English: @MROlvera06 @T11gRe go to cavenders and also go to @ElToroBoots

Figure 1: Examples of code-switched tweets and their translations from the CS LID corpora for Hindi-English, Nepali-English and Spanish-English. The LID labels ne and other in subscripts refer to named entities and to punctuation, emojis or usernames, respectively (they are part of the LID tagset). English text appears in italics and other languages are underlined.

1 Introduction

Although linguistic code-switching (CS) is a common phenomenon among multilingual speakers, it is still considered an understudied area in natural language processing. The lack of annotated data, combined with the high diversity of languages in which this phenomenon can occur, makes it difficult to make progress on CS-related tasks. Even though CS is largely captured in social media platforms, it is still expensive to annotate a sufficient amount of data for many tasks and languages. Additionally, not all languages have the same incidence and predominance, making annotation impractical and expensive for every combination of languages. Nevertheless, code-switching often occurs in language pairs that include English (see examples in Figure 1). These aspects lead us to explore approaches where English pre-trained models can be leveraged and tailored to perform well in code-switching settings.

In this paper, we study the CS phenomenon using English as a starting language to adapt our models to multiple code-switched languages, namely Nepali-English, Hindi-English and Spanish-English. In the first part, we focus on the task of language identification (LID) at the token level, using ELMo (Peters et al., 2018) as our reference for English knowledge. Our hypothesis is that English pre-trained models should be able to recognize whether a word belongs to English or not when such models are fine-tuned on code-switched text. To accomplish that, we introduce CS-ELMo, an extended version of ELMo that contains a position-aware hierarchical attention mechanism over ELMo's character n-gram representations. These enhanced representations allow the model to see where particular n-grams occur within a word (e.g., affixes or lemmas) and to associate such behavior with one language or another.[1] With the help of this mechanism, our models consistently outperform the state of the art on LID for Nepali-English (Solorio et al., 2014), Spanish-English (Molina et al., 2016), and Hindi-English (Mave et al., 2018). Moreover, we conduct experiments that emphasize the importance of the position-aware hierarchical attention and the different effects it can have depending on how similar the code-switched languages are. In the second part, we demonstrate the effectiveness of our CS-ELMo models by further fine-tuning them on tasks such as NER and POS tagging. Specifically, we show that the resulting models significantly outperform multilingual BERT and their homologous ELMo models directly trained for NER and POS tagging. Our models establish a new state of the art for Hindi-English POS tagging (Singh et al., 2018) and Spanish-English NER (Aguilar et al., 2018).
Our contributions can be summarized as follows: 1) we use transfer learning from models trained on a high-resource language (i.e., English) and effectively adapt them to the code-switching setting for multiple language pairs on the task of language identification; 2) we show the effectiveness of transferring a model trained for LID to downstream code-switching NLP tasks, such as NER and POS tagging, establishing a new state of the art; 3) we provide empirical evidence on the importance of the enhanced character n-gram mechanism, which aligns with the intuition of strong morphological clues in the core of ELMo (i.e., its convolutional layers); and 4) our CS-ELMo model is self-contained, which allows us to release it for other researchers to explore and replicate this technique on other code-switched languages.[2]

[1] Note that there are more than two labels in the LID tagset, as explained in Section 4.
[2] http://github.com/RiTUAL-UH/cs_elmo

2 Related Work

Transfer learning has become more practical in recent years, making it possible to apply very large neural networks to tasks where annotated data is limited (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019). CS-related tasks are good candidates for such applications, since they are usually framed as low-resource problems. However, previous research on sequence labeling for code-switching mainly focused on traditional ML techniques because they performed better than deep learning models trained from scratch on limited data (Yirmibeşoğlu and Eryiğit, 2018; Al-Badrashiny and Diab, 2016). Nonetheless, some researchers have recently shown promising results by using pre-trained monolingual embeddings for tasks such as NER (Trivedi et al., 2018; Winata et al., 2018) and POS tagging (Soto and Hirschberg, 2018; Ball and Garrette, 2018). Other efforts include the use of multilingual sub-word embeddings like fastText (Bojanowski et al., 2017) for LID (Mave et al., 2018), and cross-lingual sentence embeddings for text classification like LASER (Schwenk, 2018; Schwenk and Li, 2018; Schwenk and Douze, 2017), which is capable of handling code-switched sentences. These results show the potential of pre-trained knowledge, and they motivate our efforts to further explore transfer learning in code-switching settings.

Our work is based on ELMo (Peters et al., 2018), a large pre-trained language model that has not been applied to CS tasks before. We also use attention (Bahdanau et al., 2015) within ELMo's convolutions to adapt it to code-switched text. Even though attention is an effective and successful mechanism in other NLP tasks, the code-switching literature barely covers this technique (Sitaram et al., 2019). Wang et al. (2018) use a different attention method for NER, based on a gated cell that learns to choose appropriate monolingual embeddings according to the input text. Recently, Winata et al. (2019) proposed multilingual meta embeddings (MME) combined with self-attention (Vaswani et al., 2017). Their method establishes a state of the art on Spanish-English NER by heavily relying on monolingual embeddings for every language in the code-switched text. Our model outperforms theirs by only fine-tuning a generic CS-aware model, without relying on task-specific designs. Another contribution of our work is position embeddings, which have not been considered for code-switching either. These embeddings, combined with CNNs, have proved useful in computer vision (Gehring et al., 2017); they help to localize non-spatial features extracted by convolutional networks within an image. We apply the same principle to code-switching: we argue that character n-grams without position information may not be enough for a model to learn the actual morphological aspects of the languages (e.g., affixes or lemmas). We empirically validate these aspects and discuss the impact of this mechanism in our experiments.
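To make the localization argument concrete, the toy PyTorch snippet below contrasts max-pooled character n-gram features, which discard where an n-gram fires within a word, with the same feature maps after position embeddings are added. This is only an illustrative sketch: the layer names and sizes are assumptions and do not correspond to ELMo's or CS-ELMo's actual configuration.

    import torch
    import torch.nn as nn

    # Toy contrast between position-agnostic and position-aware character n-gram features.
    char_emb = nn.Embedding(100, 16)                        # toy character vocabulary
    trigram_conv = nn.Conv1d(16, 32, kernel_size=3, padding=1)
    position_emb = nn.Embedding(50, 32)                     # one vector per character slot

    chars = torch.randint(0, 100, (1, 12))                  # a 12-character word
    x = char_emb(chars).transpose(1, 2)                     # (batch, emb_dim, chars) for Conv1d
    feature_maps = trigram_conv(x).transpose(1, 2)          # (batch, chars, filters)

    # Standard max-pooling keeps the strongest trigram activation per filter,
    # but the position of the match inside the word is lost.
    pooled = feature_maps.max(dim=1).values                 # (batch, filters)

    # Adding position embeddings keeps that information: the same trigram firing
    # at a prefix vs. a suffix now yields different feature vectors.
    positions = torch.arange(feature_maps.size(1))
    localized = feature_maps + position_emb(positions)      # (batch, chars, filters)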
3 Methodology

ELMo is a character-based language model that provides deep contextualized word representations (Peters et al., 2018). We choose ELMo for this study for the following reasons: 1) it has been [...]

[...] to adapt to new morphological patterns from other languages, as the model tends to discard patterns from languages other than English.

To address these aspects, we introduce CS-ELMo, an extension of ELMo that incorporates a position-aware hierarchical attention mechanism that enhances ELMo's character n-gram representations. This mechanism is composed of three elements: position embeddings, position-aware attention, and hierarchical attention. Figure 2A describes the overall model architecture, and Figure 2B details the components of the enhanced character n-gram mechanism.
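The formal definition of these elements falls outside this excerpt, but the following PyTorch sketch shows one plausible way the three pieces could fit together, assuming the per-position outputs of ELMo's character convolutions (one tensor per n-gram order, projected to a shared dimension) are available. All module, parameter, and size names are illustrative rather than the paper's actual implementation.

    import torch
    import torch.nn as nn

    class PositionAwareHierarchicalAttention(nn.Module):
        """Sketch: attend over per-position character n-gram features
        (one tensor per n-gram order), then attend over the n-gram orders."""

        def __init__(self, feature_dim: int, max_positions: int = 50):
            super().__init__()
            # Position embeddings mark where an n-gram occurs inside the word
            # (e.g., prefix vs. suffix).
            self.position_emb = nn.Embedding(max_positions, feature_dim)
            # Scoring layers for the two attention levels.
            self.position_scorer = nn.Linear(feature_dim, 1)
            self.order_scorer = nn.Linear(feature_dim, 1)

        @staticmethod
        def _attend(scores, values):
            # scores: (..., seq, 1), values: (..., seq, dim) -> weighted sum over seq.
            weights = torch.softmax(scores, dim=-2)
            return (weights * values).sum(dim=-2)

        def forward(self, ngram_features):
            # ngram_features: list of (batch, positions, dim) tensors, one per n-gram order.
            order_vectors = []
            for features in ngram_features:
                positions = torch.arange(features.size(1), device=features.device)
                enhanced = features + self.position_emb(positions)       # position-aware features
                order_vectors.append(self._attend(self.position_scorer(enhanced), enhanced))
            stacked = torch.stack(order_vectors, dim=1)                  # (batch, orders, dim)
            return self._attend(self.order_scorer(stacked), stacked)     # hierarchical attention

    # Toy usage: three n-gram orders over a batch of 2 words with 20 character positions.
    attention = PositionAwareHierarchicalAttention(feature_dim=64)
    ngram_maps = [torch.randn(2, 20, 64) for _ in range(3)]
    word_vectors = attention(ngram_maps)                                 # shape: (2, 64)

In this sketch, the position embeddings make the same n-gram look different at a prefix than at a suffix, the first attention step summarizes each n-gram order into a single vector, and the second attention step weighs the orders against each other, which is the hierarchical part; how CS-ELMo combines such a summary with ELMo's original word representation is not covered by this excerpt.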