From English to Code-Switching: Transfer Learning with Strong Morphological Clues Gustavo Aguilar and Thamar Solorio Department of Computer Science University of Houston Houston, TX 77204-3010 fgaguilaralas,
[email protected] Abstract Hindi-English Tweet Original: Keep calm and keep kaam se kaam !!!other #office Linguistic Code-switching (CS) is still an un- #tgif #nametag #buddhane #SouvenirFromManali #keepcalm derstudied phenomenon in natural language English: Keep calm and mind your own business !!! processing. The NLP community has mostly focused on monolingual and multi-lingual sce- Nepali-English Tweet narios, but little attention has been given to CS Original: Youtubene ma live re ,other chalcha ki vanni aash in particular. This is partly because of the lack garam !other Optimistic .other of resources and annotated data, despite its in- English: They said Youtube live, let’s hope it works! Optimistic. creasing occurrence in social media platforms. Spanish-English Tweet In this paper, we aim at adapting monolin- Original: @MROlvera06 @T11gRe go too gual models to code-switched text in various other other cavenders y tambien ve a @ElToroBoots tasks. Specifically, we transfer English knowl- ne ne other English: @MROlvera06 @T11gRe go to cavenders and edge from a pre-trained ELMo model to differ- also go to @ElToroBoots ent code-switched language pairs (i.e., Nepali- English, Spanish-English, and Hindi-English) Figure 1: Examples of code-switched tweets and using the task of language identification. Our their translations from the CS LID corpora for Hindi- method, CS-ELMo, is an extension of ELMo English, Nepali-English and Spanish-English. The LID with a simple yet effective position-aware at- labels ne and other in subscripts refer to named enti- tention mechanism inside its character convo- ties and punctuation, emojis or usernames, respectively lutions.