Joint Approach to Deromanization of Code-Mixed Texts
Rashed Rubby Riyadh and Grzegorz Kondrak
Department of Computing Science
University of Alberta, Edmonton, Canada

Proceedings of VarDial, pages 26–34, Minneapolis, MN, June 7, 2019. © 2019 Association for Computational Linguistics

Abstract

The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.

1 Introduction

Ad-hoc romanization is the practice of using the Roman script to express messages in languages that have their own native scripts (Figure 1). The phenomenon is observed in informal settings, such as social media, and is due to either the unavailability of a native-script keyboard or the writer's preference for using a Roman keyboard. Rather than following any predefined inter-script mappings, romanized texts typically constitute an idiosyncratic mixture of phonetic spelling, ad-hoc transliterations, and abbreviations. A great deal of information is lost in the romanization process due to the difficulty of representing native phonological distinctions in the Roman script. This makes deromanization of such messages a challenging task (Irvine et al., 2012).

(a) tomake to decent mone hoyechilo
(b) B B E B B
(c) তোমাকে তো decent মনে হয়েছিল
(d) "you" "like" "decent" "in mind" "was"
(e) "you seemed a decent person"

Figure 1: An example Bengali sentence that involves both romanization and code-mixing: (a) original message; (b) implied language tags; (c) target deromanization; (d) word-level translation; (e) sentence-level translation.

Another phenomenon that further complicates the task of deromanization is code-mixing, which occurs when words from another language (typically English) are introduced in the messages (e.g., the word decent in Figure 1). Code-mixing is particularly common in multi-lingual areas such as South Asia (Bali et al., 2014). In many cases, the English words have no transliterated equivalents in the native language and script.

In this paper, we address the task of deromanization of code-mixed texts. This normalization process is necessary in order to take advantage of NLP resources and tools that are developed and trained on text corpora written in the standard form of the language, which in turn can facilitate tasks such as sentiment analysis and opinion mining in social media. In addition, web-search queries are often expressed in a romanized form by speakers of languages that use non-Latin scripts, such as Arabic, Greek, and Hindi (Gupta et al., 2014b).

The task of deromanization of code-mixed texts is related to the study of language variation. Ad-hoc romanization represents a language variety, which resembles the usage of multiple scripts in some languages (e.g., Tajik). Code-mixing can also be considered a language variety, which exhibits similarities to dialects whose lexicons are strongly influenced by a different language (e.g., Upper Silesian).

The individual sub-tasks of deromanization of code-mixed texts have been investigated in prior work, but we are the first to incorporate them in a single system. Workshops and shared tasks have been devoted to code-mixing, including the problem of word-level language identification (Chittaranjan et al., 2014; Choudhury et al., 2014). Transliteration and back-transliteration are well-understood problems, which have also been the topic of several shared tasks (Duan et al., 2016; Chen et al., 2018). However, unlike romanization, transliteration is focused on names rather than dictionary words, and is usually performed without considering the context of the word in a sentence. Finally, a number of papers address the deromanization of social media contents and informal texts, but propose no effective way of handling the code-mixing issue (Irvine et al., 2012; May et al., 2014). We show that this limitation leads to sub-optimal performance on deromanization.

In this paper, we propose a novel approach for tackling the problem of romanization and code-mixing together in a single system. Since sufficiently large annotated data sets for training an end-to-end approach are not available, we combine supervised models for the three main components of the complete task: (a) word-level language identification, (b) back-transliteration, and (c) word sequence prediction. These modules involve several diverse techniques, including neural networks, character-level and word-level language models, discriminative transduction, joint n-grams, and HMMs. We perform experiments on three datasets that represent two languages, including a new dataset that we have collected and annotated ourselves. The results show that our system is substantially more accurate than Google Translate, which is the only publicly available tool that can be applied to this task.

Our main contributions are: (1) a novel approach to deromanization of code-mixed texts through the combination of word-level language identification, back-transliteration, and sequence prediction; (2) a system that establishes the state of the art on the task; and (3) an annotated dataset of romanized Bengali messages. We make our code and data publicly available.¹

¹ https://github.com/x3r/deromanization

2 Related Work

The tasks of deromanization and word-level language identification have been considered separately in the majority of the previous work.

2.1 Language Identification

While the identification of the language of a monolingual document is a well-studied problem, the task of word-level language identification has also garnered a fair amount of attention recently. A number of different approaches have been proposed for the task. Among the unsupervised approaches, dictionary-based and statistical language modeling approaches are the most common. Conditional Random Fields (CRFs) and Support Vector Machines (SVMs) are among the most used supervised approaches.

The unsupervised approaches require no word-level annotation of code-mixed texts, but generally achieve low accuracy. Dictionary-based approaches make use of words and their frequencies in wordlists to determine the origin of a token (Barman et al., 2014; Das and Gambäck, 2014; Verulkar et al., 2015). However, those approaches cannot handle spelling variations and non-standard romanizations in code-mixed data.

Statistical language modeling approaches employ n-gram probabilities which are derived from monolingual corpora. Both word and character n-grams have been used in the literature. The approach of Yu et al. (2013), which determines the probability of the next word being a code-switched word based on the previous n words, achieves only 53% accuracy on the Sinica (Mandarin-Taiwanese) corpus. A character n-gram based approach of Das and Gambäck (2014) achieves approximately 70% accuracy when tested on Bengali and Hindi.

The supervised approaches for language identification generally employ hand-crafted features such as capitalization information, character n-grams, and lexicon presence. CRFs make use of a set of features to determine the most probable language labels for a token sequence (King and Abney, 2013; Chittaranjan et al., 2014; Barman et al., 2014), and achieve accuracy in the low 90% range on the evaluated languages. SVMs are also commonly employed for language classification (Barman et al., 2014) and achieve consistent performance (low 90% range) on Bengali and Hindi. King and Abney (2013) employ Hidden Markov Models (HMMs) trained using the Expectation Maximization (EM) algorithm for the task, which can perform on par with the CRFs. Finally, supervised approaches that use contextual features generally outperform approaches that cannot utilize them.

2.2 Deromanization

Though a number of papers address the deromanization of social media contents and informal texts, they propose no effective way of handling the code-mixing issue.

Short Message Service (SMS) is a potential source of romanized texts due to the difficulty of typing on a native-script keyboard. A supervised deromanization approach of Irvine et al. (2012) uses an HMM to combine the candidates derived from a character-level transliteration model and a dictionary derived from automatically aligned words. The approach achieves 51% word-level accuracy on a self-annotated corpus of informal Urdu text messages.

Chakma and Das (2014) employ several supervised approaches for the automatic transliteration of code-mixed social media texts, which are based on joint source channel (JSC) and International Phonetic Alphabet (IPA). The experiments on Bengali-English and Hindi-English social media datasets show that the IPA-based approach out- [...]

[...] ments, and incorporates several weight pushing approaches for fast and memory-efficient decoding. The experiments are conducted on manually annotated Hindi and Tamil datasets and achieve 84% and 78% word-level accuracy, respectively. The approach was launched in the Google Gboard keyboard for 22 South Asian languages.

3 Methods

In this section, we present our approach for converting romanized code-mixed texts to their native scripts. It consists of three main components: language identification (Section 3.1), back-transliteration (Section 3.2), and sequence prediction (Section 3.3).

3.1 Language Identification

We approach language identification as a sequence labeling task, in which a sequence of word tokens in a code-mixed text is transformed into a sequence of binary language tags (cf. Figure 1b).
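
The input and output of this sequence labeling formulation can be illustrated with a minimal sketch. It is not the model used in this paper: it is a toy lexicon-lookup tagger of the kind surveyed in Section 2.1. The function name tag_tokens, the tiny English lexicons, and the reading of B as a Bengali token and E as an English token are assumptions made for illustration only.

```python
from typing import List, Set


def tag_tokens(tokens: List[str], english_lexicon: Set[str]) -> List[str]:
    """Assign one binary language tag per token by isolated lexicon lookup.

    'E' marks a token found in the (hypothetical) English lexicon,
    'B' marks every other token, which is assumed to be romanized Bengali.
    """
    return ["E" if tok.lower() in english_lexicon else "B" for tok in tokens]


# The romanized, code-mixed message from Figure 1(a).
tokens = "tomake to decent mone hoyechilo".split()

# Toy lexicon that happens to exclude the ambiguous token "to".
small_lexicon = {"decent"}
print(tag_tokens(tokens, small_lexicon))

# A more complete English lexicon would also contain "to".
larger_lexicon = {"decent", "to"}
print(tag_tokens(tokens, larger_lexicon))
```

With the small lexicon the tags match Figure 1(b). Once the lexicon also contains the English word "to", the Bengali token "to" is mislabeled as English, which illustrates why a labeler that considers the surrounding tokens is preferable to isolated lookup.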