IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words

Haryo Akbarianto Wibowo∗, Made Nindyatama Nityasya∗, Afra Feyza Akyürek⋆, Suci Fitriany∗, Alham Fikri Aji∗, Radityo Eko Prasojo∗•, Derry Tanti Wijaya⋆

∗ Kata.ai Research Team, Jakarta, Indonesia
⋆ Department of Computer Science, Boston University
• Faculty of Computer Science, Universitas Indonesia
fharyo,made,suci,aji,[email protected]

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3170–3183, August 1–6, 2021. ©2021 Association for Computational Linguistics

Abstract

The Indonesian language is heavily riddled with colloquialism, whether in written or spoken forms. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and provide baselines for future work. As IndoCollex catalogues word formation phenomena that are also present in the non-standard text of other languages, it can also provide an attractive testbed for methods tailored for cross-lingual word normalization and non-standard word formation.

1 Introduction

The Indonesian language is one of the most widely spoken languages in the world, with around 200 million speakers. Despite its large number of speakers, Indonesian is not very well represented in terms of NLP resources (Joshi et al., 2020). Most of its data are in the form of unlabeled web and user-generated content on online platforms such as social media, which are noisy and riddled with colloquialism that poses difficulties for NLP systems (Baldwin et al., 2013a; Eisenstein, 2013a).

Traditionally, the majority of Indonesian colloquial or informal lexicons are words borrowed from foreign or local dialect words, sometimes with phonetic and lexical modifications.¹ Increasingly, however, Indonesian colloquial words are more commonly a morphological transformation² of their standard counterparts.³ Despite these evolving lexicons, existing research on Indonesian word normalization has largely (1) relied on creating static informal dictionaries (Le et al., 2016), rendering normalization of unseen words impossible, and (2) focused on the specific tasks of sentiment analysis (Le et al., 2016) or machine translation (Guntara et al., 2020), with no direct implication for word normalization in general. Given the obvious utility of creating NLP systems that can normalize Indonesian informal data, we believe that the bottleneck is that there is no standard open testbed for researchers and developers of such systems to test the effectiveness of their models on these colloquial words.

In this paper, we introduce IndoCollex, a new, realistic dataset aimed at testing normalization models on these phenomena. IndoCollex is a professionally annotated dataset, where each informal word is paired with its standard form and expertly annotated with its word formation type. The words are sampled from Twitter across different regions and therefore contain naturally occurring Indonesian colloquial words.

We benchmark character-level sequence-to-sequence transduction with LSTM (Deutsch et al., 2018; Cotterell et al., 2018) and Transformer (Vaswani et al., 2017) architectures, as well as a rule-based approach (Eskander et al., 2013; Moeljadi et al., 2019), on our data to understand their success and failure cases (§7.2, §7.3) and to provide baselines for future work. We also test a data augmentation method from machine translation (back-translation), which to the best of our knowledge has never been applied to character-level morphological transformation, and observe that adding back-translated data to train the Transformer improves its performance in normalizing informal words. We also test models in the other direction: generating informal words from formal words, which can be useful for generating possible lexical replacements for standard text (Belinkov and Bisk, 2018).

¹ For example, gue, a common informal form of aku (‘I’, ‘me’), is a word that originates from the Betawi dialect.
² We use the term morphological transformations broadly here to include word form changes at the respective interfaces of grammar (phonology, syntax, and semantics), following the definition by Trips (2017).
³ For example, laper, a common informal form of lapar (‘hungry’), is a phonetic change from its standard form.

2 Related Work

With the advent of social media and other user-generated content on the web, non-standard text such as informal language, colloquialism, and slang has become more prevalent. Concurrently, the rise of technologies like unsupervised language modeling has opened up a new avenue for low-resource languages that lack annotated data for supervision. These systems typically require only large amounts of unlabeled text to train (Lample and Conneau, 2019; Brown et al., 2020). However, even when NLP systems require only unlabeled data to train, the varying degrees of formalism between different sources of monolingual data pose domain adaptation challenges to NLP systems that are trained on one source (e.g., Wikipedia) and must transfer to another (e.g., social media) (Eisenstein, 2013b; Baldwin et al., 2013b; Belinkov and Bisk, 2018; Pei et al., 2019). Worse yet, for an overwhelming majority of lower-resource languages, unstructured and unlabeled text on the Internet is often the sole source of data to train NLP systems (Joshi et al., 2020). Therefore, addressing the formalism discrepancy will augment the types of web texts that can be employed in language technologies, especially for languages such as Indonesian, which are subject to a high degree of informalism, as will be discussed.

While this motivates research on training systems that are robust to non-standard data (Michel and Neubig, 2018; Belinkov and Bisk, 2018; Tan et al., 2020b,a), one intuitive direction is to normalize colloquial language use. Most of the work on colloquial language normalization has been done at the sentence level: for colloquial English (Han et al., 2013; Lourentzou et al., 2019), Spanish (Cerón-Guzmán and León-Guzmán, 2016), Italian (Weber and Zhekova, 2016), Vietnamese (Nguyen et al., 2015), and Indonesian (Barik et al., 2019; Wibowo et al., 2020). However, research on the linguistic phenomena of non-standard text (Mattiello, 2005), which argues that slang words exhibit extra-grammatical morphological properties (such as portmanteaus and clipping) that distinguish them from the standard form, justifies the need for word-level normalization.

Word-level normalization also has its merit because, due to its much smaller hypothesis space, models can be trained using a significantly smaller amount of data (e.g., compare SIGMORPHON's 10k examples to WMT's 10^6 in the high-resource setting). Further, from our manual analysis of the top-10k most frequent Indonesian informal words we collected from Twitter, we find that around 95% of these words do not require context to normalize. Additionally, previous works such as Kulkarni and Wang (2018) have suggested that creating computational models for this generation of informal words can give us insights into the generative process of word formation in non-standard language. This is important because studies of the generative processes of word formation in non-standard text can deepen our understanding of non-standard text. Moreover, they are potentially applicable to many languages, since word formation patterns are shared across languages (Štekauer et al., 2012); e.g., portmanteaus (such as brexit) have been found not only in English but also in many other languages such as Indonesian (Dardjowidjojo, 1979), Modern Hebrew (Bat-El, 1996), and Spanish (Piñeros, 2004). Finally, the studies may have broader applications, including the development of rich conversational agents and tools like brand name generators and headlines (Özbal and Strapparava, 2012).

Previous work that qualitatively catalogues or creates computational models for informal word formations such as shortening has mostly been in English, using LSTMs (Gangal et al., 2017; Kulkarni and Wang, 2018) or finite state machines (Deri and Knight, 2015) to generate informal words given the standard forms and the type of word formation. Most of the datasets used to train these models (formal-informal word pairs labeled with their word formation) are also in English. Other dictionaries of informal English words include SlangNet (Dhuliawala et al., 2016), SlangSD (Wu et al., 2018), and SLANGZY (Pei et al., 2019). There is also a dataset that contains pairs of formal-informal Indonesian words (Salsabila et al., 2018), but they are not annotated with word formation mechanisms. To the best of our knowledge, ours is the first dataset of formal-informal lexicon in a language other than English that is annotated with word formation types.
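To make the notion of formal-informal word pairs labeled with a word formation type concrete, the following is a minimal sketch of how one such entry might be turned into character-level source and target sequences for a sequence-to-sequence model. It is an illustration under stated assumptions, not the paper's actual data format or preprocessing: the class `LexiconEntry`, the function `to_char_sequences`, the tag string `phonetic_change`, and the choice to prepend the tag as a single source token are all hypothetical. The word pair laper/lapar is the example given in the footnotes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LexiconEntry:
    """One annotated pair (hypothetical schema, not the released file format)."""
    informal: str
    formal: str
    formation: str  # word formation tag; the tag names used here are assumptions


def to_char_sequences(
    entry: LexiconEntry,
    direction: str = "normalize",
    include_tag: bool = True,
) -> Tuple[List[str], List[str]]:
    """Split a pair into character tokens for a character-level seq2seq model.

    direction="normalize": informal -> formal
    direction="generate":  formal -> informal
    """
    if direction == "normalize":
        source, target = entry.informal, entry.formal
    elif direction == "generate":
        source, target = entry.formal, entry.informal
    else:
        raise ValueError(f"unknown direction: {direction!r}")

    source_tokens = list(source)
    if include_tag:
        # Assumption: the formation type is given to the model as one extra token.
        source_tokens = [f"<{entry.formation}>"] + source_tokens
    return source_tokens, list(target)


entry = LexiconEntry(informal="laper", formal="lapar", formation="phonetic_change")
print(to_char_sequences(entry))
print(to_char_sequences(entry, direction="generate"))
```

Because each example is a short character sequence rather than a full sentence, the output space is far smaller than in sentence-level normalization, which is the intuition behind the data-efficiency argument above.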
3 Indonesian Colloquialism

3.1 Indonesian Colloquial Words

Language evolves over time due to the process of language learning across generations, contact

mana aku tahu (a jargon for ‘how should I know?’). Some of the above transformations are also found in the literature of other languages, such as En-