IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words

Haryo Akbarianto Wibowo∗, Made Nindyatama Nityasya∗, Afra Feyza Akyürek⋆, Suci Fitriany∗, Alham Fikri Aji∗, Radityo Eko Prasojo∗•, Derry Tanti Wijaya⋆

∗ Kata.ai Research Team, Jakarta, Indonesia
⋆ Department of Computer Science, Boston University
• Faculty of Computer Science, Universitas Indonesia

{haryo,made,suci,aji,[email protected]

Abstract

The Indonesian language is heavily riddled with colloquialism, whether in written or spoken form. In this paper, we identify a class of Indonesian colloquial words that have undergone morphological transformations from their standard forms, categorize their word formations, and propose a benchmark dataset of Indonesian Colloquial Lexicons (IndoCollex) consisting of informal words on Twitter expertly annotated with their standard forms and their word formation types/tags. We evaluate several models for character-level transduction to perform morphological word normalization on this testbed to understand their failure cases and provide baselines for future

more commonly a morphological transformation² of their standard counterparts.³ Despite these evolving lexicons, existing research on Indonesian word normalization has largely (1) relied on creating static informal dictionaries (Le et al., 2016), rendering normalization of unseen words impossible, and (2) focused on the specific tasks of sentiment analysis (Le et al., 2016) or machine translation (Guntara et al., 2020), with no direct implication for word normalization in general. Given the obvious utility of creating NLP systems that can normalize Indonesian informal data, we believe that the bottleneck is that there is no standard open testbed for researchers and developers of such systems to test the effectiveness of their models on these colloquial words.
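To make the limitation of static informal dictionaries concrete, the following minimal Python sketch shows a lookup-based normalizer that returns nothing for any word it has not seen. The lexicon entries, the function names `normalize_with_dictionary` and `coverage`, and the test words are illustrative assumptions, not data or code from IndoCollex or from the cited work.

```python
from typing import Dict, List, Optional

# Hypothetical toy lexicon mapping informal forms to standard forms.
# These pairs are illustrative assumptions, not entries from IndoCollex.
INFORMAL_TO_STANDARD: Dict[str, str] = {
    "yg": "yang",
    "dgn": "dengan",
    "tdk": "tidak",
}


def normalize_with_dictionary(word: str) -> Optional[str]:
    """Return the standard form if the word is listed, otherwise None."""
    return INFORMAL_TO_STANDARD.get(word.lower())


def coverage(words: List[str]) -> float:
    """Fraction of input words that the static dictionary can normalize."""
    if not words:
        return 0.0
    hits = sum(normalize_with_dictionary(w) is not None for w in words)
    return hits / len(words)


if __name__ == "__main__":
    # "dg" is a plausible unseen variant; the lookup cannot generalize to it.
    for w in ["yg", "dgn", "dg"]:
        print(w, "->", normalize_with_dictionary(w))
    print("coverage:", coverage(["yg", "dgn", "dg"]))
```

A lookup of this kind fails on any variant missing from the table, however close it is to a listed entry, which is the gap that a character-level transduction model is meant to close.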