Code Mixing: A Challenge for Language Identification in the Language of Social Media

Utsab Barman, Amitava Das†, Joachim Wagner and Jennifer Foster
CNGL Centre for Global Intelligent Content, National Centre for Language Technology
School of Computing, Dublin City University, Dublin, Ireland
†Department of Computer Science and Engineering, University of North Texas, Denton, Texas, USA
{ubarman,jwagner,jfoster}@computing.dcu.ie, [email protected]

Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 13–23, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.

1 Introduction

Automatic processing and understanding of Social Media Content (SMC) is currently attracting much attention from the Natural Language Processing research community. Although English is still by far the most popular language in SMC, its dominance is receding. Hong et al. (2011), for example, applied an automatic language detection algorithm to over 62 million tweets to identify the top 10 most popular languages on Twitter. They found that only half of the tweets were in English. Moreover, mixing multiple languages together (code mixing) is a popular trend among social media users from language-dense areas (Cárdenas-Claros and Isharyanti, 2009; Shafie and Nayan, 2013). In a scenario where speakers switch between languages within a conversation, sentence or even word, the task of automatic language identification becomes increasingly important to facilitate further processing.

Speakers whose first language uses a non-Roman alphabet often write using the Roman alphabet for convenience (phonetic typing), which increases the likelihood of code mixing with a Roman-alphabet language. This can be especially observed in South-East Asia and in the Indian subcontinent. The following is a code mixing comment taken from a Facebook group of Indian university students:

Original: Yaar tu to, GOD hain. tui JU te ki korchis? Hail u man!
Translation: Buddy you are GOD. What are you doing in JU? Hail u man!

This comment is written in three languages: English, Hindi (italics), and Bengali (boldface). For Bengali and Hindi, phonetic typing has been used.

We follow in the footsteps of recent work on language identification for SMC (Hughes et al., 2006; Baldwin and Lui, 2010; Bergsma et al., 2012), focusing specifically on the problem of word-level language identification for code-mixed SMC. Our corpus for this task is collected from Facebook and contains instances of Bengali (BN)-English (EN)-Hindi (HI) code mixing.

The paper is organized as follows: in Section 2, we review related research in the area of code mixing and language identification; in Section 3, we describe our code mixing corpus, the data itself and the annotation process; in Section 4, we list the tools and resources which we use in our language identification experiments, described in Section 5. Finally, in Section 6, we conclude and provide suggestions for future research on this topic.

2 Background and Related Work

The problem of language identification has been investigated for half a century (Gold, 1967) and that of computational analysis of code switching for several decades (Joshi, 1982), but there has been less work on automatic language identification for multilingual code-mixed texts. Before turning to that topic, we first briefly survey studies on the general characteristics of code mixing.

Code mixing is a normal, natural product of bilingual and multilingual language use. Significant studies of the phenomenon can be found in the linguistics literature (Milroy and Muysken, 1995; Alex, 2008; Auer, 2013). These works mainly discuss the sociological and conversational necessities behind code mixing as well as its linguistic nature. Scholars distinguish between inter-sentence, intra-sentence and intra-word code mixing.

Several researchers have investigated the reasons for and the types of code mixing. Initial studies on Chinese-English code mixing in Hong Kong (Li, 2000) and Macao (San, 2009) indicated that mainly linguistic motivations were triggering the code mixing in those highly bilingual societies. Hidayat (2012) showed that Facebook users tend to favour inter-sentential switching over intra-sentential, and reported that 45% of the switching was instigated by real lexical needs, 40% was used for talking about a particular topic, and 5% for content clarification. The predominance of inter-sentential code mixing in social media text was also noted in the study by San (2009), which compared the mixing in blog posts to that in the spoken language of Macao. Dewaele (2010) claims that 'strong emotional arousal' increases the frequency of code mixing. Dey and Fung (2014) present a speech corpus of English-Hindi code mixing in student interviews and analyse the motivations for code mixing and the grammatical contexts in which it occurs.

Turning to the work on automatic analysis of code mixing, there have been some studies on detecting code mixing in speech (Solorio and Liu, 2008a; Weiner et al., 2012). Solorio and Liu (2008b) try to predict the points inside a set of spoken Spanish-English sentences where the speakers switch between the two languages. Other studies have looked at code mixing in different types of short texts, such as information retrieval queries (Gottron and Lipka, 2010) and SMS messages (Farrugia, 2004; Rosner and Farrugia, 2007). Yamaguchi and Tanaka-Ishii (2012) perform language identification using artificial multilingual data, created by randomly sampling text segments from monolingual documents. King and Abney (2013) used weakly semi-supervised methods to perform word-level language identification on a dataset of 30 languages. They explore several language identification approaches, including a Naive Bayes classifier for individual word-level classification and sequence labelling with Conditional Random Fields trained with Generalized Expectation criteria (Mann and McCallum, 2008; Mann and McCallum, 2010), which achieved the highest scores. Another very recent work on this topic is that of Nguyen and Doğruöz (2013), who report on language identification experiments performed on Turkish and Dutch forum data. Their experiments were carried out using language models, dictionaries, logistic regression classification and Conditional Random Fields. They find that language models are more robust than dictionaries and that contextual information is helpful for the task.

3 Corpus Acquisition

Taking into account the claim that code mixing is frequent among speakers who are multilingual and younger in age (Cárdenas-Claros and Isharyanti, 2009), we choose an Indian student community in the 20-30 year age group as our data source. India is a country with 30 spoken languages, among which 22 are official. Code mixing is very frequent in the Indian subcontinent because languages change within very short geographical distances and people generally have a basic knowledge of their neighbouring languages.

A Facebook group¹ and 11 Facebook users (known to the authors) were selected to obtain publicly available posts and comments. The Facebook Graph API Explorer was used for data collection. Since these Facebook users are from West Bengal, the most dominant language is Bengali (native language), followed by English and then Hindi (the national language of India). The posts and comments in Bengali and Hindi script were discarded during data collection, resulting in 2335 posts and 9813 comments.

¹https://www.facebook.com/jumatrimonial

3.1 Annotation

Four annotators took part in the annotation task. Three were computer science students and the other was one of the authors. The annotators are proficient in all three languages of our corpus. A simple annotation tool was developed which enabled these annotators to identify and distinguish the different languages present in the content by tagging them. Annotators were supplied with 4 basic tags (viz. sentence, fragment, inclusion and wlcm (word-level code mixing)) to annotate differ-

2. Bengali-Sentence:
[sent-lang="bn"] shubho nabo borsho.. :) [/sent]

3. Hindi-Sentence:
[sent-lang="hi"] karwa sachh ..... :( [/sent]

4. Mixed-Sentence:
[sent-lang="mixd"][frag-lang="hi"] oye hoye ..... angreji me kahte hai ke [/frag] [frag-lang="en"] I love u.. !!! [/frag][/sent]

5. Univ-Sentence:
[sent-lang="univ"] hahahahahahah....!!!!! [/sent]

6. Undef-Sentence:
[sent-lang="undef"] Hablando de una triple amenaza. [/sent]

Fragment (frag): This
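The bracketed markup in the annotated examples is regular enough to be machine-read. As an illustration only (this sketch and its function names are ours, not the authors' annotation tool), the sentence- and fragment-level tags can be recovered with two regular expressions:

```python
import re

# Patterns for the tag syntax shown in the examples:
# [sent-lang="bn"] ... [/sent], optionally containing [frag-lang="hi"] ... [/frag]
SENT_RE = re.compile(r'\[sent-lang="(?P<lang>\w+)"\](?P<body>.*?)\[/sent\]', re.S)
FRAG_RE = re.compile(r'\[frag-lang="(?P<lang>\w+)"\](?P<body>.*?)\[/frag\]', re.S)

def parse_annotation(text):
    """Return a list of (sentence_lang, [(fragment_lang, fragment_text), ...]).

    A sentence without explicit fragments yields one implicit fragment
    that inherits the sentence-level language tag.
    """
    sentences = []
    for s in SENT_RE.finditer(text):
        frags = [(f.group("lang"), f.group("body").strip())
                 for f in FRAG_RE.finditer(s.group("body"))]
        if not frags:  # monolingual sentence: one implicit fragment
            frags = [(s.group("lang"), s.group("body").strip())]
        sentences.append((s.group("lang"), frags))
    return sentences

example = ('[sent-lang="mixd"][frag-lang="hi"] oye hoye ..... angreji me '
           'kahte hai ke [/frag][frag-lang="en"] I love u.. !!! [/frag][/sent]')
print(parse_annotation(example))
```

Parsing the mixed-sentence example above yields one "mixd" sentence containing a Hindi fragment followed by an English fragment.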
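The unsupervised dictionary-based baseline mentioned in the abstract can be pictured with a toy sketch. The word lists and the tie-breaking order below are invented for illustration and are not the dictionaries or decision rules used in the paper:

```python
# Toy dictionary-based word-level language identification.
# DICTIONARIES and PRIORITY are illustrative placeholders only.
DICTIONARIES = {
    "en": {"i", "love", "u", "hail", "man", "god"},
    "hi": {"yaar", "tu", "to", "hain", "kahte", "hai", "ke"},
    "bn": {"tui", "te", "ki", "korchis", "shubho", "nabo", "borsho"},
}
# Tie-breaking order when a word appears in several (or no) dictionaries;
# here the corpus's most dominant language is tried first.
PRIORITY = ["bn", "en", "hi"]

def label_words(tokens):
    """Label each token with the first language whose dictionary contains it,
    falling back to 'univ' for words found in no dictionary."""
    labels = []
    for tok in tokens:
        matches = [lang for lang in PRIORITY if tok.lower() in DICTIONARIES[lang]]
        labels.append(matches[0] if matches else "univ")
    return labels

print(label_words("Yaar tu to GOD hain tui JU te ki korchis".split()))
```

Note how the out-of-vocabulary token "JU" falls through to "univ", and how a word like "to" (valid in both Hindi and English) is decided purely by the fixed priority order, one of the weaknesses that supervised methods address.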
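Likewise, the "contextual clues" used in the supervised word-level experiments can be pictured as features drawn from neighbouring tokens. The particular feature names and window below are hypothetical, not the paper's feature set:

```python
def word_features(tokens, i, context=1):
    """Build a feature dict for token i, optionally adding features for the
    `context` tokens on either side (sentence boundaries marked <s>/</s>)."""
    feats = {
        "word": tokens[i].lower(),          # the token itself
        "suffix3": tokens[i][-3:].lower(),  # crude morphological cue
        "is_caps": tokens[i].isupper(),     # all-caps tokens, e.g. "GOD"
    }
    for d in range(1, context + 1):
        feats[f"prev{d}"] = tokens[i - d].lower() if i - d >= 0 else "<s>"
        feats[f"next{d}"] = tokens[i + d].lower() if i + d < len(tokens) else "</s>"
    return feats

print(word_features(["tui", "JU", "te"], 1))
```

Dropping the loop over neighbours gives the "without contextual clues" setting; the same feature dictionaries can also feed a sequence labeller such as a Conditional Random Field.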