Code Mixing: A Challenge for Language Identification in the Language of Social Media

Utsab Barman, Amitava Das†, Joachim Wagner and Jennifer Foster
CNGL Centre for Global Intelligent Content, National Centre for Language Technology
School of Computing, Dublin City University, Dublin, Ireland
{ubarman, jwagner, jfoster}@computing.dcu.ie
†Department of Computer Science and Engineering, University of North Texas, Denton, Texas, USA
[email protected]

Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.

1 Introduction

Automatic processing and understanding of Social Media Content (SMC) is currently attracting much attention from the Natural Language Processing research community. Although English is still by far the most popular language in SMC, its dominance is receding. Hong et al. (2011), for example, applied an automatic language detection algorithm to over 62 million tweets to identify the 10 most popular languages on Twitter. They found that only half of the tweets were in English. Moreover, mixing multiple languages together (code mixing) is a popular trend among social media users from language-dense areas (Cárdenas-Claros and Isharyanti, 2009; Shafie and Nayan, 2013). In a scenario where speakers switch between languages within a conversation, a sentence or even a word, the task of automatic language identification becomes increasingly important to facilitate further processing.

Speakers whose first language uses a non-Roman alphabet often write using the Roman alphabet for convenience (phonetic typing), which increases the likelihood of code mixing with a Roman-alphabet language. This can be especially observed in South-East Asia and in the Indian sub-continent. The following is a code mixing comment taken from a Facebook group of Indian university students:

Original: Yaar tu to, GOD hain. tui JU te ki korchis? Hail u man!
Translation: Buddy you are GOD. What are you doing in JU? Hail u man!

This comment is written in three languages: English, Hindi (italics), and Bengali (boldface). For Bengali and Hindi, phonetic typing has been used.

We follow in the footsteps of recent work on language identification for SMC (Hughes et al., 2006; Baldwin and Lui, 2010; Bergsma et al., 2012), focusing specifically on the problem of word-level language identification for code mixing SMC. Our corpus for this task is collected from Facebook and contains instances of Bengali(BN)-English(EN)-Hindi(HI) code mixing.

The paper is organized as follows: in Section 2, we review related research in the area of code mixing and language identification; in Section 3, we describe our code mixing corpus, the data itself and the annotation process; in Section 4, we list the tools and resources which we use in our language identification experiments, described in Section 5. Finally, in Section 6, we conclude and provide suggestions for future research on this topic.

Proceedings of The First Workshop on Computational Approaches to Code Switching, pages 13–23, October 25, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

2 Background and Related Work

The problem of language identification has been investigated for half a century (Gold, 1967) and that of computational analysis of code switching for several decades (Joshi, 1982), but there has been less work on automatic language identification for multilingual code-mixed texts. Before turning to that topic, we first briefly survey studies on the general characteristics of code mixing.

Code mixing is a normal, natural product of bilingual and multilingual language use. Significant studies of the phenomenon can be found in the linguistics literature (Milroy and Muysken, 1995; Alex, 2008; Auer, 2013). These works mainly discuss the sociological and conversational necessities behind code mixing as well as its linguistic nature. Scholars distinguish between inter-sentence, intra-sentence and intra-word code mixing.

Several researchers have investigated the reasons for and the types of code mixing. Initial studies on Chinese-English code mixing in Hong Kong (Li, 2000) and Macao (San, 2009) indicated that mainly linguistic motivations were triggering the code mixing in those highly bilingual societies. Hidayat (2012) showed that Facebook users tend to prefer inter-sentential switching over intra-sentential switching, and reports that 45% of the switching was instigated by real lexical needs, 40% was used for talking about a particular topic, and 5% for content clarification. The predominance of inter-sentential code mixing in social media text was also noted in the study by San (2009), which compared the mixing in blog posts to that in the spoken language of Macao. Dewaele (2010) claims that ‘strong emotional arousal’ increases the frequency of code mixing. Dey and Fung (2014) present a speech corpus of English-Hindi code mixing in student interviews and analyse the motivations for code mixing and the grammatical contexts in which it occurs.

Turning to the work on automatic analysis of code mixing, there have been some studies on detecting code mixing in speech (Solorio and Liu, 2008a; Weiner et al., 2012). Solorio and Liu (2008b) try to predict the points inside a set of spoken Spanish-English sentences where the speakers switch between the two languages. Other studies have looked at code mixing in different types of short texts, such as information retrieval queries (Gottron and Lipka, 2010) and SMS messages (Farrugia, 2004; Rosner and Farrugia, 2007). Yamaguchi and Tanaka-Ishii (2012) perform language identification using artificial multilingual data, created by randomly sampling text segments from monolingual documents. King and Abney (2013) use weakly semi-supervised methods to perform word-level language identification on a dataset of 30 languages. They explore several language identification approaches, including a Naive Bayes classifier for individual word-level classification and sequence labelling with Conditional Random Fields trained with Generalized Expectation criteria (Mann and McCallum, 2008; Mann and McCallum, 2010), which achieves the highest scores. Another very recent work on this topic is that of Nguyen and Doğruöz (2013), who report on language identification experiments performed on Turkish and Dutch forum data using language models, dictionaries, logistic regression classification and Conditional Random Fields. They find that language models are more robust than dictionaries and that contextual information is helpful for the task.

3 Corpus Acquisition

Taking into account the claim that code mixing is frequent among speakers who are multilingual and younger in age (Cárdenas-Claros and Isharyanti, 2009), we choose an Indian student community in the 20-30 year age group as our data source. India is a country with 30 spoken languages, among which 22 are official. Code mixing is very frequent in the Indian sub-continent because languages change within very short geographical distances and people generally have a basic knowledge of their neighboring languages.

A Facebook group¹ and 11 Facebook users (known to the authors) were selected to obtain publicly available posts and comments. The Facebook graph API explorer was used for data collection. Since these Facebook users are from West Bengal, the most dominant language is Bengali

¹https://www.facebook.com/jumatrimonial

(Native Language), followed by English and then Hindi (National Language of India). The posts and comments in Bengali and Hindi script were discarded during data collection, resulting in 2335 posts and 9813 comments.

3.1 Annotation

Four annotators took part in the annotation task. Three were computer science students and the other was one of the authors. The annotators are proficient in all three languages of our corpus. A simple annotation tool was developed which enabled these annotators to identify and distinguish the different languages present in the content by tagging them. Annotators were supplied with 4 basic tags (viz. sentence, fragment, inclusion and wlcm (word-level code mixing)) to annotate different levels of code mixing. Under each tag, six attributes were provided, viz. English (en), Bengali (bn), Hindi (hi), Mixed (mixd), Universal (univ) and Undefined (undef). The attribute univ is associated with symbols, numbers, emoticons and universal expressions (e.g. hahaha, lol). The attribute undef is specified for a sentence or a word to which no language tag can be attributed and which cannot be categorized as univ. In addition, annotators were instructed to annotate named entities separately. What follows are descriptions of each of the annotation tags.

Sentence (sent): This tag refers to a sentence and can be used to mark inter-sentential code mixing. Annotators were instructed to identify a sentence with its base language (e.g. en, bn, hi and mixd) or with other types (e.g. univ, undef) as the first task of annotation. Only the attribute mixd is used to refer to a sentence which contains multiple languages in the same proportion. A sentence may contain any number of inclusions, fragments and word-level code mixing. A sentence can be attributed as univ if and only if it contains symbols, numbers, emoticons and chat acronyms, and no other words (Hindi, English or Bengali). A sentence can be attributed as undef if it is not a sentence marked as univ and has words/tokens that cannot be categorized as Hindi, English or Bengali. Some examples of sentence-level annotations are the following:

1. English-Sentence:
[sent-lang=“en”] what a.....6 hrs long...but really nice tennis.... [/sent]
2. Bengali-Sentence:
[sent-lang=“bn”] shubho nabo borsho.. :) [/sent]
3. Hindi-Sentence:
[sent-lang=“hi”] karwa sachh ..... :( [/sent]
4. Mixed-Sentence:
[sent-lang=“mixd”][frag-lang=“hi”] oye hoye ..... angreji me kahte hai ke [/frag][frag-lang=“en”] I love u.. !!! [/frag][/sent]
5. Univ-Sentence:
[sent-lang=“univ”] hahahahahahah....!!!!! [/sent]
6. Undef-Sentence:
[sent-lang=“undef”] Hablando de una triple amenaza. [/sent]

Fragment (frag): This refers to a group of grammatically related foreign words in a sentence. The presence of this tag in a sentence conveys that intra-sentential code mixing has occurred within the sentence boundary. Identification of fragments (if present) in a sentence was the second task of annotation. A sentence (sent) with attribute mixd must contain multiple fragments (frag), each with a specific language attribute. In the fourth example above, the sentence contains a Hindi fragment oye hoye ..... angreji me kahte hai ke and an English fragment I love u.. !!!, hence it is considered a mixd sentence. A fragment can have any number of inclusions and instances of word-level code mixing. In the first example below, Jio is a popular Bengali word appearing in the English fragment Jio.. good joke, hence tagged as a Bengali inclusion. One can argue that the word Jio could be a separate Bengali inclusion (i.e. tagged as a Bengali inclusion outside the English fragment), but looking at the syntactic pattern and the sense expressed by the comment, the annotator kept it as a single unit. In the second example below, an instance of word-level code mixing, typer, has been found in an English fragment (where the root English word type has the Bengali suffix r).

1. Fragment with Inclusion:
[sent-lang=“mixd”][frag-lang=“en”][incl-lang=“bn”] Jio.. [/incl] good joke [/frag][frag-lang=“bn”] ”amar Babin” [/frag][/sent]
2. Fragment with Word-Level Code Mixing:
[sent-lang=“mixd”][frag-lang=“en”] ” I will find u and marry you ” [/frag][frag-lang=“bn”][wlcm-type=“en-and-bn-suffix”] typer [/wlcm] hoe glo to! :D [/frag][/sent]

Inclusion (incl): An inclusion is a foreign word or phrase in a sentence or in a fragment which is assimilated into, or used very frequently in, the native language. Identification of inclusions can be performed after annotating a sentence and its fragments (if present in that sentence). An inclusion within a sentence or fragment also denotes intra-sentential code mixing. In the example below, seriously is an English inclusion which is assimilated in today's colloquial Bengali and Hindi. The only tag that an inclusion may contain is word-level code mixing.

1. Sentence with Inclusion:
[sent-lang=“bn”] Na re [incl-lang=“en”] seriously [/incl] ami khub kharap achi. [/sent]

Word-Level Code Mixing (wlcm): This is the smallest unit of code mixing. This tag was introduced to capture intra-word code mixing and denotes cases where code mixing has occurred within a single word. Identifying word-level code mixing is the last task of annotation. Annotators were told to record the type of word-level code mixing in the form of an attribute in (Base Language + Second Language) format. Some examples are provided below. In the first example, the root word class is English and e is a Bengali suffix that has been added. In the third example, the opposite can be observed – the root word Kando is Bengali, and an English suffix z has been added. In the second example, a named entity suman is present with a Bengali suffix er.

1. Word-Level Code Mixing (EN-BN):
[wlcm-type=“en-and-bn-suffix”] classe [/wlcm]
2. Word-Level Code Mixing (NE-BN):
[wlcm-type=“NE-and-bn-suffix”] sumaner [/wlcm]
3. Word-Level Code Mixing (BN-EN):
[wlcm-type=“bn-and-en-suffix”] kandoz [/wlcm]

3.1.1 Inter-Annotator Agreement

We calculate word-level inter-annotator agreement (Cohen's Kappa) on a randomly selected subset of 100 comments between two annotators. Two annotators are in agreement about a word if they both annotate the word with the same attribute (en, bn, hi, univ, undef), regardless of whether the word is inside an inclusion, fragment or sentence. Our observations that the word-level annotation process is not very ambiguous and that the annotation instructions are straightforward are confirmed by a high inter-annotator agreement (IAA), with a Kappa value of 0.884.

3.2 Data Characteristics

Tag-level and word-level statistics of the annotated data, which reveal the characteristics of our data set, are presented in Table 1 and Table 2 respectively. More than 56% of all sentences and almost 40% of all tokens are in Bengali, which is the dominant language of this corpus. English is the second most dominant language, covering almost 33% of tokens and 35% of sentences. The amount of Hindi data is substantially lower – nearly 1.75% of tokens and 2% of sentences. However, English inclusions (84% of all inclusions) are more prominent than Hindi or Bengali inclusions, and there is a substantial number of English fragments (almost 52% of all fragments) in our corpus. This means that English is the main language involved in the code mixing.

Table 1: Tag-level statistics

Tags          En     Bn     Hi   Mixd  Univ   Undef
sent          5,370  8,523  354  204   746    15
frag          288    213    40   0     6      0
incl          7,377  262    94   0     1,032  1
wlcm          477
Named Entity  3,602
Acronym       691

Table 2: Word-level statistics

Tag    Count
EN     66,298
BN     79,899
HI     3,440
WLCM   633
NE     5,233
ACRO   715
UNIV   39,291
UNDEF  61

3.2.1 Code Mixing Types

In our corpus, inter- and intra-sentential code mixing are more prominent than word-level code mixing, which is similar to the findings of Hidayat (2012). Our corpus contains every type of code mixing in English, Hindi and Bengali described in the previous section, viz. inter-sentential, intra-sentential and word-level. Some examples of the different types of code mixing in our corpus are presented below.
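For concreteness, the word-level agreement measure used above can be computed as follows. This is the standard textbook formulation of Cohen's Kappa over two annotators' aligned tag sequences, not the authors' code:

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Cohen's Kappa for two annotators' aligned word-level tag sequences."""
    assert len(tags_a) == len(tags_b) and tags_a
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement from each annotator's marginal tag distribution.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum((freq_a[t] / n) * (freq_b[t] / n) for t in freq_a)
    return (observed - expected) / (1 - expected)
```

With the five attributes used here (en, bn, hi, univ, undef), a value of 0.884 indicates very strong agreement.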

1. Inter-Sentential:
[sent-lang=“hi”] Itna izzat diye aapne mujhe !!! [/sent]
[sent-lang=“en”] Tears of joy. :’( :’( [/sent]
2. Intra-Sentential:
[sent-lang=“bn”][incl-lang=“en”] by d way [/incl] ei [frag-lang=“en”] my craving arms shall forever remain empty .. never hold u close .. [/frag] line ta baddo [incl-lang=“en”] cheezy [/incl] :P ;) [/sent]
3. Word-Level:
[sent-lang=“bn”][incl-lang=“en”] 1st yr [/incl] eo to ei [wlcm-type=“en+bnSuffix”] tymer [/wlcm] modhye sobar jute jay .. [/sent]

3.2.2 Ambiguous Words

Annotators were instructed to tag an English word as English irrespective of any influence of word borrowing or foreign inclusion, but an inspection of the annotations revealed that English words were sometimes annotated as Bengali or Hindi. To understand this phenomenon, we processed the list of language (EN, BN and HI) word types (26,475 in total) and computed the percentage of types that were not always annotated with the same language throughout the corpus. The results are presented in Table 3. Almost 7% of all types are ambiguous (i.e. tagged with different languages during annotation). Among them, a substantial amount (5.58%) are English/Bengali.

Table 3: Statistics of ambiguous and monolingual word types

Label(s)        Count   Percentage
EN              9,109   34.40
BN              14,345  54.18
HI              1,039   3.92
EN or BN        1,479   5.58
EN or HI        61      0.23
BN or HI        277     1.04
EN or BN or HI  165     0.62

There are two reasons why this is happening:

Same Words Across Languages: Some words are the same in Hindi and Bengali (e.g. baba, maa, na, khali) because both languages originated from Sanskrit and share a good amount of common vocabulary. The same happens between English and Hindi and between English and Bengali as a result of word borrowing. Most of these are commonly used inclusions like clg, dept, question, cigarette, and topic. Sometimes the annotators were careful enough to tag such words as English, and sometimes these words were tagged with the annotators' native languages. During cross-checking of the annotated data, the same error patterns were observed for multiple annotators, i.e. tagging commonly used foreign words as the native language. This demonstrates that these English words are highly assimilated in the conversational vocabulary of Bengali and Hindi.

Phonetic Similarity of Spellings: Due to phonetic typing, some words share the same surface form across two, and sometimes three, languages. As an example, to is a word in all three languages: it occurs 1209 times as English, 715 times as Bengali and 55 times as Hindi in our data. The meanings of such words (e.g. to, bolo, die) differ across languages. This phenomenon is perhaps exacerbated by the trend towards short and noisy spelling in SMC.

4 Tools and Resources

We have used the following resources and tools in our experiments.

Dictionaries

1. British National Corpus (BNC): We compile a word frequency list from the BNC (Aston and Burnard, 1998).
2. SEMEVAL 2013 Twitter Corpus (SemevalTwitter): To cope with the language of social media we use the SEMEVAL 2013 (Nakov et al., 2013) training data for the Twitter sentiment analysis task. This data comes from a popular social media site and hence is likely to reflect the linguistic properties of SMC.
3. Lexical Normalization List (LexNormList): Spelling variation is a well-known phenomenon in SMC. We use a lexical normalization dictionary created by Han et al. (2012) to handle the different spelling variations in our data.

Machine Learning Toolkits

1. WEKA: We use the Weka toolkit (Hall et al., 2009) for our experiments in decision tree training.
2. MALLET: CRF learning is applied using the MALLET toolkit (McCallum, 2002).

3. Liblinear: We apply Support Vector Machine (SVM) learning with a linear kernel using the Liblinear package (Fan et al., 2008).

NLP Tools: For data tokenization we used the CMU Tweet-Tokenizer (Owoputi et al., 2013).

5 Experiments

Since our training data is entirely labelled at the word level by human annotators, we address the word-level language identification task in a fully supervised way.

Out of the total data, 15% is set aside as a blind test set, while the rest is employed in our experiments through a 5-fold cross-validation setup. There is a substantial amount of token overlap between the cross-validation data and the test set – 88% of the English tokens, 86% of the Bengali tokens and 57% of the Hindi tokens of the test set are present in the cross-validation data.²

²We found 25 comments and 17 posts common between the cross-validation data and the test set. The reason for this is that users of social media often express themselves in a concise way: almost all of these common items consisted of 1 to 3 tokens, in most cases emoticons, symbols or universal expressions such as wow and lol. As the percentage of these comments is low, we keep them as they are.

We address the problem of word-level language identification in three different ways:

1. A simple heuristic-based approach which uses a combination of our dictionaries to classify the language of a word
2. Word-level classification using supervised machine learning with SVMs but no contextual information
3. Word-level classification using supervised machine learning with SVMs and sequence labelling using CRFs, both employing contextual information

Named entities and instances of word-level code mixing are excluded from evaluation. For systems which do not take the context of a word into account, i.e. the dictionary-based approach (Section 5.1) and the SVM approach without contextual clues (Section 5.2), named entities and instances of word-level code mixing can also be safely excluded from training. For systems which do take context into account, the CRF system (Section 5.3.1) and the SVM system with contextual clues (Section 5.3.2), these are included in training, because excluding them would result in unrealistic contexts. This means that these systems can classify a word as a named entity or an instance of word-level code mixing. To avoid this, we implement a post-processor which backs off in these cases to a system which hasn't seen named entities or word-level code mixing in training (see Section 5.3).

5.1 Dictionary-Based Detection

We start with dictionary-based language detection. Generally, a dictionary-based language detector predicts the language of a word based on its frequency in multiple language dictionaries. In our data the Bengali and Hindi tokens are phonetically typed, and since no such transliterated dictionary is, to our knowledge, available for Bengali and Hindi, we use the training set words as dictionaries. For words that have multiple annotations in the training data (ambiguous words), we select the majority tag based on frequency, e.g. the word to will always be tagged as English.

Our English dictionaries are those described in Section 4 (BNC, LexNormList, SemEvalTwitter) and the training set words. For LexNormList, we have no frequency information, and so we treat it as a simple word list. To predict the language of a word, the dictionaries with normalized frequency (BNC, SemEvalTwitter, Training Data) are consulted first; if the word is not found, word-list look-up is performed. The predicted language is chosen based on the dominant language(s) of the corpus if the word appears in multiple dictionaries with the same frequency or if the word does not appear in any dictionary or list.

A simple rule-based method is applied to predict universal expressions. A token is considered univ if any of the following conditions is satisfied:

• All characters of the token are symbols or numbers.
• The token contains certain repetitions identified by regular expressions (e.g. hahaha).
• The token is a hash-tag, a URL or a mention-tag (e.g. @Sumit).
• The token (e.g. lol) is identified by a word list compiled from the relevant 4/5th of the training data.

Table 4 shows the results of dictionary-based detection obtained by averaging over the 5 cross-validation folds. We try different combinations and frequency thresholds of the above dictionaries. We find that using a normalized frequency is helpful

and that a combination of the LexNormList and Training Data dictionaries is best suited to our data. Hence, we consider this as our baseline language identification system.

Table 4: Average cross-validation accuracy of dictionary-based detection

Dictionary                           Accuracy (%)
BNC                                  80.09
SemevalTwitter                       77.61
LexNormList                          79.86
Training Data                        90.21
LexNormList+TrainingData (Baseline)  93.12

5.2 Word-Level Classification without Contextual Clues

The following feature types are employed:

1. Char-n-grams (G): We start with a character-n-gram-based approach (Cavnar and Trenkle, 1994), which is the most common approach, followed by many language identification researchers. Following the work of King and Abney (2013), we select character n-grams (n = 1 to 5) and the word itself as the features in our experiments.
2. Presence in Dictionaries (D): We use presence in a dictionary as a feature for all dictionaries available from the previous experiments.
3. Length of words (L): Instead of using the raw length value as a feature, we follow our previous work (Rubino et al., 2013; Wagner et al., 2014) and create multiple features for length using a decision tree (J48). We use length as the only feature to train a decision tree for each fold and use the nodes obtained from the tree to create boolean features.
4. Capitalization (C): We use 3 boolean features to encode capitalization information: whether any letter in the word is capitalized, whether all letters in the word are capitalized and whether the first letter is capitalized.

We perform experiments with an SVM classifier (linear kernel) for different combinations of these features. Parameter optimization (C in the range 2⁻¹⁵ to 2¹⁰) for the SVM is performed for each feature set,³ and the best cross-validation accuracy is found for the GDLC-based run (94.75%) at C=1 (see Table 5 and Fig. 1).

³According to Hsu et al. (2010), an SVM linear kernel with parameter C optimization is good enough when dealing with a large number of features. Though an RBF kernel can be more effective than a linear one, this is possible only after proper optimization of the C and γ parameters, which is computationally expensive for such a large feature set.

Table 5: Average cross-validation accuracy for SVM word-level classification (without context), G = char-n-gram, L = binary length features, D = presence in dictionaries and C = capitalization features

Features  Accuracy   Features  Accuracy
G         94.62      GD        94.67
GL        94.62      GDL       94.73
GC        94.64      GDC       94.72
GLC       94.64      GDLC      94.75

Figure 1: Average cross-validation accuracy for SVM word-level classification (without context): cube visualization of the feature combinations (G: 94.62%, GL: 94.62%, GC: 94.64%, GD: 94.67%, GLC: 94.64%, GDL: 94.73%, GDC: 94.67%→94.72%, GDLC: 94.75%), G = char-n-gram, L = binary length features, D = presence in dictionaries and C = capitalization features

We also investigate the use of a dictionary-to-char-n-gram back-off model – the idea is to apply the char-n-gram model SVM-GDLC to those words for which a majority-based decision is taken during dictionary-based detection. However, it does not outperform the SVM. Hence, we select SVM-GDLC for the next steps of our experiments as the best exemplar of an individual word-level classifier (without contextual clues).

5.3 Language Identification with Contextual Clues

Contextual clues can play a very important role in word-level language identification. As an example, part of a comment from cross-validation fold 1 is presented below, containing the word die, which is wrongly classified by the SVM classifier. The frequency of die in the training set of fold 1 is 6 for English, 31 for Bengali and 0 for Hindi.

Gold Data: ..../univ the/en movie/en for/en which/en i/en can/en die/en for/en

19 Features Order-0 Order-1 Order-2 GDLC: 95.98% G 92.80 95.16 95.36 GD 93.42 95.59 95.98 GL 92.82 95.14 95.41 GDL 93.47 95.60 95.94 GLC: 95.02% GDL: 95.94% GDC: 95.98% GC 92.07 94.60 95.05 GDC 93.47 95.62 95.98 GLC 92.36 94.53 95.02 GL: 95.41% GC: 95.05% GD: 95.98% GDLC 93.47 95.58 95.98

Table 6: Average cross-validation accuracy of G: 95.36% CRF-based predictions where G = char-n-gram, L = length feature, D = single dictionary-based la- Figure 2: CRF Order-2 results: cube visualisation bels (baseline system) and C = capitalization fea- G = char-n-gram, L = binary length features, D tures = presence in dictionaries and C = capitalization features ...../univ Context Accuracy (%) SVM Output: ..../univ the/en GDLC + P1 94.66 GDLC + P2 94.55 movie/en for/en which/en i/en can/en GDLC + N1 94.53 die/bn for/en ...../univ GDLC + N2 94.37 GDLC + P1N1 95.14 We now investigate whether contextual informa- GDLC + P2N2 94.55 tion can correct the mis-classified tags. Although named entities and word-level code Table 7: Average cross-validation accuracy of mixing are excluded from evaluation, when deal- SVM (GDLC) context-based runs, where P-i = ing with context it is important to consider named previous i word(s) , N-i = next i word(s) entity and word-level code mixing during training because these may contain some important infor- in SVM-based runs). However, we observe that mation. We include these tokens in the training accuracy of CRF based runs decreases when bi- data for our context-based experiments, labelling narized length features (see Section 5.2 and dic- them as other. The presence of this new label may tionary features (a feature for each dictionary) are affect the prediction for a language token during involved. Hence, we use the dictionary-based pre- classification and sequence labelling. To avoid this dictions of the baseline system to generate a single situation, a 4-way (bn, hi, en, univ) backoff classi- dictionary feature for each token and only the raw fier is trained separately on English, Hindi, Ben- length value of a token instead of binarized length gali and universal tokens. During evaluation of features. The results are presented in Table 6 and any context-based system we discard named en- the second order results are visualized in Fig. 
2. tity and word-level code mixing from the predic- As expected, the performance increases as the tion of that system. If any of the remaining tokens order increases from zero to one and two. The use is predicted as other we back off to the decision of a single dictionary feature is also helpful. The of the 4-way classifier for that token. For the CRF results for GDC, GDLC, and GD based runs are experiments (Section 5.3.1), the backoff classifier almost similar (95.98%). However, we choose the is a CRF system, and, for the SVM experiments GDC system because it performed slightly better (Section 5.3.2), the backoff classifier is an SVM (95.989%) than the GDLC (95.983%) and the GD system. (95.983%) systems. 5.3.1 Conditional Random Fields (CRF) 5.3.2 SVM with Context As our goal is to apply contextual clues, we first employ Conditional Random Fields (CRF), an ap- We also add contextual clues to our SVM classi- proach which takes history into account in pre- fier. To obtain contextual information we include dicting the optimal sequence of labels. We em- the previous and next two words as features in 4 ploy a linear chain CRF with an increasing or- the SVM-GDLC-based run. All possible com- der (Order-0, Order-1 and Order-2) with 200 it- 4We also experimented with extracting all GDLC features erations for different feature combinations (used for the context words but this did not help.

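The context features of Section 5.3.2 amount to augmenting each token's feature set with the surface forms of its neighbouring words. A minimal sketch, with padding at sentence boundaries (function, feature and padding names are ours, not from the paper's implementation):

```python
# Illustrative sketch of the P-i/N-i context features of Section 5.3.2:
# each token's feature dict gains the previous and next word forms.

def context_features(tokens, prev_n=1, next_n=1, pad="<PAD>"):
    feats = []
    for i, tok in enumerate(tokens):
        f = {"token": tok}
        for j in range(1, prev_n + 1):
            f[f"prev_{j}"] = tokens[i - j] if i - j >= 0 else pad
        for j in range(1, next_n + 1):
            f[f"next_{j}"] = tokens[i + j] if i + j < len(tokens) else pad
        feats.append(f)
    return feats

feats = context_features(["i", "can", "die", "for"])
print(feats[2])  # {'token': 'die', 'prev_1': 'can', 'next_1': 'for'}
```

With prev_n=1 and next_n=1 this corresponds to the P1N1 setting; prev_n=2 and next_n=2 gives P2N2.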
All possible combinations are considered during the experiments (Table 7). After C parameter optimization, the best cross-validation accuracy is found for the P1N1 (one word previous and one word next) run with C=0.125 (95.14%).

5.4 Test Set Results

We apply our best dictionary-based system, our best SVM systems (with and without context) and our best CRF system to the held-out test set. The results are shown in Table 8. Our best result is achieved using the CRF model (95.76%).

5.5 Error Analysis

Manual error analysis shows the limitations of these systems. The word-level classifier without contextual clues does not perform well on Hindi data, for which the number of tokens is quite low: only 2.4% (4,658) of the tokens in the training data are Hindi, of which 55.36% are bilingually ambiguous and 29.51% are tri-lingually ambiguous. Individual word-level systems often fail to assign the proper labels to ambiguous words, but adding context information helps to overcome this problem. Considering the previous example of die, both the context-based SVM and CRF systems classify it properly. Though the final CRF-GDC system performs well, it also has limitations, failing to identify the language of tokens which appear very frequently in all three languages (e.g. are, na, pic).

6 Conclusion

We have presented an initial study on automatic language identification for Indian-language code mixing in social media communication. We described our dataset of Bengali-Hindi-English Facebook comments and presented the results of our word-level classification experiments on this dataset. Our experimental results lead us to conclude that character n-gram features are useful for this task, that contextual information is also important, and that information from dictionaries can be effectively incorporated as features.

In the future we plan to apply the techniques and feature sets that we used in these experiments to other datasets. We have already started this by applying variants of the systems presented here to the Nepali-English and Spanish-English datasets which were introduced as part of the 2014 code mixing shared task (Solorio et al., 2014; Barman et al., 2014). We did not include word-level code mixing in our experiments; in future experiments we will explore ways to identify and segment this type of code mixing. It will also be important to find the best way to handle inclusions, since there is a fine line between word borrowing and code mixing.

Acknowledgements

This research is supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of CNGL (www.cngl.ie) at Dublin City University. The authors wish to acknowledge the DJEI/DES/SFI/HEA for the provision of computational facilities and support. Our special thanks to Soumik Mandal from Jadavpur University, India, for coordinating the annotation task. We also thank the administrator of JUMatrimonial and the 11 Facebook users who agreed that we can use their posts, for their support and permission.

References

Beatrice Alex. 2008. Automatic detection of English inclusions in mixed-lingual data with an application to parsing. Ph.D. thesis, School of Informatics, The University of Edinburgh, Edinburgh, UK.

Guy Aston and Lou Burnard. 1998. The BNC handbook: exploring the British National Corpus with SARA. Capstone.

Peter Auer. 2013. Code-Switching in Conversation: Language, Interaction and Identity. Routledge.

Timothy Baldwin and Marco Lui. 2010. Language identification: The long and the short of the matter. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 229-237. Association for Computational Linguistics.

Utsab Barman, Joachim Wagner, Grzegorz Chrupała, and Jennifer Foster. 2014. DCU-UVT: Word-level language classification with code-mixed data. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, EMNLP 2014, Doha, Qatar. Association for Computational Linguistics.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pages 65-74. Association for Computational Linguistics.

System                  Precision (%)                  Recall (%)                 Accuracy
                        EN     BN     HI     UNIV      EN     BN     HI     UNIV  (%)
Baseline (Dictionary)   92.67  90.73  80.64  99.67     92.28  94.63  43.47  94.99  93.64
SVM-GDLC                92.49  94.89  80.31  99.34     96.23  94.28  44.92  97.07  95.21
SVM-P1N1                93.51  95.56  83.18  99.42     96.63  95.23  55.94  96.95  95.52
CRF-GDC                 94.77  94.88  91.86  99.34     95.65  96.22  55.65  97.73  95.76

Table 8: Test set results for Baseline (Dictionary), SVM-GDLC, SVM-P1N1 and CRF-GDC.

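The per-label precision and recall figures in Table 8 can be derived from gold and predicted token labels. A generic sketch (our own code, not the paper's evaluation script):

```python
# Per-label precision and recall over aligned gold/predicted token labels.

def precision_recall(gold, pred, label):
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    pred_n = sum(1 for p in pred if p == label)
    gold_n = sum(1 for g in gold if g == label)
    precision = tp / pred_n if pred_n else 0.0
    recall = tp / gold_n if gold_n else 0.0
    return precision, recall

gold = ["en", "bn", "en", "hi", "en"]
pred = ["en", "bn", "bn", "hi", "en"]
print(precision_recall(gold, pred, "en"))  # -> (1.0, 0.6666666666666666)
```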
MS Cardenas-Claros and N Isharyanti. 2009. Code-switching and code-mixing in internet chatting: between 'yes', 'ya' and 'si' - a case study. The JALT CALL Journal, 5(3):67-78.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Theo Pavlidis, editor, Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, pages 161-175.

Jean-Marc Dewaele. 2010. Emotions in Multiple Languages. Palgrave Macmillan.

Anik Dey and Pascale Fung. 2014. A Hindi-English code-switching corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2410-2413, Reykjavik, Iceland. European Language Resources Association (ELRA).

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

Paulseph-John Farrugia. 2004. TTS pre-processing issues for mixed language support. In Proceedings of CSAW'04, the second Computer Science Annual Workshop, pages 36-41. Department of Computer Science & A.I., University of Malta.

E. Mark Gold. 1967. Language identification in the limit. Information and Control, 10(5):447-474.

Thomas Gottron and Nedim Lipka. 2010. A comparison of language identification approaches on short, query-style texts. In Advances in Information Retrieval, pages 611-614. Springer.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10-18.

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421-432. Association for Computational Linguistics.

Taofik Hidayat. 2012. An analysis of code switching used by facebookers: a case study in a social network site. Student essay for the study programme "Pendidikan Bahasa Inggris" (English Education) at STKIP Siliwangi Bandung, Indonesia. http://publikasi.stkipsiliwangi.ac.id/files/2012/10/08220227-taofik-hidayat.pdf

Lichan Hong, Gregorio Convertino, and Ed H. Chi. 2011. Language matters in Twitter: A large scale study. In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM-11), pages 518-521, Barcelona, Spain. Association for the Advancement of Artificial Intelligence.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2010. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University. https://www.cs.sfu.ca/people/Faculty/teaching/726/spring11/svmguide.pdf

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485-488, Genoa, Italy.

Aravind K. Joshi. 1982. Processing of sentences with intra-sentential code-switching. In J. Horecky, editor, Proceedings of the 9th Conference on Computational Linguistics - Volume 1 (COLING'82), pages 145-150. Academia Praha, North-Holland Publishing Company.

Ben King and Steven Abney. 2013. Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1110-1119, Atlanta, Georgia. Association for Computational Linguistics.

David C. S. Li. 2000. Cantonese-English code-switching research in Hong Kong: a Y2K review. World Englishes, 19(3):305-322.

Gideon S. Mann and Andrew McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proceedings of ACL-08: HLT, pages 870-878, Columbus, Ohio. Association for Computational Linguistics.

Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. The Journal of Machine Learning Research, 11:955-984.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu

Lesley Milroy and Pieter Muysken, editors. 1995. One Speaker, Two Languages: Cross-disciplinary Perspectives on Code-switching. Cambridge University Press.

Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 312-320, Atlanta, Georgia, USA. Association for Computational Linguistics.

Dong Nguyen and A. Seza Dogruoz. 2013. Word level language identification in online multilingual communication. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pages 857-862, Seattle, Washington, USA. Association for Computational Linguistics.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), pages 380-390, Atlanta, Georgia. Association for Computational Linguistics.

Mike Rosner and Paulseph-John Farrugia. 2007. A tagging algorithm for mixed language identification in a noisy domain. In INTERSPEECH-2007, 8th Annual Conference of the International Speech Communication Association, pages 190-193. ISCA Archive.

Raphael Rubino, Joachim Wagner, Jennifer Foster, Johann Roturier, Rasoul Samad Zadeh Kaljahi, and Fred Hollowood. 2013. DCU-Symantec at the WMT 2013 quality estimation shared task. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 392-397, Sofia, Bulgaria. Association for Computational Linguistics.

Hong Ka San. 2009. Chinese-English code-switching in blogs by Macao young people. Master's thesis, The University of Edinburgh, Edinburgh, UK. http://hdl.handle.net/1842/3626

Latisha Asmaak Shafie and Surina Nayan. 2013. Languages, code-switching practice and primary functions of Facebook among university students. Study in English Language Teaching, 1(1):187-199. http://www.scholink.org/ojs/index.php/selt

Thamar Solorio and Yang Liu. 2008a. Learning to predict code-switching points. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 973-981. Association for Computational Linguistics.

Thamar Solorio and Yang Liu. 2008b. Part-of-speech tagging for English-Spanish code-switched text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1051-1060. Association for Computational Linguistics.

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steve Bethard, Mona Diab, Mahmoud Gonheim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirshberg, Alison Chang, and Pascale Fung. 2014. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code-Switching, EMNLP 2014, Doha, Qatar. Association for Computational Linguistics.

Joachim Wagner, Piyush Arora, Santiago Cortes, Utsab Barman, Dasha Bogdanova, Jennifer Foster, and Lamia Tounsi. 2014. DCU: Aspect-based polarity classification for SemEval task 4. In Proceedings of the International Workshop on Semantic Evaluation (SemEval-2014), pages 392-397, Dublin, Ireland. Association for Computational Linguistics.

Jochen Weiner, Ngoc Thang Vu, Dominic Telaar, Florian Metze, Tanja Schultz, Dau-Cheng Lyu, Eng-Siong Chng, and Haizhou Li. 2012. Integration of language identification into a recognition system for spoken conversations containing code-switches. In Proceedings of the 3rd Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU'12), Cape Town, South Africa. International Research Center MICA.

Hiroshi Yamaguchi and Kumiko Tanaka-Ishii. 2012. Text segmentation by language using minimum description length. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 969-978. Association for Computational Linguistics.