All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of borrowing in social media

1 Jasabanta Patro, 2 Bidisha Samanta, 2 Saurabh Singh, 2 Abhipsa Basu, 2 Prithwish Mukherjee, 3 Monojit Choudhury, 4 Animesh Mukherjee
1,2,4 Indian Institute of Technology Kharagpur, India – 721302
3 Microsoft Research India, Bangalore – 560001
[email protected], [email protected], [email protected]

Abstract

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman's correlation values, our methods perform more than two times better (∼0.62) in predicting the borrowing likeliness compared to the best performing baseline (∼0.26) reported in literature. Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88% of cases the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems.

1 Introduction

In social media communication, multilingual people often switch between languages, a phenomenon known as code-switching or code-mixing (Auer, 1984). This makes language identification and tagging, which is perhaps a prerequisite for almost all other language processing tasks that follow, a challenging problem (Barman et al., 2014). In code-mixing, people are subconsciously aware of the foreign origin of the code-mixed word or phrase. A related but linguistically and cognitively distinct phenomenon is lexical borrowing (or simply, borrowing), where a word or phrase from a foreign language, say L2, is used as a part of the vocabulary of a native language, say L1. For instance, in Dutch, the English word "sale" is now used more frequently than the Dutch equivalent "uitverkoop". Some English words like "shop" are even inflected in Dutch as "shoppen" and heavily used. While it is difficult in general to ascertain whether a foreign word or phrase used in an utterance is borrowed or just an instance of code-mixing (Bali et al., 2014), one telltale sign is that only proficient multilinguals can code-mix, while even monolingual speakers can use borrowed words because, by definition, these are part of the vocabulary of a language. In other words, just because an English speaker understands and uses the word "tortilla" does not imply that she can speak or understand Spanish. A borrowed word from L2 initially appears frequently in speech, then gradually in print media like newspapers, and finally it loses its origin's identity and is used in L1, resulting in an inclusion in the dictionary of L1 (Myers-Scotton, 2002; Thomason, 2003). Borrowed words often take several years before they formally become part of the L1 dictionary. This motivates our research question: "is early-stage automatic identification of likely-to-be-borrowed words possible?". This is known to be a hard problem because (i) it is a socio-linguistic phenomenon closely related to acceptability and frequency, (ii) borrowing is a dynamic process; new borrowed words enter the lexicon of a language as old words, both native and borrowed, might slowly fade away from usage, and (iii) it is a population level phenomenon that necessitates data from a large portion of the population, unlike standard natural language corpora that typically come from a very small set of authors. Automatic identification of borrowed words in social media content (SMC) can improve language tagging by recommending the tagger to tag the language of the borrowed words as L1 instead of L2.

The above reasons motivate us to resort to social media (in particular, Twitter), where a large population of bilingual/multilingual speakers are known to often tweet in code-mixed colloquial languages (Carter et al., 2013; Solorio et al., 2014; Vyas et al., 2014; Jurgens et al., 2017; Rijhwani et al., 2017). We designed our methodology to work for any pair of languages L1 and L2 subject to the availability of sufficient SMC. In the current study, we consider Hindi as L1 and English as L2. The main stages of our research are as follows:

Metrics to quantify the likeliness of borrowing from social media signals: We define three novel and closely similar metrics that serve as social signals indicating the likeliness of borrowing. We compare the likeliness of borrowing as predicted by our model and a baseline model with that from the ground truth obtained from human judges.

Ground truth generation: We launch an extensive survey among 58 human judges of various age groups and educational backgrounds to collect responses indicating whether each of the candidate foreign words is likely borrowed.

Application: We randomly selected some words that have a high, low and medium borrowing likeliness as predicted by our metrics. Further, we randomly selected one tweet for each of the chosen words. The chosen words in almost all of these tweets have L2 as their language tag while a majority of the surrounding words have the tag L1. We asked expert annotators to re-evaluate the language tags of the chosen words and indicate if they would prefer to switch this tag from L2 to L1.

Finally, our key results are outlined below:
1. We obtained the Spearman's rank correlation between the ground-truth ranking and the ranking based on our metrics as ∼0.62 for all the three variants, which is more than double the value (∼0.26) if we use the most competitive baseline (Bali et al., 2014) available in the literature.
2. Interestingly, the responses of the judges in the age group below 30 seem to correspond even better with our metrics. Since language change is brought about mostly by the younger population, this might possibly mean that our metrics are able to capture the early signals of borrowing.
3. Those users that mix languages the least in their tweets present the best signals of borrowing in case they do mix the languages (the correlation of our metrics estimated from the tweets of these users with that of the ground truth is ∼0.65).
4. Finally, we obtain an excellent re-annotation accuracy of 88% for the words falling in the surely borrowed category as predicted by our metrics.

2 Related work

In linguistics, code-mixing and borrowing have been studied under the broader scope of language change and evolution. Linguists have for a long time focused on the sociological and the conversational necessity of borrowing and mixing in multilingual communities (see Auer (1984) and Muysken (1996) for a review). In particular, Sankoff et al. (1990) describes the complexity of choosing features that are indicative of borrowing. This work further showed that it is not always true that only highly frequent words are borrowed; nonce words could also be borrowed along with the frequent words. More recently, Nzai et al. (2014) analyzed the formal conversation of Spanish-English multilingual people and found that code mixing/borrowing is not only restricted to daily speech but is also prevalent in formal conversations. Hadei (2016) showed that phonological integration could be evaluated to understand the phenomenon of word borrowing. Along similar lines, Sebonde (2014) showed that morphological and syntactic features could be good indicators for numerical borrowings. Senaratne (2013) reported that in many languages English words are likely to be borrowed in both formal and semi-formal text.

Mixing in computer mediated communication and social media: Sotillo (2012) investigated various types of code-mixing in a corpus of 880 SMS text messages. The author observed that most often mixing takes place at the beginning of a sentence as well as through simple insertions. Similar observations about chat and email messages have been reported in (Bock, 2013; Negrón, 2009). However, studies of code-mixing with Chinese-English bilinguals from Hong Kong (Li, 2009) and Macao (San, 2009) bring forth results that contrast the aforementioned findings and indicate that in these societies code-mixing is driven more by linguistic than social motivations.

Recently, the advent of social media has immensely propelled the research on code-mixing and borrowing as dynamic social phenomena. Hidayat (2012) noted that in Facebook, users mostly preferred inter-sentential mixing and showed that 45% of the mixing originated from real lexical needs, 40% was used for conversations on a particular topic and the rest 5% for content clarification.

In contrast, Das and Gambäck (2014) showed that in the case of Facebook messages, intra-sentential mixing accounted for more than half of the cases while inter-sentential mixing accounted for only about one-third of the cases. In fact, in the First Workshop on Computational Approaches to Code Switching, a shared task on code-mixing in tweets was launched and four different code-mixed corpora were collected from Twitter as a part of the shared task (Solorio et al., 2014). The language identification task has also been handled for English-Hindi and English-Bengali code-mixed tweets in (Das and Gambäck, 2013). Part-of-speech tagging has been recently done for code-mixed English-Hindi tweets (Solorio and Liu, 2008; Vyas et al., 2014).

Diachronic studies: As an aside, it is interesting to note that the availability of huge volumes of timestamped data (tweet streams, digitized books) is now making it possible to study various linguistic phenomena quantitatively over different timescales. For instance, Sagi et al. (2009) use latent semantic analysis for detection and tracking of changes in word meaning, whereas Frermann and Lapata (2016) present a Bayesian approach for the same problem. Peirsman et al. (2010) present a distributed model for automatic identification of lexical variation between language varieties. Bamman and Crane (2011) discuss a method for automatically identifying word sense variation in a dated collection of historical books. Mitra et al. (2014) present a computational method for automatic identification of change in word senses across different timescales. Cook et al. (2014) present a method for novel sense identification of words over time.

Despite these diverse and rich research agendas in the field of code-switching and lexical dynamics, there has not been much attempt to quantify the likeliness of borrowing of foreign words in a language. The only work that makes an attempt in this direction is (Bali et al., 2014), which is described in detail in Sec 3.1. One of the primary challenges faced by any quantitative research on lexical borrowing is that borrowing is a social phenomenon, and therefore, it is difficult to identify suitable indicators of such a lexical diffusion process unless one has access to large population-level data. In this work, we show for the first time how certain simple and closely related signals encoding the language usage of social media users can help us construe appropriate metrics to quantify the likeliness of borrowing of a foreign word.

3 Methodology

In this section, we present the baseline metric and propose three new metrics that quantify the likeliness of borrowing.

3.1 Baseline metric

Baseline metric – We consider the log(F_L2/F_L1) value proposed in (Bali et al., 2014) as the baseline metric. Here F_L2 denotes the frequency of the L1 transliterated form of the word w in a standard L1 newspaper corpus. F_L1, on the other hand, denotes the frequency of the L1 translation of the word w in the same newspaper corpus. For our experiments discussed in the later sections, both the transliteration and the translation of the words have been done by a set of volunteers who are native L1 speakers. The authors in (Bali et al., 2014) claim that the more positive the value of this metric is for a word w, the higher is the likeliness of its being borrowed. The more negative the value is, the higher are the chances that the word w is an instance of code-mixing.

Ranking – Based on the values obtained from the above metric for a set of target words, we rank these words; words with high positive values feature at the top of the rank list and words with high negative values feature at the bottom of the list. For two words having the same log(F_L2/F_L1) value, we resolve the conflict by assigning each of them the average of their two rank positions. In a subsequent section, we shall compare this rank list with the one obtained from the ground truth responses.
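For concreteness, the following minimal Python sketch shows how this baseline score and its tie-aware ranking could be computed. The two frequency dictionaries are hypothetical stand-ins for counts extracted from the newspaper corpus; this is an illustration of the formula above, not the original authors' code.

```python
# A minimal sketch of the baseline of Bali et al. (2014). The dictionaries
# freq_l2_translit (counts of the L1-transliterated form of each foreign word)
# and freq_l1_trans (counts of its L1 translation) are hypothetical and are
# assumed to hold positive counts obtained from the newspaper corpus.
import math

from scipy.stats import rankdata


def baseline_score(word, freq_l2_translit, freq_l1_trans):
    """log(F_L2 / F_L1): more positive => more likely borrowed,
    more negative => more likely an instance of code-mixing."""
    return math.log(freq_l2_translit[word] / freq_l1_trans[word])


def baseline_rank_list(words, freq_l2_translit, freq_l1_trans):
    scores = [baseline_score(w, freq_l2_translit, freq_l1_trans) for w in words]
    # Higher score => better (smaller) rank; tied scores receive the average
    # of their rank positions, exactly as in the ranking step described above.
    ranks = rankdata([-s for s in scores], method="average")
    return sorted(zip(words, ranks), key=lambda pair: pair[1])
```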

3.2 Proposed metric

In this section, we present three novel and closely related metrics based on the language usage patterns of the users of social media (specifically, Twitter). In order to define our metrics, we need all the words to be language tagged. The different tags that a word can have are: L1, L2, NE (Named Entity) and Others. Based on the word level tags, we also create a tweet level tag as follows:
1. L1: Almost every word (> 90%) in the tweet is tagged as L1.
2. L2: Almost every word (> 90%) in the tweet is tagged as L2.
3. CML1: Code-mixed tweet but the majority (i.e., > 50%) of the words are tagged as L1.
4. CML2: Code-mixed tweet but the majority (i.e., > 50%) of the words are tagged as L2.
5. CMEQ: Code-mixed tweet having very similar numbers of words tagged as L1 and L2 respectively.
6. Code Switched: There is a trail of L1 words followed by a trail of L2 words or vice versa.

Using the above classification, we define the following metrics:

Unique User Ratio (UUR) – The Unique User Ratio for word usage across languages is defined as follows:

UUR(w) = (U_L1 + U_CML1) / U_L2    (1)

where U_L1 (U_L2, U_CML1) is the number of unique users who have used the word w in an L1 (L2, CML1) tweet at least once.

Unique Tweet Ratio (UTR) – The Unique Tweet Ratio for word usage across languages is defined as follows:

UTR(w) = (T_L1 + T_CML1) / T_L2    (2)

where T_L1 (T_L2, T_CML1) is the total number of L1 (L2, CML1) tweets which contain the word w.

Unique Phrase Ratio (UPR) – The Unique Phrase Ratio for word usage across languages is defined as follows:

UPR(w) = P_L1 / P_L2    (3)

where P_L1 (P_L2) is the number of L1 (L2) phrases which contain the word w. Note that unlike the definitions of UUR and UTR, which exploit the word level language tags, the definition of UPR exploits the phrase level language tags.

Ranking – We prepare a separate rank list of the target words based on each of the three proposed metrics – UUR, UTR and UPR. The higher the value of each of these metrics, the higher is the likeliness of the word w being borrowed and the higher up it is in the rank list. In a subsequent section, we shall compare these rank lists with the one prepared from the ground truth responses.
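The short Python sketch below illustrates how the three ratios could be computed from language-tagged tweets. The record layout, field names and the division-by-zero guards are our own assumptions; the sketch mirrors Equations (1)–(3) rather than reproducing the authors' implementation.

```python
# A minimal sketch of UUR, UTR and UPR under assumed data structures: each
# tweet is a dict with a user id, a tweet-level tag ("L1", "L2", "CML1",
# "CML2", "CMEQ" or "CS"), a list of words, and a list of (phrase, phrase_tag)
# pairs, where phrase tags are "Hi" (L1), "En" (L2) or "Oth".
from collections import defaultdict


def compute_ratios(tweets, target):
    users = defaultdict(set)          # tweet-level tag -> users who used `target`
    tweet_counts = defaultdict(int)   # tweet-level tag -> tweets containing `target`
    phrase_counts = defaultdict(int)  # phrase-level tag -> phrases containing `target`
    for tweet in tweets:
        if target in tweet["words"]:
            users[tweet["tag"]].add(tweet["user"])
            tweet_counts[tweet["tag"]] += 1
        for phrase, phrase_tag in tweet["phrases"]:
            if target in phrase.split():
                phrase_counts[phrase_tag] += 1
    # max(..., 1) guards the denominators against zero; the paper's
    # definitions assume the target occurs in at least one L2 context.
    uur = (len(users["L1"]) + len(users["CML1"])) / max(len(users["L2"]), 1)
    utr = (tweet_counts["L1"] + tweet_counts["CML1"]) / max(tweet_counts["L2"], 1)
    upr = phrase_counts["Hi"] / max(phrase_counts["En"], 1)  # P_L1 / P_L2
    return {"UUR": uur, "UTR": utr, "UPR": upr}
```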

4 Experiments

In this section we discuss the dataset for our experiments, the evaluation criteria and the ground truth preparation scheme.

4.1 Datasets and preprocessing

In this study, we consider code-mixed tweets gathered from Hindi-English bilingual Twitter users in order to study the effectiveness of our proposed metrics. The native language L1 is Hindi and the foreign language L2 is English. To bootstrap the data collection process, we used the language tagged tweets presented in (Rudra et al., 2016). In addition to this, we also crawled tweets (between Nov 2015 and Jan 2016) related to 28 hashtags representing different Indian contexts covering important topics such as sports, religion, movies, politics etc. This process results in a set of 811981 tweets. We language-tag (see details later in this section) each tweet so crawled and find that there are 3577 users who use mixed language for tweeting. Next, we systematically crawled the timelines of these 3577 users between Feb 2016 and March 2016 to gather more mixed language tweets. Using this two step process we collected a total of 1550714 distinct tweets. From this data, we filtered out tweets that are not written in romanized script, tweets having only URLs and tweets having empty content. Post filtering we obtained 725173 tweets, which we use for the rest of the analysis. The datasets can be downloaded from http://cnerg.org/borrow

Language tagging: We tag each word in a tweet with the language of its origin using the method outlined in (Gella et al., 2013). Hi represents a predominantly Hindi tweet, En represents a predominantly English tweet, CMH (CME) represents code-mixed tweets with more Hindi (English) words, CMEQ represents code-mixed tweets with an almost equal number of Hindi and English words and CS represents code-switched tweets (the number and % of tweets in each of the above six categories are presented in the supplementary material). Like the word level, the tagger also provides a phrase level language tag. Once again, the different tags that an entire phrase can have are: En, Hi and Oth (Other). The metrics defined in the previous section are computed using these language tags.

Newspaper dataset for the baseline: As we had discussed in the previous section, for the construction of the baseline ranking we need to resort to counting the frequency of the foreign words (i.e., English words) and their Hindi translations in a newspaper corpus, as has been outlined in (Bali et al., 2014). For this purpose, we use the FIRE dataset built from the Hindi Jagaran newspaper corpus [1], which is written in Devanagari script.

[1] Jagaran corpus: http://fire.irsi.res.in/fire/static/data

4.2 Target word selection

We first compute the most frequent foreign words (i.e., English words) in our tweet corpus. Since we are interested in the frequency of an English word only when it appears as a foreign word, we do not consider (i) the Hi tweets since they do not have any foreign word, (ii) the En tweets since here the English words are not foreign words and (iii) the code-switched tweets. Based on the frequency of usage of English as a foreign word, we select the top 1000 English words. Removal of stop words and text normalization leaves behind 230 nouns (see supplementary material for the list of words).

Final selection of target words: In language processing, context plays an important role in understanding different properties of a word. For our study, we also attempt to use the language tags as features of the context words for a given target word. Our hypothesis here is that there should exist classes of words that have similar context features, and the likelihood of being borrowed in each class should be different. For example, when an English word is surrounded by mostly Hindi words it seems to be more likely borrowed. We present two examples in the box below to illustrate this.

Example I:
@****** Welcome. Film jaroor dekhna. Nahi to injection ready hai.
Example II:
@***** Trust @***** and Kuch to ache se karo sirji.... Har jagah bhaagte rehna is not a good thing.

In Example I the English word "film" is surrounded by mostly Hindi words. On the other hand, in Example II the English word "thing" is surrounded mostly by English words. Note that the word "film" is very commonly used by Hindi monolingual speakers and is therefore highly likely to have been borrowed, unlike the English word "thing" which is arguably an instance of mixing. This socio-linguistic difference seems to be very appropriately captured by the language tags of the surrounding words of these two words in the respective tweets. Based on this hypothesis we arrange the 230 words into contextually similar groups (see supplementary material for the grouping details). Finally, using the baseline metric log(F_E/F_H) (E: English, H: Hindi), we proportionately choose words from these groups as follows:

Words with very high or very low values of log(F_E/F_H) (hlws) – We select words having the highest and the lowest values of log(F_E/F_H) from each of the context groups. This constitutes a set of 30 words. Note that these words are baseline-biased and therefore the baseline metric should be able to discriminate them well.

Words with medium values of log(F_E/F_H) (mws) – We selected, uniformly at random, 27 words having neither very high nor very low values of log(F_E/F_H).

Full set of words (full) – Thus, in total we selected 57 target words for the purpose of our evaluation. We present these words below.

Baseline-biased words – 'thing', 'way', 'woman', 'press', 'wrong', 'well', 'matter', 'reason', 'question', 'guy', 'moment', 'week', 'luck', 'president', 'body', 'job', 'car', 'god', 'gift', 'status', 'university', 'lyrics', 'road', 'politics', 'parliament', 'review', 'scene', 'seat', 'film', 'degree'

Randomly selected words – 'people', 'play', 'house', 'service', 'rest', 'boy', 'month', 'money', 'cool', 'development', 'group', 'friend', 'day', 'performance', 'school', 'blue', 'room', 'interview', 'share', 'request', 'traffic', 'college', 'star', 'class', 'superstar', 'petrol', 'uncle'

4.3 Evaluation criteria

We present a four step approach for evaluation as follows. We measure (i) how well the UUR, UTR and UPR based rankings of the hlws set, the full set and the mws set correlate with the ground truth ranking (discussed in the next section) in comparison to the ranking given by the baseline metric, (ii) how well the different rank ranges obtained from our metric align with the ground truth as compared to the baseline metric, (iii) whether there are some systematic effects of the age group of the survey participants on the rank correspondence, and (iv) how our metrics, if computed from the tweets of users who (a) rarely mix languages, (b) almost always mix languages and (c) are in between (a) and (b), align with the ground truth.

Rank correlation: We measure the standard Spearman's rank correlation (ρ) (Zar, 1972) pairwise between the rank lists generated by (i) UUR, (ii) UTR, (iii) UPR, (iv) the baseline metric and the ground truth. We shall describe the next four measurements taking UUR as the running example. The same can be extended verbatim for the other two similar metrics.
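As a concrete illustration of this step, the following sketch computes ρ with SciPy's spearmanr; the word list and score values are made-up placeholders.

```python
# A minimal sketch of the rank-correlation step. The scores are hypothetical;
# scipy.stats.spearmanr ranks both lists internally (averaging ties) and
# returns Spearman's rho together with a p-value.
from scipy.stats import spearmanr

words = ["film", "school", "thing"]   # placeholder target words
lpf_scores = [41, 35, -20]            # hypothetical LPF ground-truth scores
uur_scores = [9.2, 7.5, 0.4]          # hypothetical UUR values

rho, p_value = spearmanr(lpf_scores, uur_scores)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```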

Rank ranges: We split each of the three rank lists (UUR, ground truth and baseline) into five different equal-sized ranges as follows – (i) surely borrowed (SB) containing the top 20% of the words from each list, (ii) likely borrowed (LB) containing the next 20% of the words from each list, (iii) borderline (BL) constituting the subsequent 20% of the words from each list, (iv) likely mixed (LM) comprising the next 20% of the words from each list and (v) surely mixed (SM) having the last 20% of the words from each rank list. Therefore, we have three sets of five buckets, one set each for the UUR, the ground truth and the baseline based rank lists.

Next we calculate the bucket-wise correspondence between (i) the UUR and the ground truth set and (ii) the baseline and the ground truth set in terms of standard precision and recall measures. For our purpose, we adapt these measures as follows (a sketch of this computation is given at the end of this section):

G: ground truth bucket set, B_b: baseline bucket set, U_b: UUR bucket set;
BS ∈ {B_b, U_b}, T (types of buckets) = {SB, LB, BL, LM, SM};
b_t = words in the type t bucket from BS, g_t = words in the type t bucket from G, t ∈ T;
tp_t (no. of true positives) = |b_t ∩ g_t|, fp_t (no. of false positives) = |b_t − g_t|, tn_t (no. of true negatives) = |g_t − b_t|;
Bucket-wise precision and recall are therefore: precision(b_t) = tp_t / (fp_t + tp_t); recall(b_t) = tp_t / (tn_t + tp_t).

For a given set, we obtain the overall macro precision (recall) by averaging the precision (recall) values over the five buckets. For a given set, we also obtain the overall micro precision by first adding the true positives across all the buckets and then normalizing by the sum of the true and the false positives over all the buckets. We take an equivalent approach for obtaining the micro recall.

Age group effect: Here we construct two ground truth rank lists, one using the responses of the participants with age below 30 (young population) and the other using the responses of the rest of the participants (elderly population). Next we repeat the above two evaluations considering each of the new ground truth rank lists.

Extent of language mixing: Here we divide all the 3577 users into three categories – (i) High (users who have more than 20% of their tweets as code-mixed), (ii) Mid (users who have 7–20% of their tweets as code-mixed), and (iii) Low (users who have less than 7% of their tweets as code-mixed). We create three UUR based rank lists, one for each of these three user categories, and respectively compare them with the ground truth rank list.

4.4 Ground truth preparation

Since it is very difficult to obtain a suitable ground truth to validate the effectiveness of our proposed ranking scheme, we launched an online survey to collect human judgments for each of the 57 target words.

Online survey: We conducted the online survey [2] among 58 volunteers, the majority of whom were either native language (Hindi) speakers or had very high proficiency in reading and writing in that language. The participants were selected from different age groups and different educational backgrounds. Every participant was asked to respond to a multiple choice question about each of the 57 target words. Therefore, for every single target word, 58 responses were gathered. The multiple choice question had the following three options and the participants were asked to select the one they preferred the most and found more natural – (i) a Hindi sentence with the target word as the only English word, (ii) the same Hindi sentence as in (i) but with the target word replaced by its Hindi translation and (iii) none of the above two options. There were no time restrictions imposed while gathering the responses, i.e., the volunteers theoretically had unlimited time to decide their responses for each target word.

[2] Survey portal: https://goo.gl/forms/L0kJm8BNMhRj0jA53

Language preference factor: For each target word, we compute a language preference factor (LPF) defined as (Count_En − Count_Hi), where Count_Hi refers to the number of survey participants who preferred the sentence containing the Hindi translation of the target word while Count_En refers to the number of survey participants who preferred the sentence containing the target word itself. A more positive value of LPF denotes higher usage of the target word as compared to its Hindi translation and therefore higher likeliness of the word being borrowed.

Ground truth rank list generation: We generate the ground truth rank list based on the LPF score of a target word. The word with the highest value of LPF appears at the top of the ground truth rank list, and so on in that order. Tie breaking between target words having equal LPF values is done by assigning the average rank to each of these words (a sketch appears below).

Age group based rank list: As discussed in the previous section, we prepare the age group based rank lists by first splitting the responses of the survey participants into two groups based on their age – (i) young population (age < 30) and (ii) elderly population (age ≥ 30). For each group we then construct a separate LPF based ranking of the target words.
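The following minimal sketch turns hypothetical survey counts into the LPF-based ground-truth rank list with average-rank tie breaking, as described above; the words and vote counts are placeholders.

```python
# A minimal sketch of LPF-based ground-truth ranking. `votes` maps each
# target word to hypothetical (Count_En, Count_Hi) survey counts.
from scipy.stats import rankdata

votes = {"film": (45, 10), "school": (40, 15), "thing": (12, 40)}

words = list(votes)
lpf = [votes[w][0] - votes[w][1] for w in words]   # LPF = Count_En - Count_Hi
# Higher LPF => higher borrowing likeliness => better (smaller) rank;
# tied LPF values receive the average of their rank positions.
ranks = rankdata([-x for x in lpf], method="average")
ground_truth = sorted(zip(words, ranks), key=lambda pair: pair[1])
print(ground_truth)   # [('film', 1.0), ('school', 2.0), ('thing', 3.0)]
```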

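Similarly, here is a small sketch of the bucket-wise and overall precision/recall measures defined in Sec 4.3. It keeps the paper's set-based definitions (including the naming of |g_t − b_t| as tn_t); the bucket dictionaries are hypothetical.

```python
# A minimal sketch of the bucket-wise and macro/micro measures of Sec 4.3.
# `metric_buckets` and `truth_buckets` are hypothetical dicts mapping each
# bucket type to the set of words that the metric (resp. the ground truth)
# places in that bucket.
BUCKET_TYPES = ["SB", "LB", "BL", "LM", "SM"]


def bucket_measures(metric_buckets, truth_buckets):
    per_bucket = {}
    tp_sum = fp_sum = tn_sum = 0
    for t in BUCKET_TYPES:
        b, g = metric_buckets[t], truth_buckets[t]
        tp, fp, tn = len(b & g), len(b - g), len(g - b)
        per_bucket[t] = (tp / (fp + tp), tp / (tn + tp))  # (precision, recall)
        tp_sum, fp_sum, tn_sum = tp_sum + tp, fp_sum + fp, tn_sum + tn
    macro_p = sum(p for p, _ in per_bucket.values()) / len(BUCKET_TYPES)
    macro_r = sum(r for _, r in per_bucket.values()) / len(BUCKET_TYPES)
    micro_p = tp_sum / (tp_sum + fp_sum)
    micro_r = tp_sum / (tp_sum + tn_sum)
    return per_bucket, (macro_p, macro_r), (micro_p, micro_r)
```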
5 Results

5.1 Correlation among rank lists

The Spearman's rank correlation coefficients (ρ) of the rank lists for the hlws set, the mws set and the full set according to the baseline metric, UUR, UTR and UPR, with respect to the ground truth metric LPF, are noted in Table 1. We observe that for the full set, the ρ between the rank lists obtained from all the three metrics UUR, UTR and UPR with respect to the ground truth is ∼0.62, which is more than double the ρ (∼0.26) between the baseline and the ground truth rank lists. This clearly shows that the proposed metrics are able to identify the likeliness of borrowing quite accurately and far better than the baseline. Further, a remarkable observation is that our metrics outperform the baseline metric even for the hlws set that is baseline-biased. Likewise, for the mws set, our metrics outperform the baseline, indicating a superior recall on arbitrary words. The ranking of the full set of words obtained from the ground truth, the baseline and the UUR metric is available in the supplementary material.

Rank-List1   Rank-List2     ρ hlws   ρ mws   ρ full
UUR          Ground truth   0.67     0.64    0.62
UTR          Ground truth   0.66     0.63    0.63
UPR          Ground truth   0.66     0.64    0.62
Baseline     Ground truth   0.49     0.14    0.26

Table 1: Spearman's rank correlation coefficient (ρ) among the different rank lists. Best result is marked in bold.

We present the subsequent results for the full set and the UUR metric. The results obtained for the other two metrics UTR and UPR are very similar and therefore not shown.

5.2 Rank list alignment across rank ranges

The numbers of target words falling in each bucket across the three rank lists are the same and are noted in Table 2. Thus, the precision and recall, as per the definition, are also the same. The bucket-wise precision/recall for the baseline and UUR with respect to the ground truth are noted in Table 3. We observe that while in the SB bucket both the baseline and UUR perform equally well, for all the other buckets UUR massively outperforms the baseline. This implies that for the case where the likeliness of borrowing is the strongest, the baseline does as good as UUR. However, as one moves down the rank list, UUR turns out to be a considerably better predictor than the baseline. The overall macro and micro precision/recall shown in Table 4 further strengthen our observation that UUR is a better metric than the baseline.

Bucket type   Ground truth bucket   Baseline bucket   UUR bucket
SB            11                    11                11
LB            11                    11                11
BL            12                    12                12
LM            11                    11                11
SM            12                    12                12

Table 2: Number of words falling in each bucket of the three bucket sets.

Bucket type   prec./rec. Baseline   prec./rec. UUR
SB            0.27                  0.27
LB            0.09                  0.18
BL            0.08                  0.33
LM            0.18                  0.36
SM            0.33                  0.50

Table 3: Bucket-wise precision/recall. Best results are marked in bold.

Measure            Baseline   UUR
Macro prec./rec.   0.19       0.33
Micro prec./rec.   0.19       0.33

Table 4: Overall macro and micro precision/recall. Best results are marked in bold.

5.3 Age group based analysis

As already discussed earlier, we split the ground truth responses based on the age group of the survey participants. We split the responses into two groups – (i) young population (age < 30) and (ii) elderly population (age ≥ 30) – so that there are almost equal numbers of responses in both the groups (see supplementary material for the exact distribution).

Rank correlation: The Spearman's rank correlations of the UUR and the baseline rank lists with these two ground truth rank lists are shown in Table 5.

Rank-List1   Rank-List2           ρ
Baseline     Ground-truth-Young   0.26
UUR          Ground-truth-Young   0.62
Baseline     Ground-truth-Elder   0.27
UUR          Ground-truth-Elder   0.53

Table 5: Spearman's rank correlation across the two age groups. Best results are marked in bold.

Interestingly, the correlation between the UUR rank list and the young population ground truth is better than with the elderly population ground truth. This possibly indicates that UUR is able to predict recent borrowings more accurately. However, note that the UUR rank list has a much higher correlation with both the ground truth rank lists as compared to the baseline rank list.

Rank ranges: Table 6 shows the bucket-wise precision and recall for the UUR and the baseline metrics with respect to the two new ground truths. For the young population, once again the number of words in each bucket for all the three sets is the same, thus making the values of the precision and the recall the same. In fact, the precision/recall for this ground truth is exactly the same as in the case of the original ground truth.

Bucket type   Young-Baseline p/r   Young-UUR p/r   Elder-Baseline p   Elder-UUR p   Elder-Baseline r   Elder-UUR r
SB            0.27                 0.27            0.27               0.36          0.25               0.33
LB            0.09                 0.18            0.09               0.18          0.08               0.17
BL            0.08                 0.33            0.16               0.08          0.28               0.14
LM            0.18                 0.36            0.18               0.45          0.14               0.35
SM            0.33                 0.50            0.41               0.25          0.41               0.25

Table 6: Bucket-wise precision (p)/recall (r) for the UUR and the baseline metrics for the two new ground truths. Best results are marked in bold.

In contrast, when we consider the ground truth based on the responses of the elderly population, the numbers of words across the different buckets are different across the three sets. In this case, we observe that the precision/recall values are better for the UUR metric in the SB, LB and LM buckets.

Finally, the overall macro and micro precision and recall for both the age groups are noted in Table 7. Once again, for both the young and the elderly population based ground truths, the macro and micro precision and recall values for the UUR metric are higher compared to those of the baseline.

Measure          Young-Baseline   Young-UUR   Elder-Baseline   Elder-UUR
Mac. pre.        0.19             0.33        0.22             0.27
Mac. rec.        0.19             0.33        0.23             0.25
Mic. pre./rec.   0.19             0.33        0.23             0.26

Table 7: Overall macro and micro precision and recall for the two new ground truths. Best results are marked in bold.

5.4 Extent of language mixing

As mentioned earlier, we divide the set of 3577 users into three categories. The Spearman's correlation between UUR and the ground truth for each of these buckets is given in Table 8. As we can see, for the Low bucket the ρ value is the maximum. This points to the fact that the signal of borrowing is strongest from the users who rarely mix languages.

Bucket   Number of users   ρ
High     302               0.52
Mid      744               0.60
Low      2531              0.65

Table 8: Spearman's correlation between UUR and the ground truth in the different user buckets. Best results are marked in bold.

6 Re-annotation results

In order to conduct the re-annotation experiments we performed the following. To begin with, we ranked all the 230 English nouns in non-increasing order of their UUR values. We then randomly selected 20 words each having (i) high UUR (top 20%) values (called TOP), (ii) low UUR (bottom 20%) values (called BOT), and (iii) middle UUR (middle 20%) values (called MID). This makes a total of 60 words. Using this set, we extracted one tweet for each word that contained the (foreign) word along with all other words in the tweet tagged in Hindi (H_all). We similarly prepared another such list of 60 words and extracted one tweet for each word in which most of the other words were tagged in Hindi (H_most).

We presented the selected words and the corresponding tweets to a set of annotators and asked them to annotate these selected words once again. Over all the words, we calculated the mean (μ_E→H) and the standard deviation (σ_E→H) of the fraction of cases where the annotators altered the tag of the selected word from English to Hindi. The average inter-annotator agreement (Fleiss, 1971) for our experiments was found to be as high as 0.64. For the words in the TOP list, the fraction of times the tag was altered is 0.91 (0.85), with an inter-annotator agreement of 0.84 (0.80), for the H_all (H_most) category.

Word-rank   Context   μ_E→H   σ_E→H
TOP         H_all     0.91    0.15
TOP         H_most    0.85    0.23
MID         H_all     0.58    0.28
MID         H_most    0.61    0.34
BOT         H_all     0.13    0.18
BOT         H_most    0.16    0.21

Table 9: Re-annotation results.

In other words, on average, in as high as 88% of the cases the annotators altered the tags of the words that are highly likely to be borrowed (i.e., TOP) in a largely Hindi context (i.e., H_all or H_most). Table 9 shows the fractional changes for all the other possible cases. An interesting observation is that the annotators rarely flipped the tags for the words in the BOT list (i.e., the sure mixing cases) in either of the H_all or the H_most contexts. These results strongly support the inclusion of our metric in the design of future automatic language tagging tasks.

7 Discussion and conclusion

In this paper, we proposed a few new metrics for estimating the likeliness of borrowing that rely on signals from large scale social media data. Our best metric is two-fold better than the previous metric in terms of the accuracy of the prediction. There are some interesting linguistic aspects of borrowing, as well as certain assumptions regarding the social media users, that have important repercussions on this work and its potential extensions; these are discussed in this section.

Types of borrowing: Linguists broadly define three forms of borrowing: (i) cultural, (ii) core, and (iii) therapeutic borrowing. In cultural borrowing, a foreign word gets borrowed into the native language to fill a lexical gap, because there is no equivalent native language word present to represent the same foreign word concept. For instance, the English word 'computer' has been borrowed into many Indian languages since it does not have a corresponding term in those languages [3]. In core borrowing, on the other hand, a foreign word replaces its native language translation in the native language vocabulary. This occurs due to the overwhelming use of the foreign word over the native language translation as a matter of prestige, ease of use etc. For example, the English word 'school' has become much more prevalent than its corresponding Hindi translation 'vidyalaya' among native Hindi speakers. Finally, therapeutic borrowing refers to the borrowing of words to avoid taboo and homonymy in the native language. In this paper, although we did not perform any category based studies, most of our focus was on core borrowing.

Language of social media users: We assumed that if a user is predominantly using Hindi words in a tweet then the chances of him/her being a native Hindi speaker should be high since, while the number of English native speakers in India is 0.02%, the number of Hindi native speakers is 41.03% [4]. This assumption has also been made in earlier studies (Rudra et al., 2016). Note that even if a user is not a native Hindi speaker but a proficient (or semi-proficient) Hindi speaker, the main results of our analysis should hold. For instance, consider two foreign words 'a' and 'b'. If 'a' is frequently borrowed in the native language, then the proficient speaker would also tend to borrow 'a' similar to a native speaker. Even if, due to a lack of adequate native vocabulary, the non-native speaker borrows the word 'b' in some cases, these spurious signals should get eliminated since we are computing aggregate level statistics over a large population.

Future directions: It would be interesting to understand and develop theoretical justification for the metrics. Further, it would be useful to study and classify various other linguistic phenomena closely related to core borrowing, such as: (i) loanwords, where the form of a foreign word and its meaning, or one component of its meaning, gets borrowed, (ii) calques, where a foreign word or idiom is translated into existing words of the native language, and (iii) semantic loans, where the word in the native language already exists but an additional meaning is borrowed from another language and added to the existing meaning of the word.

Finally, we would also like to incorporate our findings into other standard tasks of multilingual IR and multilingual speech synthesis (for example, to render the appropriate native accent to the borrowed word).

[3] Words for "computer" were coined in many Indian languages through morphological derivation from the term for "compute"; however, none of these words are used in either formal or informal contexts.
[4] http://bit.ly/2ufOvea

References

P. Auer. 1984. Bilingual Conversation. John Benjamins.

K. Bali, J. Sharma, M. Choudhury, and Y. Vyas. 2014. "I am borrowing ya mixing?" An analysis of English-Hindi code mixing in Facebook. In First Workshop on Computational Approaches to Code-Switching, EMNLP, page 116.

David Bamman and Gregory Crane. 2011. Measuring historical word sense variation. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, pages 1–10. ACM.

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. EMNLP 2014, 13.

Z. Bock. 2013. Cyber socialising: Emerging genres and registers of intimacy among young South African students. Language Matters: Studies in the Languages of Africa, 44(2):68–91.

S. Carter, W. Weerkamp, and M. Tsagkias. 2013. Microblog language identification: Overcoming the limitations of short, unedited and idiomatic text. Language Resources and Evaluation, 47(1):195–215.

Paul Cook, Jey Han Lau, Diana McCarthy, and Timothy Baldwin. 2014. Novel word-sense identification. In COLING, pages 1624–1635.

A. Das and B. Gambäck. 2013. Code-mixing in social media text: The last language identification frontier? TAL, 54(3):41–64.

A. Das and B. Gambäck. 2014. Identifying languages at the word level in code-mixed Indian social media text. In ICON, pages 169–178.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Lea Frermann and Mirella Lapata. 2016. A Bayesian model of diachronic meaning change. Transactions of the Association for Computational Linguistics, 4:31–45.

S. Gella, J. Sharma, and K. Bali. 2013. Query word labeling and back transliteration for Indian languages: Shared task system description. FIRE Working Notes, 3.

M. Hadei. 2016. Single word insertions as code-switching or established borrowing? International Journal of Linguistics, 8(1):14–25.

T. Hidayat. 2012. An analysis of code switching used by facebookers (a case study in a social network site). BA thesis, English Education Study Program, College of Teaching and Education (STKIP), Bandung, Indonesia.

David Jurgens, Yulia Tsvetkov, and Dan Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. ACL 2017.

D. C. S. Li. 2009. Cantonese-English code-switching research in Hong Kong: a Y2K review. World Englishes, 19(3):305–322.

Sunny Mitra, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That's sick dude!: Automatic identification of word sense change across different timescales. arXiv preprint arXiv:1405.4392.

P. Muysken. 1996. Code-switching and grammatical theory. In One Speaker, Two Languages: Cross-disciplinary Perspectives on Code-switching, pages 177–198. Cambridge University Press.

C. Myers-Scotton. 2002. Contact Linguistics: Bilingual Encounters and Grammatical Outcomes. Oxford University Press.

R. G. Negrón. 2009. Spanish-English code-switching in email communication. Language@Internet, 6(3).

V. E. Nzai, Y.-L. Feng, M. R. Medina-Jiménez, and J. Ekiaka-Oblazamengo. 2014. Understanding code-switching & word borrowing from a pluralistic approach.

Yves Peirsman, Dirk Geeraerts, and Dirk Speelman. 2010. The automatic identification of lexical variation between language varieties. Natural Language Engineering, 16(4):469–491.

Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, and Chandra Shekhar Maddila. 2017. Estimating code-switching on Twitter with a novel generalized word-level language detection technique. ACL 2017.

Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. 2016. Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1131–1141.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and phonetic space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 104–111. Association for Computational Linguistics.

H. K. San. 2009. Chinese-English code-switching in blogs by Macao young people. MSc thesis, University of Edinburgh.

D. Sankoff, S. Poplack, and S. Vanniarajan. 1990. The case of the nonce loan in Tamil. Language Variation and Change, 2(01):71–101.

R. Y. Sebonde. 2014. Code-switching or lexical borrowing: Numerals in Chasu language of rural Tanzania. Journal of Arts and Humanities, 3(3):67.

C. D. W. Senaratne. 2013. Borrowings or code mixes: The presence of lone English nouns in mixed discourse.

T. Solorio, E. Blair, S. Maharjan, S. Bethard, M. Diab, M. Gohneim, A. Hawwari, F. AlGhamdi, J. Hirschberg, A. Chang, and P. Fung. 2014. Overview for the first shared task on language identification in code-switched data. In EMNLP First Workshop on Computational Approaches to Code Switching, pages 62–72.

T. Solorio and Y. Liu. 2008. Part-of-speech tagging for English-Spanish code-switched text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1051–1060. Association for Computational Linguistics.

S. Sotillo. 2012. Ehhhh utede hacen plane sin mi???:@ im feeling left out :( form, function and type of code switching in SMS texting. In ICAME, pages 309–310.

S. G. Thomason. 2003. Contact as a source of language change. The Handbook of Historical Linguistics, pages 687–712.

Y. Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In EMNLP, volume 14, pages 974–979.

J. H. Zar. 1972. Significance testing of the Spearman's rank correlation coefficient. Journal of the American Statistical Association, 67(339):578–580.
