Annotation Efficient Language Identification from Weak Labels

Shriphani Palakodety∗ (Onai) and Ashiqur R. KhudaBukhsh∗ (Carnegie Mellon University)
[email protected] [email protected]

Abstract

India is home to several languages with more than 30m speakers. These languages exhibit significant presence on social media platforms. However, several of these widely-used languages are under-addressed by current Natural Language Processing (NLP) models and resources. User-generated social media content in these languages is also typically authored in the Roman script as opposed to the traditional native script, further contributing to resource scarcity. In this paper, we leverage a minimally supervised NLP technique to obtain weak language labels from a large-scale Indian social media corpus, leading to a robust and annotation-efficient language-identification technique spanning nine Romanized Indian languages. In fast-spreading pandemic situations such as the current COVID-19 situation, information processing objectives might be heavily tilted towards under-served languages in densely populated regions. We release our models to facilitate downstream analyses in these low-resource languages.1 Experiments across multiple social media corpora demonstrate the model's robustness and provide several interesting insights on Indian language usage patterns on social media. We release an annotated data set of 1,000 comments in ten Romanized languages as a social media evaluation benchmark.1

1 Introduction

Much of the current NLP research focuses on a handful of world languages (e.g., English, French, Spanish). These languages enjoy substantially larger computational linguistic resources than their low-resource counterparts (e.g., Bengali, Odia). However, in the midst of global-scale events like the ongoing COVID-19 pandemic, demand for linguistic resources might get recalibrated; information processing objectives might be heavily tilted towards under-served languages that are prevalent in many densely populated regions.

In this paper, we focus on language identification in noisy, social media settings, a basic and highly critical linguistic resource prerequisite for downstream analysis in a multilingual environment. Our solution extends support for nine major Indian languages (see Table 1) spanning the native tongues of 85% of India's population (Census, 2011). These under-resourced languages are heavily used in several densely populated travel hubs and on social media. User-generated web content in these languages is typically authored in the Roman script as opposed to the traditional native script, leading to scarcer linguistic resources (Virga and Khudanpur, 2003; Choudhury et al., 2010; Gella et al., 2014; Barman et al., 2014; Palakodety et al., 2020a). Existing large-scale language identification tools prioritize the languages' native scripts (e.g., FastText; Google) over the Romanized variants. Our solution focuses on these Romanized variants and is integrated with a widely used existing language-identification system (FastText) supporting 355 languages. We release our open-source language identification system1 to facilitate Indian social media analysis.

Annotator availability is a major concern that may constrain data acquisition efforts in low-resource settings (Joshi et al., 2019). Our proposed solution is extremely annotation efficient; it utilizes a recent result (Palakodety et al., 2020a) to automatically group a multilingual corpus into largely monolingual clusters that can be extracted with minimal supervision. Using a mere 260 annotated short documents (YouTube video comments), we

∗ Shriphani Palakodety and Ashiqur R. KhudaBukhsh are equal-contribution first authors. Ashiqur R. KhudaBukhsh is the corresponding author.
1 Resources are available at: https://www.cs.cmu.edu/~akhudabu/IndicLanguage.html.

Proceedings of the 2020 EMNLP Workshop W-NUT: The Sixth Workshop on Noisy User-generated Text, pages 181–192, Online, Nov 19, 2020. © 2020 Association for Computational Linguistics

assign weak labels to a data set of 2.8 million comments spanning the aforementioned languages. Our model performs favorably when compared against an existing commercial solution.

While census data and surveys can provide useful information about linguistic diversity and spread, analyses of user-generated multilingual corpora can complement these surveys with additional useful insights of their own. We conduct a focused analysis to explore if the (estimated) distribution of web usage of Hindi across different Indian states aligns with common knowledge. Our analysis indicates that Hindi's web presence is considerably higher in a cluster of North Indian states referred to as the Hindi belt (Jaffrelot, 2000) than in the South Indian states. We further analyze similar research questions concerning the relative usage of the Roman script and the native script for Hindi. We finally conclude with a small exploratory study on our method's effectiveness in detecting languages with trace presence in multiple corpora and outline some of the possible utilities.

Contributions: The main contributions of this paper are the following:
• Resource: We release an important linguistic resource to detect nine heavily-spoken Indic languages expressed in the Roman script.
• Method: We propose an annotation-efficient method to construct this language identifier and demonstrate extensibility.
• Linguistic: We conduct a web-scale analysis of Hindi usage shedding light on multilinguality, geographic spread, and usage patterns.
• Social: We outline how our tool can detect trace presence of other languages, which can aid in constructing data sets for humanitarian challenges.

Why YouTube? As of January 2020, YouTube is the second-most popular social media platform in the world, drawing 2 billion active users (Statista, 2020). YouTube is the most popular social media platform in India with 265 million monthly active users (225 million on mobile), accounting for 80% of the population with internet access (HindustanTimes, 2019; YourStory, 2018). YouTube video comments have been used as data sources to analyze recent important events (Palakodety et al., 2020a,c; Sarkar et al., 2020; Cinelli et al., 2020).

Language     ISO code    First language speakers
Bengali      bn           8.03%
Gujarati     gu           4.58%
Hindi        hi          43.63%
Kannada      kn           3.61%
Marathi      mr           6.86%
Malayalam    ml           2.88%
Odia         or           3.10%
Tamil        ta           5.70%
Telugu       te           6.70%

Table 1: List of languages we considered with their corresponding ISO 639-1 codes and first language speakers as a percentage of the Indian population. Data is collected from the 2011 census (Census, 2011).

2 Data Set: YouTube Video Comments

In order to construct our language identification system, we require a web-scale Indian social media data set that (i) has considerable presence of the nine languages we are interested in, and (ii) captures a representative fraction of the Indian web users. To achieve this two-fold goal, we consider a data set introduced in Palakodety et al. (2020b) to analyze the 2019 Indian General Election. The data set consists of comments on YouTube videos hosted by popular news outlets in India. Overall, the corpus consists of 6,182,868 comments on 130,067 videos by 1,518,077 users posted in a 100-day period leading up to the 2019 Indian General Election.

Why this data set? The data set considers two highly popular YouTube news channels for each of the 12 Indian states that contribute 20 or more seats in the lower house of the parliament. State boundaries in India were drawn along linguistic lines (Dewen, 2010). The dominant regional language in the Hindi belt (Jaffrelot, 2000) is Hindi, and the other states each feature a unique dominant language written in either the Latin alphabet (in informal settings) or a native script. All nine languages we focus on (listed in Table 1) are the dominant language in one or more of these 12 states. The regional news networks considered provide coverage in the dominant regional language. Hence, the data set exhibits strong presence of all the nine regional languages we are interested in. In addition to these 24 regional news channels, the data set considers YouTube channels for 14 highly popular national news outlets (listed in the Appendix). Overall, this implies 38 YouTube channels (24 regional, 14 national) with an average subscriber count of 3,338,628.

3 Related Work

Learning from weak labels: The role of unlabeled and weakly (or noisily) labeled data in supervised learning is a well-studied problem and has received sustained focus (Mitchell, 2004; Donmez et al., 2010), and annotation efficiency in low-resource settings is a well-established requirement (Joshi et al., 2019). Our work leverages L̂polyglot (Palakodety et al., 2020a), a recently-proposed method for noisy language identification that requires minimal supervision. We utilize it as a dependency to obtain weak labels, reduce annotation burden, and construct a substantially more robust system.

Language identification: While language identification of well-formed text is a nearly-solved problem, the difficulty of identifying language in a noisy social media setting is well-established (Bergsma et al., 2012; Gella et al., 2014; Lui and Baldwin, 2014; Jaech et al., 2016; Jauhiainen et al., 2019). We see our work as a part of this continuing trend and as an important resource contribution for analyzing Indian social media.

Romanized Indian languages: In the context of processing Indian languages expressed on the web, challenges posed by the use of the Roman script instead of the native script have been reported in several recent studies in the context of code-mixed English-Bengali (Chanda et al., 2016) and English-Hindi (Kumar et al., 2018) text. While addressing word-level language identification, Gella et al. (2014) reported that 90% of posts in Indian languages on Facebook are expressed in the Roman script. Prevalence of Romanized Hindi has also been previously reported in Barman et al. (2014). Our study takes previous findings one small step forward with a (noisy) geographical analysis of Hindi web usage.

Bridging the resource gap: We also see our work as a part of the ongoing effort in bridging the resource gap between Indian languages and world languages (Vyas et al., 2014; Vijayakrishna and Sobha, 2008; Kunchukuttan et al., 2014; Mohanty et al., 2017; Joshi et al., 2020).

4 Background

In this section, we summarize a few key NLP models and results critical to our methods.

Skip-gram embeddings: The Skip-gram model takes as input a word w ∈ W (the vocabulary) and predicts words wc ∈ W that are likely to occur in the context of w. The training objective (predicting an input word's context) is parameterized by real-valued word representations or embeddings (Mikolov et al., 2013). Bojanowski et al. (2017) introduced sub-word extensions to the Skip-gram model to learn robust word representations even in the presence of misspellings or spelling variations. Following Palakodety et al. (2020a), we normalize and average a document's constituent word embeddings to yield the document embedding.

Monolingual cluster discovery: Palakodety et al. (2020a) introduced a minimal-supervision language detection method using polyglot Skip-gram embeddings with sub-word information. These embeddings discover monolingual subsets (clusters) in a multilingual corpus, which are subsequently retrieved using k-means, and a small sample per cluster (10 documents) is annotated. We refer to this method as L̂polyglot and leverage it for constructing our data set with minimal annotation burden. For obvious reasons, we do not compare L̂polyglot against our supervised solution that supports more than 300 languages. In Section 7.5, we demonstrate that our method detects languages with trace presence in a corpus (< 1%), a known limitation of L̂polyglot (Palakodety et al., 2020a).

5 Method

Research question: How do we construct an annotation-efficient language identification method supporting a wide array of Indian languages?

Our method has two main components: (i) an annotation-efficient procedure to construct a substantial data set with weak labels, and (ii) a supervised system trained on a data set comprising this corpus and an existing data set, Dtatoeba (Tatoeba, 2020), a well-known annotated data set supporting 355 languages. For the construction of this data set, the election corpus (Palakodety et al., 2020b) is stripped of all comments containing any non-English character. This maintains the focus on Romanized Indian languages, with the native variants sourced from Dtatoeba.

5.1 Assigning Weak Labels

Algorithm 1 outlines the steps for obtaining weak labels for nine Indian languages from multiple multilingual corpora and combining them with Dtatoeba. Our training data set is denoted by D = {di, L(di)}_{i=1}^{N1} ∪ {di, L̂(di)}_{i=1}^{N2}, where di is a document, L(·) returns a label annotated by a human, and L̂(·) returns a weak label obtained from L̂polyglot.

Algorithm 1: FweakLabel({D1, ..., Dn})
  Initialization: D ← Dtatoeba
  foreach Di ∈ {D1, ..., Dn} do
    Run L̂polyglot on Di
    Obtain clusters C1, ..., CK using k-means s.t. |C1| ≥ |C2| ≥ ... ≥ |CK|
    Identify J ≤ K dominant clusters
    for (j = 1; j ≤ J; j = j + 1) do
      Assign language to Cj (denoted by L(Cj), supplied by the annotator)
      Sample γ|Cj| comments from Cj ranked by proximity to the cluster center, 0 < γ ≤ 1
      Add the sampled comments to D with weak label L(Cj)
    end
  end
  Output: Return D

D is initialized with an annotated corpus, Dtatoeba (Tatoeba, 2020), i.e., N1 is the total number of samples present in Dtatoeba. Next, using L̂polyglot, our method obtains weak labels for N2 documents from the election corpus and adds them to D.

Our method, FweakLabel(·), takes an array of n multilingual corpora as input and runs L̂polyglot on each of them to obtain K language clusters, C1, ..., CK, such that |C1| ≥ |C2| ≥ ... ≥ |CK|. The J largest of these clusters are selected and annotators assign a language L(Cj), 1 ≤ j ≤ J. For a given cluster Cj, we obtain a set of pairs ⟨d, L̂(d)⟩ where d ∈ Cj and L̂(d) = L(Cj), i.e., the weak label of the document is the cluster's language label. For each of these J clusters, the top γ fraction is chosen for inclusion into D.

To summarize, for each multilingual corpus, L̂polyglot is used to obtain the top monolingual clusters, and a fraction of those are included with the cluster language label into the data set. Each cluster's language label is assigned by labeling 10 documents in the cluster; thus the vast majority of samples added to the data set is neither manually inspected nor labeled.

Recall that each of the regional news outlets we considered presents news in one of the dominant regional languages. We group all comments obtained from the news outlets of one particular state as one distinct corpus, i.e., each Di consists of comments posted in response to videos from a news outlet from a particular state. The choice of treating each individual state's corpus separately contributes further to annotation efficiency: knowing that a corpus is sourced from a region where a certain language is dominant allows us to select the appropriate annotators and reduces the annotation cost per document. We considered comments obtained from the 14 national outlets as a separate corpus. This led to 13 multilingual corpora (12 regional and 1 national). Hence, in our experiments, n, denoting the total number of corpora in Algorithm 1, was set to 13.

Parameter configuration: Our algorithm has two configurable parameters: (1) J, the number of clusters selected per corpus for inclusion in the final data set, and (2) γ, the fraction of documents per cluster chosen for inclusion. We set J to 2. The choice of J was guided by the intuition that English is widely spoken in India and each state would have at least one dominant regional language. Our choice of γ was guided by an in-depth analysis of L̂polyglot in the context of code switching (KhudaBukhsh et al., 2020). The study revealed that documents closest to the cluster centers exhibit strong monolinguality, while those farther from the centers can exhibit code-switching or may even be authored in languages with trace presence. In order to obtain high-quality weak labels, we set γ to 0.75.

Annotation efficiency: L̂polyglot requires 10 annotated samples to assign a language label to a cluster (Palakodety et al., 2020a). Our method requires 10Jn annotated samples. Hence, our method required 10 × 2 × 13 = 260 annotated comments to construct a corpus of 2,793,375 comments supporting nine Indian languages. This is combined with Dtatoeba to yield D consisting of 11,042,839 documents.
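The cluster-then-sample loop of Algorithm 1 can be sketched in a few lines. The sketch below is an illustrative reconstruction under stated assumptions, not the released implementation: the function names are ours, a plain Lloyd's k-means stands in for the L̂polyglot machinery, and the document embeddings are synthetic stand-ins for the polyglot sub-word Skip-gram vectors.

```python
import numpy as np

def embed_documents(word_vecs_per_doc):
    """Document embedding as in Section 4: normalize each word vector to
    unit length, then average them (stand-in for polyglot Skip-gram)."""
    out = []
    for vecs in word_vecs_per_doc:
        unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        out.append(unit.mean(axis=0))
    return np.vstack(out)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with farthest-point initialization; any
    standard k-means implementation would do here."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def weak_label_corpus(doc_embs, k=5, top_j=2, gamma=0.75):
    """One pass of the FweakLabel loop for a single corpus: cluster the
    document embeddings, keep the top_j largest clusters, and per cluster
    keep the gamma fraction of documents closest to the center. In the
    real pipeline an annotator labels ~10 near-center documents to name a
    cluster's language; here the cluster id stands in for that label."""
    labels, centers = kmeans(doc_embs, k)
    sizes = np.bincount(labels, minlength=k)
    dominant = np.argsort(-sizes)[:top_j]   # |C1| >= |C2| >= ... >= |CK|
    selected = {}
    for c in dominant:
        members = np.flatnonzero(labels == c)
        dists = np.linalg.norm(doc_embs[members] - centers[c], axis=1)
        ranked = members[np.argsort(dists)]
        selected[int(c)] = ranked[: int(gamma * len(members))]
    return selected
```

Documents near a cluster center are the ones most likely to be strongly monolingual, which is why the γ cut is taken by proximity rather than at random.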

              Predicted label
True    bn   en   gu   hi   kn   ml   mr   or   ta   te   ol
bn     100    0    0    0    0    0    0    0    0    0    0
en       0  100    0    0    0    0    0    0    0    0    0
gu       0    0  100    0    0    0    0    0    0    0    0
hi       0    0    0  100    0    0    0    0    0    0    0
kn       0    0    0    0   99    0    0    0    0    0    1
ml       0    0    0    0    0   99    0    0    1    0    0
mr       0    0    0    0    0    0  100    0    0    0    0
or       0    0    1    2    0    0    0   96    1    0    0
ta       0    0    0    0    0    0    0    0  100    0    0
te       0    0    0    0    0    0    0    0    0  100    0
ol       0    0    0    0    0    0    0    0    0    0    0

Table 2: Confusion matrix of performance evaluation of Fend-to-end on 1,000 annotated comments. For a given language, better or equal performance than the baseline is highlighted with blue; ol denotes other languages.

              Predicted label
True    bn   en   gu   hi   kn   ml   mr   or   ta   te   ol
bn      97    0    0    2    0    0    0    0    0    0    1
en       0   99    0    1    0    0    0    0    0    0    0
gu       0    0   92    4    0    0    4    0    0    0    0
hi       0    0    0   99    0    0    1    0    0    0    0
kn       1    1    2    0   86    5    0    0    0    3    2
ml       0    0    0    0    1   95    1    0    2    1    0
mr       0    0    1    0    0    0   99    0    0    0    0
or      20    0   13   18    2    3   16    0    0    4   24
ta       0    0    0    0    3   10    0    0   79    3    5
te       1    0    0    1    1    1    1    0    0   95    0
ol       0    0    0    0    0    0    0    0    0    0    0

Table 3: Confusion matrix of performance evaluation of GoogleLangID on 1,000 annotated comments. For a given language, better or equal performance than Fend-to-end is highlighted with blue; ol denotes other languages.
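Row-normalized matrices like the ones above can be produced from raw predictions with a few lines of bookkeeping. The helper below is our own illustration, not part of the released tooling; the label names in the usage example are the ISO codes from Table 1.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Confusion matrix in the style of Tables 2-3: rows are true
    languages, columns are predicted languages, and each row holds
    percentages of that true class."""
    index = {lab: i for i, lab in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1
    # Rows with no test documents (e.g., the `ol` row) stay all-zero.
    return 100.0 * counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
```

Diagonal entries are per-language accuracies; off-diagonal mass shows where a detector's errors concentrate, e.g., the widely scattered or row of Table 3.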

5.2 Learning with Weak Labels

Once D is obtained, we train a classifier that takes as input a document and predicts the language label. We provide an end-to-end model operating directly on the text and producing a language label. The model utilizes a highly efficient text classification framework introduced in Joulin et al. (2017). The framework introduces a variety of optimizations and is capable of classifying billions of documents in minutes without compromising on accuracy (implementation details are presented in the Appendix). We refer to this model as Fend-to-end. The model achieves strong performance (test accuracy > 98%) on a held-out set that is not seen during any of the training phases.

6 Experimental Setup

Test set: We construct an Indian language test set consisting of 1,000 annotated YouTube comments (consensus labels by two proficient annotators per language) in 10 languages (Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Tamil, Telugu), with 100 documents in each language. These documents are randomly sampled from the output of FweakLabel and are never seen during our supervised training phase with weak labels. The average number of tokens in the comments is 22.6 ± 18.7. Please see the Appendix for detailed statistics of our test data set.

Baseline: We consider a commercial solution (Google) that can detect over 100 languages, including the 10 mentioned above (referred to as GoogleLangID), as our baseline. In addition to GoogleLangID, Palakodety et al. (2020a) compared against FastTextLangID. We avoid this comparison because we feel it is unfair to compare against FastTextLangID, given that it does not support Romanized Indic languages (it achieves an overall accuracy of 10% on our test set). We see our paper as a resource paper that makes a small step forward in addressing the lack of linguistic resources for Indian social media analysis.

Fairness: We first emphasize that the main purpose of comparing against GoogleLangID is not to claim that our annotation-efficient solution is superior to GoogleLangID across the board. Our goal is rather to attract the research community's attention to our solution's effectiveness in this under-explored, specific domain of noisy social media texts generated in the Indian subcontinent. A fairer performance comparison between the two methods would require the methods to be trained on identical data sets with comparable computation budgets. Due to these varying levels of resources, it is not possible to claim one method's superiority over the other. We are rather highlighting our method's (1) annotation efficiency and (2) ability to extend support for newer languages (e.g., at present GoogleLangID has limited support for Odia (or) and no support for Assamese (as)).

7 Results and Analysis

Method          Overall accuracy    Excluding Odia
Fend-to-end     99.4                99.8
GoogleLangID    84.1                93.4

Table 4: Performance comparison. Since it is unclear if GoogleLangID supports Odia (or), the right-most column presents performance excluding Odia from the test set.

Table 4 summarizes the performance comparison between our proposed classifier and the GoogleLangID baseline. Our results indicate

that our method considerably outperforms the baseline, a well-known commercial solution. A closer look at the individual performance for each language (confusion matrices presented in Table 2 and Table 3) reveals that across all languages, our method performs equal to or better than GoogleLangID. Although the performance gap primarily stems from our method's overwhelmingly stronger performance on Odia (or), we perform considerably better than GoogleLangID even if we exclude Odia from the test data set.

Our performance comparison highlights the following two points. First, we reiterate that our goal is not to claim that our method would perform better than GoogleLangID across the board, but rather to demonstrate the value our method adds in processing noisy social media texts. It is possible that our method is more attuned to noisy short social media texts, while GoogleLangID could (possibly) be trained on cleaner corpora, which would explain our method's stronger performance. This is further corroborated by the fact that even our mispredictions (barring one) remain confined to the regional languages, while many of GoogleLangID's mispredictions are distributed across other languages (ol). Second, GoogleLangID's weak performance in detecting Odia highlights the gap in current solutions and shows how our method can effectively and efficiently address these issues.2

2 Odia is listed as one of the supported languages by GoogleLangID; it is unclear if this tool supports Romanized Odia. Assamese is not supported by GoogleLangID.

Recall that Dtatoeba contains a large set of languages (including the native script versions of the Indian languages considered in this paper). Test accuracy on a held-out set was 98.4%. Experiments reveal that introduction of the weakly labeled corpus does not impact performance on Dtatoeba (identical test accuracy of 98.4%).

7.1 Extensibility

We constructed a new data set of comments on YouTube videos from an Assamese news channel (News18 Assam/Northeast) and used the same L̂polyglot approach to obtain weak labels for Romanized Assamese. We do not show a direct comparison with GoogleLangID because GoogleLangID does not support Assamese (as). However, on an augmented test data set of 1,100 comments (100 Assamese comments with consensus labels from two annotators), we achieved a performance of 92% accuracy on identifying Assamese while retaining our previous performance on every other language. Assam has been a center for political debates and unrest in recent times (BBC). Our resource to detect Romanized Assamese can be a vital tool which, to the best of our knowledge, does not exist elsewhere. Details are presented in the Appendix.

7.2 Domain-robustness

Our goal is to present an important Indian NLP resource that can perform well across multiple social media platforms. Hence, it is paramount that our system generalizes well to both in-domain and out-of-domain instances. In the context of the language identification task, domain adaptation has received recent attention (Li et al., 2018). In this section, we present an analysis of our system's out-of-domain performance.

Data set of Hinglish tweets: We consider a data set of tweets introduced in Mathur et al. (2018). The data set consists of 3,189 tweets written in English (en), Romanized Hindi (hi), and code-mixed English-Hindi (en-hi). We construct a randomly sampled data set of 100 tweets with equal proportions of Hindi and English tweets (consensus labels obtained from two annotators).

Method          Accuracy    Language    P       R       F1
Fend-to-end     0.98        en          1.00    1.00    1.00
                            hi          0.96    0.96    0.96
GoogleLangID    0.95        en          1.00    1.00    1.00
                            hi          1.00    0.90    0.95

Table 5: Performance comparison on the tweet data set. Best metric is highlighted in bold for each language. P: precision, R: recall.

As shown in Table 5, our system's out-of-domain performance was consistent with its in-domain performance. We performed marginally better than GoogleLangID. We admit that a more robust test on multiple data sets comprising content from a larger set of Indian languages from other social media platforms would further validate our out-of-domain performance. However, our current experiment indicates that our system's success is not limited to YouTube comment texts; it can generalize to tweets as well.

7.3 Usage Statistics

As demonstrated, our integrated setup covers the Romanized and native script variants of the most prevalent Indian languages. This enables us to

explore research questions on the usage patterns of these Indian languages. This part of our analysis is conducted on the entire election corpus containing comments written in all scripts.

Weak geo-labels: Recall that all of the 24 regional news outlets we consider present news in the dominant language of their respective states. Hence, it is reasonable to assume that a considerable fraction of users consuming the regional news and participating in the comments section have some affiliation to the region (state). Thus, a comment posted in response to a regional news outlet's video can be assigned a weak/noisy geographic label: the state targeted by the news network. For instance, we assume that a comment posted in response to a Tamil news video clip is likely to be authored by someone who either resides in or retains strong ties to Tamil Nadu. Combining these weak/noisy geographic labels with our language identification system, we can assess the geographic distribution of language use in India. Note that these results are only approximate estimates; YouTube comments do not contain any geographic information. Further, it is also not possible to estimate a user's knowledge of other languages through our analysis. For example, if a user comments solely in Hindi, it is not possible to assess their fluency in English or Bengali.

7.4 Hindi Web Usage

Geographic extent of Hindi usage: We label each comment with the language prediction from Fend-to-end and a geographic label corresponding to the origin state of the news outlet. All comments posted in Romanized and Devanagari Hindi (denoted by hi and hiN, respectively) are retained, and the resulting choropleth is visualized in Figure 1(b). We also provide, in Figure 1(a), the region referred to as the Hindi belt (Jaffrelot, 2000), where Hindi is the first language of the bulk of the population. We observe a strong correlation of the estimated geographic extent of Hindi with the Hindi belt states.

(a) Hindi Belt States  (b) Total Hindi Usage per State  (c) Legend

Figure 1: Choropleths of Hindi usage patterns in India. (a) shows the geographic region identified as the Hindi belt. (b) shows the patterns of Hindi usage. We extend the results for Andhra Pradesh to Telangana, and Bihar to Jharkhand, because the same news networks cater to both states. The base maps used for this plot are sourced from the Government of India. The authors are aware that these maps include disputed territories. These maps do not constitute judgments on existing disputes.

In Table 6, we list the state-wise estimates of Hindi usage in our corpus. Our findings are consistent with existing knowledge of Hindi's geographic spread. In the Hindi belt states, more than 80% of the comments were authored in Hindi. Consistent with census data (Census, 2011) and prior literature (Ramaswamy, 1997), the fraction of Hindi comments discovered in the Tamil Nadu origin subset was minuscule.

State             hi        hiN       hi + hiN
Andhra Pradesh     1.66%     0.05%     1.71%
Bihar             67.97%    14.29%    82.26%
Gujarat           24.23%     3.41%    27.64%
Karnataka          1.85%     0.02%     1.87%
Kerala             0.48%     0.02%     0.50%
Madhya Pradesh    76.21%    10.39%    86.60%
Maharashtra        7.46%     4.18%    11.64%
Odisha             9.20%     0.02%     9.22%
Rajasthan         58.48%    29.90%    88.38%
Tamil Nadu         0.22%     0.01%     0.23%
Uttar Pradesh     63.56%    22.23%    85.79%
West Bengal        4.72%     0.18%     4.90%

Table 6: Presence of Hindi. hi is Romanized Hindi; hiN is Devanagari Hindi. Hindi belt states (Jaffrelot, 2000) are highlighted with blue.

Romanized vs. Devanagari: As shown in Table 6, the ratio of hi and hiN usage reveals that a vast majority of internet users eschew the traditional Devanagari script and instead use the Roman script. However, the ratio of Roman script to Devanagari script is substantially less lopsided in the Hindi belt states than in the other states. Our studies are consistent with Gella et al. (2014).

Estimating bilinguality: We conduct a user-focused study by computing language usage statistics on a per-user basis. We assume that a user is proficient in a language L if she posts two or more comments in L (in order to accommodate some estimation error). If a user is estimated to be proficient in two languages, then we label her as bilingual. Romanized and native script comments are both considered an equal demonstration of proficiency in a given language. We acknowledge that this is at best a noisy estimate.

Out of 159,993 total users, 41,776 users (26.1%) were marked as bilinguals using Fend-to-end. According to the 2011 census (Census, 2011), 26% of the Indian population are bilinguals. Hence, our noisy estimate was surprisingly close to the ground truth. We observe that over 70% of the discovered bilinguals in our corpus used Hindi-English. A detailed plot is presented in the Appendix.

7.5 Trace Language Detection

We conclude this section with an analysis of (1) to what extent we address L̂polyglot's inability to detect trace languages, and (2) why it could be worth addressing. Our definition of trace language is corpus-specific. We consider a language L to be a trace language in a corpus D if fewer than 1%

Language Data set Example comment Loose translation bn Dhope Indian air force k varote ferea Thank you Pak army for returning our Indian Air deoar jonno osonkho dhonnobad Pak Force (pilot). armyke te Dhope yudham vasthey manam kuda chala We’ll suffer a lot too in the event of war... nastha potham... bn Dhelp Tawhid ami rohiggha dhar help Tawhid, I want to help the Rohingyas. Please sug- korte chai plz amaka hepl korar gest me some ways I can help them. I live outside moto kisu opai bolo bz ami of the country. dhashar bahira thaki te Dhelp manama hinduvulama muslim Are we Hindu, Muslim, Christian - this is irrelevant, christans ani kadu manushulama are we human - this is relevant; if we were in their aannadi kavali manamu valla position, we would understand paristutulalo unte telustundi bn DCOVID Amra 500jon Lok Bangalore atkie We are 500 people stuck in Bangalore. Please make Achi please amader Bari. Pouche some arrangements so that we can reach our home din ,amader Bari west Bengal in West Bengal, India. India te DCOVID Please sir lo pettandi Please sir, post on Twitter, I’m a woman with 2 please nenoka mahilalu 2 chinna small kids, we’re stuck in Jammikunta, please help pillalu unnaru Sir mem jammikunta us get to Nijambad lo undipoyamu nijamabad cherela chudandi Sir

Table 7: Random sample of comments in trace languages detected by our system.
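Comments such as those in Table 7 are surfaced by running the classifier over a corpus and keeping only the languages that fall under the 1% trace threshold. A minimal sketch of this filtering step, assuming per-comment predicted labels and made-up counts:

```python
from collections import Counter

def trace_languages(predicted_labels, threshold=0.01):
    """Languages accounting for fewer than `threshold` of all documents."""
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    return {lang for lang, n in counts.items() if n / total < threshold}

def trace_comments(comments, predicted_labels):
    """Keep only the comments written in a trace language."""
    trace = trace_languages(predicted_labels)
    return [(lang, text) for text, lang in zip(comments, predicted_labels)
            if lang in trace]

# Illustrative corpus: 990 English predictions, 9 Telugu, 1 Bengali.
labels = ["en"] * 990 + ["te"] * 9 + ["bn"]
print(sorted(trace_languages(labels)))  # → ['bn', 'te']
```

In practice, the flagged comments would then be sampled for human annotation, as done in the precision study below.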

of the documents in D are authored in L. In this section, we focus on the following three corpora, one of which (DCOVID) we introduce here:
1. Dhope: 2.04 million YouTube comments relevant to the 2019 India-Pakistan conflict (Palakodety et al., 2020a).
2. Dhelp: 263k YouTube comments relevant to the Rohingya refugee crisis (Palakodety et al., 2020c).
3. DCOVID: 777,748 comments from 5,301 videos from two highly popular Indian news channels (NDTV and ) posted between 30th January, 2020 (the day the first COVID-19 positive case was reported in India) and 10th April, 2020.

As reported in Palakodety et al. (2020a), Lˆpolyglot discovered three clusters in Dhope: (1) en, (2) hi, and (3) hiN; no other languages were detected. However, in our experiments, we found the presence of multiple trace languages. For instance, overall, our method Fend-to-end found 3,373 Telugu (te) and 205 Bengali (bn) comments in Dhope. Human annotation of 100 randomly sampled comments in each of the two languages revealed a precision of 100% and 97% for te and bn, respectively. Similarly, we conducted a search for bn and te comments in Dhelp, and found 1,251 and 146 comments, respectively. Human annotation of 100 randomly sampled comments from each of the two languages yielded

precision of 99% for both bn and te. In Table 7, we list example peace-seeking, hostility-diffusing comments (hope speech) and comments indicating support for the disenfranchised Rohingyas (help speech).

Finally, when Fend-to-end is run on DCOVID, we discover comments in several languages requesting assistance during the nationwide lockdown (BBC, 2020). Our method reveals the presence of vulnerable individuals who express themselves in low-resource languages. We hope our tool can open the gates for research in this humanitarian domain.

8 Conclusion

In this paper, we present a language identification tool with a focus on nine major Romanized Indian languages. Despite the widespread use of Romanization on social media, NLP resources and tools often focus on the native scripts. Our tool integrates with an existing large-scale corpus and holds promise as a valuable resource for Indian social media analysis. Our pipeline leverages a recent NLP algorithm to obtain weak labels for a large number of samples, substantially reducing the annotation cost. Finally, we conduct studies on the geographic extent, bilinguality, and Romanization of Hindi, and observe that these align with existing studies and surveys.

References

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 13–23.

BBC. Why has India's Assam erupted over an 'anti-Muslim' law? Online; accessed 12-May-2020.

BBC. 2020. Coronavirus: India's pandemic lockdown turns into a human tragedy. Online; accessed 3-June-2020.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pages 65–74. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Census. 2011. 2011 census data. Online; accessed 3-June-2020.

Arunavha Chanda, Dipankar Das, and Chandan Mazumdar. 2016. Unraveling the English-Bengali code-mixing phenomenon. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 80–89, Austin, Texas. Association for Computational Linguistics.

Monojit Choudhury, Kalika Bali, Tirthankar Dasgupta, and Anupam Basu. 2010. Resource creation for training and testing of transliteration systems for Indian languages. LREC.

Matteo Cinelli, Walter Quattrociocchi, Alessandro Galeazzi, Carlo Michele Valensise, Emanuele Brugnoli, Ana Lucia Schmidt, Paola Zola, Fabiana Zollo, and Antonio Scala. 2020. The COVID-19 social media infodemic.

Ma Dewen. 2010. A study on the two waves of states-reorganization in India. South Asian Studies Quarterly, 1.

Pinar Donmez, Jaime Carbonell, and Jeff Schneider. 2010. A probabilistic framework to learn from multiple annotators with time-varying accuracy. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 826–837. SIAM.

FastText. FastTextLangID. Online; accessed 3-June-2020.

Spandana Gella, Kalika Bali, and Monojit Choudhury. 2014. "Ye word kis lang ka hai bhai?" Testing the limits of word level language identification. In Proceedings of the 11th International Conference on Natural Language Processing, pages 368–377.

Google. GoogleLangID. Online; accessed 3-June-2020.

HindustanTimes. 2019. YouTube now has 265 million users in India. Online; accessed 3-June-2020.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016. Hierarchical character-word models for language identification. In Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media, pages 84–93, Austin, TX, USA. Association for Computational Linguistics.

Christophe Jaffrelot. 2000. The rise of the other backward classes in the Hindi belt. The Journal of Asian Studies, 59(1):86–108.

Tommi Sakari Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2019. Unsung challenges of building and deploying language technologies for low resource language communities. arXiv preprint arXiv:1912.03457.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th EACL: Volume 2, Short Papers, pages 427–431.

Ashiqur R. KhudaBukhsh, Shriphani Palakodety, and Jaime G. Carbonell. 2020. Harnessing code switching to transcend the linguistic barrier. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 4366–4374. ijcai.org.

Upendra Kumar, Vishal Singh, Chris Andrew, Santhoshini Reddy, and Amitava Das. 2018. Consonant-vowel sequences as subword units for code-mixed languages. In Thirty-Second AAAI Conference on Artificial Intelligence.

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, and Pushpak Bhattacharyya. 2014. Sata-anuvadak: Tackling multiway translation of Indian languages. pan, 841(54,570):4–135.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. What's in a domain? Learning domain-robust text representations using adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 474–479, New Orleans, Louisiana. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17–25.

Puneet Mathur, Ramit Sawhney, Meghna Ayyar, and Rajiv Shah. 2018. Did you offend me? Classification of offensive tweets in Hinglish language. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 138–148, Brussels, Belgium. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations.

Tom M. Mitchell. 2004. The role of unlabeled data in supervised learning. In Language, Knowledge, and Representation, pages 103–111. Springer.

Gaurav Mohanty, Abishek Kannan, and Radhika Mamidi. 2017. Building a SentiWordNet for Odia. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 143–148.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020a. Hope speech detection: A computational analysis of the voice of peace. In ECAI 2020 - 24th European Conference on Artificial Intelligence, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 1881–1889. IOS Press.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020b. Mining insights from large-scale corpora using fine-tuned language models. In ECAI 2020 - 24th European Conference on Artificial Intelligence, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 1890–1897. IOS Press.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020c. Voice for the voiceless: Active sampling to detect comments supporting the Rohingyas. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 454–462.

Sumathi Ramaswamy. 1997. Passions of the Tongue: Language Devotion in Tamil India, 1891–1970, volume 29. Univ of California Press.

Rupak Sarkar, Sayantan Mahinder, Hirak Sarkar, and Ashiqur R. KhudaBukhsh. 2020. Social media attributions in the context of water crisis. In Empirical Methods in Natural Language Processing (EMNLP), 2020, page to appear.

Statista. 2020. Most popular social networks worldwide as of January 2020, ranked by number of active users. Online; accessed 3-June-2020.

Tatoeba. 2020. Tatoeba. Online; accessed 3-June-2020.

R. Vijayakrishna and L. Sobha. 2008. Domain focused named entity recognizer for Tamil using conditional random fields. In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages.

Paola Virga and Sanjeev Khudanpur. 2003. Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition - Volume 15, pages 57–64. Association for Computational Linguistics.

Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 974–979, Doha, Qatar. Association for Computational Linguistics.

YourStory. 2018. YouTube monthly user base touches 265 million in India, reaches 80 pc of internet population. Online; accessed 3-June-2020.

9 Appendix

9.1 Annotation

All annotations are performed by two native speakers of each of the languages we considered. All labels are consensus labels.

9.2 Detailed Performance with Assamese

Our data set was crawled using the publicly available YouTube API on the YouTube channel of CNN News18 Assam/Northeast. Overall, we obtained 66,923 comments from 7,170 videos, of which weak labels (4,337 English and 21,411 Assamese) were obtained using Lˆpolyglot. We augmented our previous training set with these weakly labeled comments. The detailed performance is presented in Table 9.

True label \ Predicted label
      as   bn   en   gu   hi   kn   ml   mr   or   ta   te   ol
as    92    3    0    1    2    0    0    0    2    0    0    0
bn     0  100    0    0    0    0    0    0    0    0    0    0
en     0    0  100    0    0    0    0    0    0    0    0    0
gu     0    0    0  100    0    0    0    0    0    0    0    0
hi     0    0    0    0  100    0    0    0    0    0    0    0
kn     1    0    0    0    0   99    0    0    0    0    0    0
ml     0    0    0    0    0    0   99    0    0    1    0    0
mr     0    0    0    0    0    0    0  100    0    0    0    0
or     0    0    0    1    2    0    0    0   96    1    0    0
ta     0    0    0    0    0    0    0    0    0  100    0    0
te     0    0    0    0    0    0    0    0    0    0  100    0
ol     0    0    0    0    0    0    0    0    0    0    0    0

Table 9: Confusion matrix of the performance evaluation of Fend-to-end on 1,100 annotated comments; ol denotes other languages.

9.3 Classification Framework

The classification framework we use (Joulin et al., 2017) contains a variety of optimizations focused on text classification: an architecture that enables parameter sharing, and efficient techniques for including token n-grams. The inference phase can process and label over 10 million documents in under five minutes (wall-clock time).

9.4 Test data set details

100 comments are randomly sampled for each of the 10 languages (bn, en, gu, hi, kn, ml, mr, or, ta, te). The average number of tokens in the comments is 22.6 ± 18.7. A language-wise breakdown is presented in Table 10.

Language | Comment length (tokens)
as | 17.03 ± 19.95
bn | 25.19 ± 18.33
en | 31.52 ± 27.04
gu | 23.82 ± 20.91
hi | 24.72 ± 15.33
kn | 17.85 ± 10.57
ml | 19.91 ± 11.92
mr | 25.73 ± 21.88
or | 13.34 ± 11.72
ta | 21.98 ± 14.59
te | 22.05 ± 20.95

Table 10: Statistics of the test data set.

9.5 Language pairs used by bilinguals

Figure 2 summarizes the relative distribution of language pairs in our bilingualism estimation experiment. Results show that Hindi-English bilingualism is the most dominant.

Figure 2: The top language pairs used by bilinguals in our corpus. Hindi and English feature prominently in all the language pairs.

9.6 List of YouTube channels

The YouTube channels considered in Palakodety et al. (2020b) are listed in Tables 8 and 11.

State | Channels
Andhra Pradesh | TV9 Telugu Live
Bihar | News18 Bihar Jharkhand, ZeeBiharJharkhand
Gujarat | TV9 Gujarati, ABP Asmita
Karnataka | TV9 Kannada
Kerala | Asianetnews
Madhya Pradesh | News18 MP Chhattisgarh, Zee Madhya Pradesh Chhattisgarh
Maharashtra | ABP Majha
Odisha | OTV
Rajasthan | ZeeRajasthanNews
Tamil Nadu | Puthiyathalaimurai TV, Polimer News
Uttar Pradesh | News18 UP Uttarakhand, Zee Uttarpradesh Uttarakhand
West Bengal | ABP ANANDA

Table 8: Regional channels.

IndiaTV, NDTV India, Republic World, The Times of India, Zee News, ABP NEWS, CNN-News18, NDTV, TIMES NOW, India Today, Hindustan Times

Table 11: National channels.
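The per-language precision figures reported in this paper can be read off a confusion matrix such as Table 9: the precision of a predicted class is its diagonal entry divided by its column sum. A minimal sketch with a toy three-language matrix (illustrative numbers, not the actual Table 9 values):

```python
def per_class_precision(confusion, labels):
    """Precision per predicted class from a square confusion matrix
    (rows = true labels, columns = predicted labels)."""
    precisions = {}
    for j, label in enumerate(labels):
        predicted_as_j = sum(row[j] for row in confusion)  # column sum
        precisions[label] = (confusion[j][j] / predicted_as_j
                             if predicted_as_j else 0.0)
    return precisions

# Toy matrix in the spirit of Table 9 (made-up counts).
labels = ["bn", "en", "te"]
confusion = [
    [97, 2, 1],   # true bn
    [0, 100, 0],  # true en
    [3, 0, 97],   # true te
]
print(per_class_precision(confusion, labels))
```

The same column-wise computation over Table 9 yields the per-language figures discussed in the evaluation.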
