Annotation Efficient Language Identification from Weak Labels

Shriphani Palakodety∗ (Onai) and Ashiqur R. KhudaBukhsh∗ (Carnegie Mellon University)
[email protected] [email protected]

Abstract

India is home to several languages with more than 30m speakers. These languages exhibit significant presence on social media platforms. However, several of these widely-used languages are under-addressed by current Natural Language Processing (NLP) models and resources. User-generated social media content in these languages is also typically authored in the Roman script as opposed to the traditional native script, further contributing to resource scarcity. In this paper, we leverage a minimally supervised NLP technique to obtain weak language labels from a large-scale Indian social media corpus, leading to a robust and annotation-efficient language-identification technique spanning nine Romanized Indian languages. In fast-spreading pandemic situations such as the current COVID-19 situation, information processing objectives might be heavily tilted towards under-served languages in densely populated regions. We release our models to facilitate downstream analyses in these low-resource languages.1 Experiments across multiple social media corpora demonstrate the model's robustness and provide several interesting insights on Indian language usage patterns on social media. We release an annotated data set of 1,000 comments in ten Romanized languages as a social media evaluation benchmark.1

1 Introduction

Much of the current NLP research focuses on a handful of world languages (e.g., English, French, Spanish). These languages enjoy substantially larger computational linguistic resources than their low-resource counterparts (e.g., Bengali, Odia). However, in the midst of global-scale events like the ongoing COVID-19 pandemic, demand for linguistic resources might get recalibrated; information processing objectives might be heavily tilted towards under-served languages that are prevalent in many densely populated regions.

In this paper, we focus on language identification in noisy, social media settings, a basic and highly critical linguistic resource prerequisite for downstream analysis in a multilingual environment. Our solution extends support for nine major Indian languages (see Table 1) spanning the native tongues of 85% of India's population (Census, 2011). These under-resourced languages are heavily used in several densely populated travel hubs and on social media. User-generated web content in these languages is typically authored in the Roman script as opposed to the traditional native script, leading to scarcer linguistic resources (Virga and Khudanpur, 2003; Choudhury et al., 2010; Gella et al., 2014; Barman et al., 2014; Palakodety et al., 2020a). Existing large-scale language identification tools prioritize the languages' native scripts (e.g., FastText; Google) over the Romanized variants. Our solution focuses on these Romanized variants and is integrated with a widely used existing language-identification system (FastText) supporting 355 languages. We release our open-source language identification system1 to facilitate Indian social media analysis.

Annotator availability is a major concern that may constrain data acquisition efforts in low-resource settings (Joshi et al., 2019). Our proposed solution is extremely annotation efficient; it utilizes a recent result (Palakodety et al., 2020a) to automatically group a multilingual corpus into largely monolingual clusters that can be extracted with minimal supervision. Using a mere 260 annotated short documents (YouTube video comments), we

∗ Shriphani Palakodety and Ashiqur R. KhudaBukhsh are equal-contribution first authors. Ashiqur R. KhudaBukhsh is the corresponding author.
1 Resources are available at: https://www.cs.cmu.edu/~akhudabu/IndicLanguage.html.

Proceedings of the 2020 EMNLP Workshop W-NUT: The Sixth Workshop on Noisy User-generated Text, pages 181–192, Online, Nov 19, 2020. © 2020 Association for Computational Linguistics

assign weak labels to a data set of 2.8 million comments spanning the aforementioned languages. Our model performs favorably when compared against an existing commercial solution.

While census data and surveys can provide useful information about linguistic diversity and spread, analyses of user-generated multilingual corpora can complement these surveys with additional useful insights of their own. We conduct a focused analysis to explore if the (estimated) distribution of web usage of Hindi across different Indian states aligns with common knowledge. Our analysis indicates that Hindi's web presence is considerably higher in a cluster of North Indian states referred to as the Hindi belt (Jaffrelot, 2000) than in the South Indian states. We further analyze similar research questions concerning the relative usage of the Roman script and the native script for Hindi. We finally conclude with a small exploratory study on our method's effectiveness in detecting languages with trace presence in multiple corpora and outline some of the possible utilities.

Contributions: The main contributions of this paper are the following:
• Resource: We release an important linguistic resource to detect nine heavily-spoken Indic languages expressed in the Roman script.
• Method: We propose an annotation-efficient method to construct this language identifier and demonstrate extensibility.
• Linguistic: We conduct a web-scale analysis of Hindi usage shedding light on multilinguality, geographic spread, and usage patterns.
• Social: We outline how our tool can detect trace presence of other languages, which can aid in constructing data sets for humanitarian challenges.

Why YouTube? As of January 2020, YouTube is the second-most popular social media platform in the world, drawing 2 billion active users (Statista, 2020). YouTube is the most popular social media platform in India with 265 million monthly active users (225 million on mobile), accounting for 80% of the population with internet access (HindustanTimes, 2019; YourStory, 2018). YouTube video comments have been used as data sources to analyze recent important events (Palakodety et al., 2020a,c; Sarkar et al., 2020; Cinelli et al., 2020).

Language     ISO code    First language speakers
Bengali      bn           8.03%
Gujarati     gu           4.58%
Hindi        hi          43.63%
Kannada      kn           3.61%
Marathi      mr           6.86%
Malayalam    ml           2.88%
Odia         or           3.10%
Tamil        ta           5.70%
Telugu       te           6.70%

Table 1: List of languages we considered with their corresponding ISO 639-1 codes and first language speakers as a percentage of the Indian population. Data is collected from the 2011 census (Census, 2011).

2 Data Set: YouTube Video Comments

In order to construct our language identification system, we require a web-scale Indian social media data set that (i) has considerable presence of the nine languages we are interested in, and (ii) captures a representative fraction of the Indian web users. To achieve this two-fold goal, we consider a data set introduced in Palakodety et al. (2020b) to analyze the 2019 Indian General Election. The data set consists of comments on YouTube videos hosted by popular news outlets in India. Overall, the corpus consists of 6,182,868 comments on 130,067 videos by 1,518,077 users posted in a 100-day period leading up to the 2019 Indian General Election.

Why this data set? The data set considers two highly popular YouTube news channels for each of the 12 Indian states that contribute 20 or more seats in the lower house of the parliament. State boundaries in India were drawn along linguistic lines (Dewen, 2010). The dominant regional language in the Hindi belt (Jaffrelot, 2000) is Hindi, and the other states each feature a unique dominant language written in either the Latin alphabet (in informal settings) or a native script. All nine languages we focus on (listed in Table 1) are the dominant language in one or more of these 12 states. The regional news networks considered provide coverage in the dominant regional language. Hence, the data set exhibits strong presence of all the nine regional languages we are interested in. In addition to these 24 regional news channels, the data set considers YouTube channels for 14 highly popular national news outlets (listed in the Appendix). Overall, this implies 38 YouTube channels (24 regional, 14 national) with an average subscriber count of 3,338,628.

3 Related Work

Learning from weak labels: The role of unlabeled and weakly (or noisily) labeled data in supervised learning is a well-studied problem and has received sustained focus (Mitchell, 2004; Donmez et al., 2010), and annotation efficiency in low-resource settings is a well-established requirement (Joshi et al., 2019). Our work leverages L̂polyglot (Palakodety et al., 2020a), a recently-proposed method for noisy language identification that requires minimal supervision. We utilize it as a dependency to obtain weak labels, reduce annotation burden, and construct a substantially more robust system.

Language identification: While language identification of well-formed text is a nearly-solved problem, the difficulty of identifying language in a noisy social media setting is well-established (Bergsma et al., 2012; Gella et al., 2014; Lui and Baldwin, 2014; Jaech et al., 2016; Jauhiainen et al., 2019). We see our work as a part of this continuing trend and as an important resource contribution for analyzing Indian social media.

Romanized Indian languages: In the context of processing Indian languages expressed on the web, challenges posed by the use of the Roman script instead of the native script have been reported in several recent studies in the context of code-mixed English-Bengali (Chanda et al., 2016) and English-Hindi (Kumar et al., 2018) text. While addressing word-level language identification, Gella et al. (2014) reported that 90% of posts in Indian languages on Facebook are expressed in the Roman script. Prevalence of Romanized Hindi has also been previously reported in Barman et al. (2014). Our study takes previous findings one small step forward with a (noisy) geographical analysis of Hindi web usage.

Bridging the resource gap: We also see our work as a part of the ongoing effort in bridging the resource gap between Indian languages and world languages (Vyas et al., 2014; Vijayakrishna and Sobha, 2008; Kunchukuttan et al., 2014; Mohanty et al., 2017; Joshi et al., 2020).

4 Background

In this section, we summarize a few key NLP models and results critical to our methods.

Skip-gram embeddings: The Skip-gram model takes as input a word w ∈ W (the vocabulary) and predicts words wc ∈ W that are likely to occur in the context of w. The training objective (predicting an input word's context) is parameterized by real-valued word representations or embeddings (Mikolov et al., 2013). Bojanowski et al. (2017) introduced sub-word extensions to the Skip-gram model to learn robust word representations even in the presence of misspellings or spelling variations. Following Palakodety et al. (2020a), we normalize and average a document's constituent word embeddings to yield the document embedding.

Monolingual cluster discovery: Palakodety et al. (2020a) introduced a minimal-supervision language detection method using polyglot Skip-gram embeddings with sub-word information. These embeddings discover monolingual subsets (clusters) in a multilingual corpus, which are subsequently retrieved using k-means, and a small sample per cluster (10 documents) is annotated. We refer to this method as L̂polyglot and leverage it for constructing our data set with minimal annotation burden. For obvious reasons, we do not compare L̂polyglot against our supervised solution that supports more than 300 languages. In Section 7.5, we demonstrate that our method detects languages with trace presence in a corpus (< 1%), a known limitation of L̂polyglot (Palakodety et al., 2020a).

5 Method

Research question: How do we construct an annotation-efficient language identification method supporting a wide array of Indian languages?

Our method has two main components: (i) an annotation-efficient procedure to construct a substantial data set with weak labels, and (ii) a supervised system trained on a data set comprising this corpus and an existing data set, Dtatoeba (Tatoeba, 2020), a well-known annotated data set supporting 355 languages. For the construction of this data set, the election corpus (Palakodety et al., 2020b) is stripped of all comments containing any non-English character. This maintains the focus on Romanized Indian languages, with the native variants sourced from Dtatoeba.

5.1 Assigning Weak Labels

Algorithm 1 outlines the steps for obtaining weak labels for nine Indian languages from multiple multilingual corpora and combining them with Dtatoeba. Our training data set is denoted by D = {di, L(di)}_{i=1}^{N1} ∪ {di, L̂(di)}_{i=1}^{N2}, where di is a document, L(·) returns a label annotated by a human, and L̂(·) returns a weak label obtained from L̂polyglot.

Algorithm 1: FweakLabel({D1, ..., Dn})
  Initialization: D ← Dtatoeba
  foreach Di ∈ {D1, ..., Dn} do
    Run L̂polyglot on Di
    Obtain clusters C1, ..., CK using k-means s.t. |C1| ≥ |C2| ≥ ... ≥ |CK|
    Identify J ≤ K dominant clusters
    for (j = 1; j ≤ J; j = j + 1) do
      Assign language to Cj (denoted by L(Cj), supplied by the annotator)
      Sample γ|Cj| comments from Cj ranked by proximity to the cluster center, 0 < γ ≤ 1
      Add the sampled comments to D with weak label L(Cj)
    end
  end
  Output: Return D

D is initialized with an annotated corpus, Dtatoeba (Tatoeba, 2020), i.e., N1 is the total number of samples present in Dtatoeba. Next, using L̂polyglot, our method obtains weak labels for N2 documents from the election corpus and adds them to D.

Our method, FweakLabel(·), takes an array of n multilingual corpora as input and runs L̂polyglot on each of them to obtain K language clusters, C1, ..., CK, such that |C1| ≥ |C2| ≥ ... ≥ |CK|. The J largest of these clusters are selected and annotators assign a language L(Cj), 1 ≤ j ≤ J. For a given cluster Cj, we obtain a set of pairs ⟨d, L̂(d)⟩ where d ∈ Cj and L̂(d) = L(Cj), i.e., the weak label of the document is the cluster's language label. For each of these J clusters, the top γ fraction is chosen for inclusion into D.

To summarize, for each multilingual corpus, L̂polyglot is used to obtain the top monolingual clusters, and a fraction of those are included with the cluster language label into the data set. Each cluster's language label is assigned by labeling 10 documents in the cluster; thus the vast majority of samples added to the data set is neither manually inspected nor labeled.

Recall that each of the regional news outlets we considered presents news in one of the dominant regional languages. We group all comments obtained from the news outlets of one particular state as one distinct corpus, i.e., each Di consists of comments posted in response to videos from a news outlet from a particular state. The choice of treating each individual state's corpus separately contributes further to annotation efficiency: knowing that a corpus is sourced from a region where a certain language is dominant allows us to select the appropriate annotators and reduces the annotation cost per document. We considered comments obtained from the 14 national outlets as a separate corpus. This led to 13 multilingual corpora (12 regional and 1 national). Hence, in our experiments, n, denoting the total number of corpora in Algorithm 1, was set to 13.

Parameter configuration: Our algorithm has two configurable parameters: (1) J, the number of clusters selected per corpus for inclusion in the final data set, and (2) γ, the fraction of documents per cluster chosen for inclusion. We set J to 2. The choice of J was guided by the intuition that English is widely spoken in India and each state would have at least one dominant regional language. Our choice of γ was guided by an in-depth analysis of L̂polyglot in the context of code switching (KhudaBukhsh et al., 2020). The study revealed that documents closest to the cluster centers exhibit strong monolinguality, while those farther from the centers can exhibit code-switching or may even be authored in languages with trace presence. In order to obtain high-quality weak labels, we set γ to 0.75.

Annotation efficiency: L̂polyglot requires 10 annotated samples to assign a language label to a cluster (Palakodety et al., 2020a). Our method requires 10Jn annotated samples. Hence, our method required 10 × 2 × 13 = 260 annotated comments to construct a corpus of 2,793,375 comments supporting nine Indian languages. This is combined with Dtatoeba to yield D consisting of 11,042,839 documents.
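The cluster-then-sample loop of Algorithm 1 can be sketched in a few lines. The sketch below is an illustrative reconstruction under stated assumptions, not the released implementation: the function names are ours, a plain Lloyd's k-means stands in for the L̂polyglot machinery, and the document embeddings are synthetic stand-ins for the polyglot sub-word Skip-gram vectors.

```python
import numpy as np

def embed_documents(word_vecs_per_doc):
    """Document embedding as in Section 4: normalize each word vector to
    unit length, then average them (stand-in for polyglot Skip-gram)."""
    out = []
    for vecs in word_vecs_per_doc:
        unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        out.append(unit.mean(axis=0))
    return np.vstack(out)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm with farthest-point initialization; any
    standard k-means implementation would do here."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def weak_label_corpus(doc_embs, k=5, top_j=2, gamma=0.75):
    """One pass of the FweakLabel loop for a single corpus: cluster the
    document embeddings, keep the top_j largest clusters, and per cluster
    keep the gamma fraction of documents closest to the center. In the
    real pipeline an annotator labels ~10 near-center documents to name a
    cluster's language; here the cluster id stands in for that label."""
    labels, centers = kmeans(doc_embs, k)
    sizes = np.bincount(labels, minlength=k)
    dominant = np.argsort(-sizes)[:top_j]   # |C1| >= |C2| >= ... >= |CK|
    selected = {}
    for c in dominant:
        members = np.flatnonzero(labels == c)
        dists = np.linalg.norm(doc_embs[members] - centers[c], axis=1)
        ranked = members[np.argsort(dists)]
        selected[int(c)] = ranked[: int(gamma * len(members))]
    return selected
```

Documents near a cluster center are the ones most likely to be strongly monolingual, which is why the γ cut is taken by proximity rather than at random.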

              Predicted label
True    bn   en   gu   hi   kn   ml   mr   or   ta   te   ol
bn     100    0    0    0    0    0    0    0    0    0    0
en       0  100    0    0    0    0    0    0    0    0    0
gu       0    0  100    0    0    0    0    0    0    0    0
hi       0    0    0  100    0    0    0    0    0    0    0
kn       0    0    0    0   99    0    0    0    0    0    1
ml       0    0    0    0    0   99    0    0    1    0    0
mr       0    0    0    0    0    0  100    0    0    0    0
or       0    0    1    2    0    0    0   96    1    0    0
ta       0    0    0    0    0    0    0    0  100    0    0
te       0    0    0    0    0    0    0    0    0  100    0
ol       0    0    0    0    0    0    0    0    0    0    0

Table 2: Confusion matrix of performance evaluation of Fend-to-end on 1,000 annotated comments. For a given language, better or equal performance than the baseline is highlighted with blue; ol denotes other languages.

              Predicted label
True    bn   en   gu   hi   kn   ml   mr   or   ta   te   ol
bn      97    0    0    2    0    0    0    0    0    0    1
en       0   99    0    1    0    0    0    0    0    0    0
gu       0    0   92    4    0    0    4    0    0    0    0
hi       0    0    0   99    0    0    1    0    0    0    0
kn       1    1    2    0   86    5    0    0    0    3    2
ml       0    0    0    0    1   95    1    0    2    1    0
mr       0    0    1    0    0    0   99    0    0    0    0
or      20    0   13   18    2    3   16    0    0    4   24
ta       0    0    0    0    3   10    0    0   79    3    5
te       1    0    0    1    1    1    1    0    0   95    0
ol       0    0    0    0    0    0    0    0    0    0    0

Table 3: Confusion matrix of performance evaluation of GoogleLangID on 1,000 annotated comments. For a given language, better or equal performance than Fend-to-end is highlighted with blue; ol denotes other languages.
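Row-normalized matrices like the ones above can be produced from raw predictions with a few lines of bookkeeping. The helper below is our own illustration, not part of the released tooling; the label names in the usage example are the ISO codes from Table 1.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Confusion matrix in the style of Tables 2-3: rows are true
    languages, columns are predicted languages, and each row holds
    percentages of that true class."""
    index = {lab: i for i, lab in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        counts[index[t], index[p]] += 1
    # Rows with no test documents (e.g., the `ol` row) stay all-zero.
    return 100.0 * counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
```

Diagonal entries are per-language accuracies; off-diagonal mass shows where a detector's errors concentrate, e.g., the widely scattered or row of Table 3.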

5.2 Learning with Weak Labels

Once D is obtained, we train a classifier that takes as input a document and predicts the language label. We provide an end-to-end model operating directly on the text and producing a language label. The model utilizes a highly efficient text classification framework introduced in Joulin et al. (2017). The framework introduces a variety of optimizations and is capable of classifying billions of documents in minutes without compromising on accuracy (implementation details are presented in the Appendix). We refer to this model as Fend-to-end. The model achieves strong performance (test accuracy > 98%) on a held-out set that is not seen during any of the training phases.

6 Experimental Setup

Test set: We construct an Indian language test set consisting of 1,000 annotated YouTube comments (consensus labels by two proficient annotators per language) in 10 languages (Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Tamil, Telugu), with 100 documents in each language. These documents are randomly sampled from the output of FweakLabel and are never seen during our supervised training phase with weak labels. The average number of tokens in the comments is 22.6 ± 18.7. Please see the Appendix for detailed statistics of our test data set.

Baseline: We consider a commercial solution (Google) that can detect over 100 languages, including the 10 mentioned above (referred to as GoogleLangID), as our baseline. In addition to GoogleLangID, Palakodety et al. (2020a) compared against FastTextLangID. We avoid this comparison because we feel it is unfair to compare against FastTextLangID, given that it does not support Romanized Indic languages (it achieves an overall accuracy of 10% on our test set). We see our paper as a resource paper that makes a small step forward in addressing the lack of linguistic resources for Indian social media analysis.

Fairness: We first emphasize that the main purpose of comparing against GoogleLangID is not to claim that our annotation-efficient solution is superior to GoogleLangID across the board. Our goal is rather to attract the research community's attention to our solution's effectiveness in this under-explored, specific domain of noisy social media texts generated in the Indian subcontinent. A fairer performance comparison between the two methods would require the methods to be trained on identical data sets with comparable computation budgets. Due to these varying levels of resources, it is not possible to claim one method's superiority over the other. We are rather highlighting our method's (1) annotation efficiency and (2) ability to extend support for newer languages (e.g., at present GoogleLangID has limited support for Odia (or) and no support for Assamese (as)).

7 Results and Analysis

Method          Overall accuracy    Excluding Odia
Fend-to-end     99.4                99.8
GoogleLangID    84.1                93.4

Table 4: Performance comparison. Since it is unclear if GoogleLangID supports Odia (or), the right-most column presents performance excluding Odia from the test set.

Table 4 summarizes the performance comparison between our proposed classifier and the GoogleLangID baseline. Our results indicate

that our method considerably outperforms the baseline, a well-known commercial solution. A closer look at the individual performance for each language (confusion matrices presented in Table 2 and Table 3) reveals that across all languages, our method performs equal to or better than GoogleLangID. Although the performance gap primarily stems from our method's overwhelmingly stronger performance on Odia (or), we perform considerably better than GoogleLangID even if we exclude Odia from the test data set.

Our performance comparison highlights the following two points. First, we reiterate that our goal is not to claim that our method would perform better than GoogleLangID across the board, but rather to demonstrate the value our method adds in processing noisy social media texts. It is possible that our method is more attuned to noisy short social media texts, while GoogleLangID could (possibly) be trained on cleaner corpora, which would explain our method's stronger performance. This is further corroborated by the fact that even our mispredictions (barring one) remain confined to the regional languages, while many of GoogleLangID's mispredictions are distributed across other languages (ol). Second, GoogleLangID's weak performance in detecting Odia highlights the gap in current solutions and shows how our method can effectively and efficiently address these issues.2

2 Odia is listed as one of the supported languages by GoogleLangID; it is unclear if this tool supports Romanized Odia. Assamese is not supported by GoogleLangID.

Recall that Dtatoeba contains a large set of languages (including the native script versions of the Indian languages considered in this paper). Test accuracy on a held-out set was 98.4%. Experiments reveal that introduction of the weakly labeled corpus does not impact performance on Dtatoeba (identical test accuracy of 98.4%).

7.1 Extensibility

We constructed a new data set of comments on YouTube videos from an Assamese news channel (News18 Assam/Northeast) and used the same L̂polyglot approach to obtain weak labels for Romanized Assamese. We do not show a direct comparison with GoogleLangID because GoogleLangID does not support Assamese (as). However, on an augmented test data set of 1,100 comments (100 Assamese comments with consensus labels from two annotators), we achieved a performance of 92% accuracy on identifying Assamese while retaining our previous performance on every other language. Assam has been a center for political debates and unrest in recent times (BBC). Our resource to detect Romanized Assamese can be a vital tool which, to the best of our knowledge, does not exist elsewhere. Details are presented in the Appendix.

7.2 Domain-robustness

Our goal is to present an important Indian NLP resource that can perform well across multiple social media platforms. Hence, it is paramount that our system generalizes well to both in-domain and out-of-domain instances. In the context of the language identification task, domain adaptation has received recent attention (Li et al., 2018). In this section, we present an analysis of our system's out-of-domain performance.

Data set of Hinglish tweets: We consider a data set of tweets introduced in Mathur et al. (2018). The data set consists of 3,189 tweets written in English (en), Romanized Hindi (hi), and code-mixed English-Hindi (en-hi). We construct a randomly sampled data set of 100 tweets with equal proportions of Hindi and English tweets (consensus labels obtained from two annotators).

Method          Accuracy    Language    P       R       F1
Fend-to-end     0.98        en          1.00    1.00    1.00
                            hi          0.96    0.96    0.96
GoogleLangID    0.95        en          1.00    1.00    1.00
                            hi          1.00    0.90    0.95

Table 5: Performance comparison on the tweet data set. Best metric is highlighted in bold for each language. P: precision, R: recall.

As shown in Table 5, our system's out-of-domain performance was consistent with its in-domain performance. We performed marginally better than GoogleLangID. We admit that a more robust test on multiple data sets comprising content from a larger set of Indian languages from other social media platforms would further validate our out-of-domain performance. However, our current experiment indicates that our system's success is not limited to YouTube comment texts; it can generalize to tweets as well.

7.3 Usage Statistics

As demonstrated, our integrated setup covers the Romanized and native script variants of the most prevalent Indian languages. This enables us to

explore research questions on the usage patterns of these Indian languages. This part of our analysis is conducted on the entire election corpus containing comments written in all scripts.

Weak geo-labels: Recall that all of the 24 regional news outlets we consider present news in the dominant language of their respective states. Hence, it is reasonable to assume that a considerable fraction of users consuming the regional news and participating in the comments section have some affiliation to the region (state). Thus, a comment posted in response to a regional news outlet's video can be assigned a weak/noisy geographic label: the state targeted by the news network. For instance, we assume that a comment posted in response to a Tamil news video clip is likely to be authored by someone who either resides in or retains strong ties to Tamil Nadu. Combining these weak/noisy geographic labels with our language identification system, we can assess the geographic distribution of language use in India. Note that these results are only approximate estimates; YouTube comments do not contain any geographic information. Further, it is also not possible to estimate a user's knowledge of other languages through our analysis. For example, if a user comments solely in Hindi, it is not possible to assess their fluency in English or Bengali.

7.4 Hindi Web Usage

Geographic extent of Hindi usage: We label each comment with the language prediction from Fend-to-end and a geographic label corresponding to the origin state of the news outlet. All comments posted in Romanized and Devanagari Hindi (denoted by hi and hiN, respectively) are retained, and the resulting choropleth is visualized in Figure 1(b). We also provide, in Figure 1(a), the region referred to as the Hindi belt (Jaffrelot, 2000), where Hindi is the first language of the bulk of the population. We observe a strong correlation of the estimated geographic extent of Hindi with the Hindi belt states.

(a) Hindi Belt States  (b) Total Hindi Usage per State  (c) Legend

Figure 1: Choropleths of Hindi usage patterns in India. (a) shows the geographic region identified as the Hindi belt. (b) shows the patterns of Hindi usage. We extend the results for Andhra Pradesh to Telangana, and Bihar to Jharkhand, because the same news networks cater to both states. The base maps used for this plot are sourced from the Government of India. The authors are aware that these maps include disputed territories. These maps do not constitute judgments on existing disputes.

In Table 6, we list the state-wise estimates of Hindi usage in our corpus. Our findings are consistent with existing knowledge of Hindi's geographic spread. In the Hindi belt states, more than 80% of the comments were authored in Hindi. Consistent with census data (Census, 2011) and prior literature (Ramaswamy, 1997), the fraction of Hindi comments discovered in the Tamil Nadu origin subset was minuscule.

State             hi        hiN       hi + hiN
Andhra Pradesh     1.66%     0.05%     1.71%
Bihar             67.97%    14.29%    82.26%
Gujarat           24.23%     3.41%    27.64%
Karnataka          1.85%     0.02%     1.87%
Kerala             0.48%     0.02%     0.50%
Madhya Pradesh    76.21%    10.39%    86.60%
Maharashtra        7.46%     4.18%    11.64%
Odisha             9.20%     0.02%     9.22%
Rajasthan         58.48%    29.90%    88.38%
Tamil Nadu         0.22%     0.01%     0.23%
Uttar Pradesh     63.56%    22.23%    85.79%
West Bengal        4.72%     0.18%     4.90%

Table 6: Presence of Hindi. hi is Romanized Hindi; hiN is Devanagari Hindi. Hindi belt states (Jaffrelot, 2000) are highlighted with blue.

Romanized vs. Devanagari: As shown in Table 6, the ratio of hi and hiN usage reveals that a vast majority of internet users eschew the traditional Devanagari script and instead use the Roman script. However, the ratio of Roman script to Devanagari script is substantially less lopsided in the Hindi belt states than in the other states. Our studies are consistent with Gella et al. (2014).

Estimating bilinguality: We conduct a user-focused study by computing language usage statistics on a per-user basis. We assume that a user is proficient in a language L if she posts two or more comments in L (in order to accommodate some estimation error). If a user is estimated to be proficient in two languages, then we label her as bilingual. Romanized and native script comments are both considered an equal demonstration of proficiency in a given language. We acknowledge that this is at best a noisy estimate.

Out of 159,993 total users, 41,776 users (26.1%) were marked as bilinguals using Fend-to-end. According to the 2011 census (Census, 2011), 26% of the Indian population are bilinguals. Hence, our noisy estimate was surprisingly close to the ground truth. We observe that over 70% of the discovered bilinguals in our corpus used Hindi-English. A detailed plot is presented in the Appendix.

7.5 Trace Language Detection

We conclude this section with an analysis of (1) to what extent we address L̂polyglot's inability to detect trace languages, and (2) why it could be worth addressing. Our definition of trace language is corpus-specific. We consider a language L to be a trace language in a corpus D if fewer than 1%

Language Data set Example comment Loose translation bn Dhope Indian air force k varote ferea Thank you Pak army for returning our Indian Air deoar jonno osonkho dhonnobad Pak Force (pilot). armyke te Dhope yudham vasthey manam kuda chala We’ll suffer a lot too in the event of war... nastha potham... bn Dhelp Tawhid ami rohiggha dhar help Tawhid, I want to help the Rohingyas. Please sug- korte chai plz amaka hepl korar gest me some ways I can help them. I live outside moto kisu opai bolo bz ami of the country. dhashar bahira thaki te Dhelp manama hinduvulama muslim Are we Hindu, Muslim, Christian - this is irrelevant, christans ani kadu manushulama are we human - this is relevant; if we were in their aannadi kavali manamu valla position, we would understand paristutulalo unte telustundi bn DCOVID Amra 500jon Lok Bangalore atkie We are 500 people stuck in Bangalore. Please make Achi please amader Bari. Pouche some arrangements so that we can reach our home din ,amader Bari west Bengal in West Bengal, India. India te DCOVID Please sir lo pettandi Please sir, post on Twitter, I’m a woman with 2 please nenoka mahilalu 2 chinna small kids, we’re stuck in Jammikunta, please help pillalu unnaru Sir mem jammikunta us get to Nijambad lo undipoyamu nijamabad cherela chudandi Sir

Table 7: Random sample of comments in trace languages detected by our system.
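Comments such as those in Table 7 are surfaced by running the classifier over a corpus and keeping only the languages that fall under the 1% trace threshold. A minimal sketch of this filtering step, assuming per-comment predicted labels and made-up counts:

```python
from collections import Counter

def trace_languages(predicted_labels, threshold=0.01):
    """Languages accounting for fewer than `threshold` of all documents."""
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    return {lang for lang, n in counts.items() if n / total < threshold}

def trace_comments(comments, predicted_labels):
    """Keep only the comments written in a trace language."""
    trace = trace_languages(predicted_labels)
    return [(lang, text) for text, lang in zip(comments, predicted_labels)
            if lang in trace]

# Illustrative corpus: 990 English predictions, 9 Telugu, 1 Bengali.
labels = ["en"] * 990 + ["te"] * 9 + ["bn"]
print(sorted(trace_languages(labels)))  # → ['bn', 'te']
```

In practice, the flagged comments would then be sampled for human annotation, as done in the precision study below.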

of the documents in D are authored in L. In this section, we focus on the following three corpora, one of which (DCOVID) we introduce here:
1. Dhope: 2.04 million YouTube comments relevant to the 2019 India-Pakistan conflict (Palakodety et al., 2020a).
2. Dhelp: 263k YouTube comments relevant to the Rohingya refugee crisis (Palakodety et al., 2020c).
3. DCOVID: 777,748 comments from 5,301 videos from two highly popular Indian news channels (NDTV and ) posted between 30th January, 2020 (the day the first COVID-19 positive case was reported in India) and 10th April, 2020.

As reported in Palakodety et al. (2020a), Lˆpolyglot discovered three clusters in Dhope: (1) en, (2) hi, and (3) hiN; no other languages were detected. However, in our experiments, we found the presence of multiple trace languages. For instance, overall, our method Fend-to-end found 3,373 Telugu (te) and 205 Bengali (bn) comments in Dhope. Human annotation of 100 randomly sampled comments in each of the two languages revealed a precision of 100% and 97% for te and bn, respectively. Similarly, we conducted a search for bn and te comments in Dhelp, and found 1,251 and 146 comments, respectively. Human annotation of 100 randomly sampled comments from each of the two languages yielded

precision of 99% for both bn and te. In Table 7, we list example peace-seeking, hostility-diffusing comments (hope speech) and comments indicating support for the disenfranchised Rohingyas (help speech).

Finally, when Fend-to-end is run on DCOVID, we discover comments in several languages requesting assistance during the nationwide lockdown (BBC, 2020). Our method reveals the presence of vulnerable individuals who express themselves in low-resource languages. We hope our tool can open the gates for research in this humanitarian domain.

8 Conclusion

In this paper, we present a language identification tool with a focus on nine major Romanized Indian languages. Despite the widespread use of Romanization on social media, NLP resources and tools often focus on the native scripts. Our tool integrates with an existing large-scale corpus and holds promise as a valuable resource for Indian social media analysis. Our pipeline leverages a recent NLP algorithm to obtain weak labels for a large number of samples, substantially reducing the annotation cost. Finally, we conduct studies on the geographic extent, bilinguality, and Romanization of Hindi, and observe that these align with existing studies and surveys.

References

Utsab Barman, Amitava Das, Joachim Wagner, and Jennifer Foster. 2014. Code mixing: A challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 13–23.

BBC. Why has India's Assam erupted over an 'anti-Muslim' law? Online; accessed 12-May-2020.

BBC. 2020. Coronavirus: India's pandemic lockdown turns into a human tragedy. Online; accessed 3-June-2020.

Shane Bergsma, Paul McNamee, Mossaab Bagdouri, Clayton Fink, and Theresa Wilson. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pages 65–74. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Census. 2011. 2011 census data. Online; accessed 3-June-2020.

Arunavha Chanda, Dipankar Das, and Chandan Mazumdar. 2016. Unraveling the English-Bengali code-mixing phenomenon. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 80–89, Austin, Texas. Association for Computational Linguistics.

Monojit Choudhury, Kalika Bali, Tirthankar Dasgupta, and Anupam Basu. 2010. Resource creation for training and testing of transliteration systems for Indian languages. LREC.

Matteo Cinelli, Walter Quattrociocchi, Alessandro Galeazzi, Carlo Michele Valensise, Emanuele Brugnoli, Ana Lucia Schmidt, Paola Zola, Fabiana Zollo, and Antonio Scala. 2020. The COVID-19 social media infodemic.

Ma Dewen. 2010. A study on the two waves of states-reorganization in India. South Asian Studies Quarterly, 1.

Pinar Donmez, Jaime Carbonell, and Jeff Schneider. 2010. A probabilistic framework to learn from multiple annotators with time-varying accuracy. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 826–837. SIAM.

FastText. FastTextLangID. Online; accessed 3-June-2020.

Spandana Gella, Kalika Bali, and Monojit Choudhury. 2014. "Ye word kis lang ka hai bhai?" Testing the limits of word level language identification. In Proceedings of the 11th International Conference on Natural Language Processing, pages 368–377.

Google. GoogleLangID. Online; accessed 3-June-2020.

HindustanTimes. 2019. YouTube now has 265 million users in India. Online; accessed 3-June-2020.

Aaron Jaech, George Mulcaire, Shobhit Hathi, Mari Ostendorf, and Noah A. Smith. 2016. Hierarchical character-word models for language identification. In Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media, pages 84–93, Austin, TX, USA. Association for Computational Linguistics.

Christophe Jaffrelot. 2000. The rise of the other backward classes in the Hindi belt. The Journal of Asian Studies, 59(1):86–108.

Tommi Sakari Jauhiainen, Marco Lui, Marcos Zampieri, Timothy Baldwin, and Krister Lindén. 2019. Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, 65:675–782.

Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. 2019. Unsung challenges of building and deploying language technologies for low resource language communities. arXiv preprint arXiv:1912.03457.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th EACL: Volume 2, Short Papers, pages 427–431.

Ashiqur R. KhudaBukhsh, Shriphani Palakodety, and Jaime G. Carbonell. 2020. Harnessing code switching to transcend the linguistic barrier. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 4366–4374. ijcai.org.

Upendra Kumar, Vishal Singh, Chris Andrew, Santhoshini Reddy, and Amitava Das. 2018. Consonant-vowel sequences as subword units for code-mixed languages. In Thirty-Second AAAI Conference on Artificial Intelligence.

Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, and Pushpak Bhattacharyya. 2014. Sata-anuvadak: Tackling multiway translation of Indian languages. pan, 841(54,570):4–135.

Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. What's in a domain? Learning domain-robust text representations using adversarial training. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 474–479, New Orleans, Louisiana. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2014. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17–25.

Puneet Mathur, Ramit Sawhney, Meghna Ayyar, and Rajiv Shah. 2018. Did you offend me? Classification of offensive tweets in Hinglish language. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 138–148, Brussels, Belgium. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations.

Tom M. Mitchell. 2004. The role of unlabeled data in supervised learning. In Language, Knowledge, and Representation, pages 103–111. Springer.

Gaurav Mohanty, Abishek Kannan, and Radhika Mamidi. 2017. Building a SentiWordNet for Odia. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 143–148.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020a. Hope speech detection: A computational analysis of the voice of peace. In ECAI 2020 - 24th European Conference on Artificial Intelligence, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 1881–1889. IOS Press.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020b. Mining insights from large-scale corpora using fine-tuned language models. In ECAI 2020 - 24th European Conference on Artificial Intelligence, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 1890–1897. IOS Press.

Shriphani Palakodety, Ashiqur R. KhudaBukhsh, and Jaime G. Carbonell. 2020c. Voice for the voiceless: Active sampling to detect comments supporting the Rohingyas. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 454–462.

Sumathi Ramaswamy. 1997. Passions of the Tongue: Language Devotion in Tamil India, 1891–1970, volume 29. Univ of California Press.

Rupak Sarkar, Sayantan Mahinder, Hirak Sarkar, and Ashiqur R. KhudaBukhsh. 2020. Social media attributions in the context of water crisis. In Empirical Methods in Natural Language Processing (EMNLP), 2020, page to appear.

Statista. 2020. Most popular social networks worldwide as of January 2020, ranked by number of active users. Online; accessed 3-June-2020.

Tatoeba. 2020. Tatoeba. Online; accessed 3-June-2020.

R. Vijayakrishna and L. Sobha. 2008. Domain focused named entity recognizer for Tamil using conditional random fields. In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages.

Paola Virga and Sanjeev Khudanpur. 2003. Transliteration of proper names in cross-lingual information retrieval. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition - Volume 15, pages 57–64. Association for Computational Linguistics.

Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, and Monojit Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 974–979, Doha, Qatar. Association for Computational Linguistics.

YourStory. 2018. YouTube monthly user base touches 265 million in India, reaches 80 pc of internet population. Online; accessed 3-June-2020.

9 Appendix

9.1 Annotation

All annotations are performed by two native speakers of each of the languages we considered. All labels are consensus labels.

9.2 Detailed Performance with Assamese

Our data set was crawled using the publicly available YouTube API on the YouTube channel of CNN News18 Assam/Northeast. Overall, we obtained 66,923 comments from 7,170 videos, of which weak labels (4,337 English and 21,411 Assamese) were obtained using Lˆpolyglot. We augmented our previous training set with these weakly labeled comments. The detailed performance is presented in Table 9.

True label \ Predicted label
      as   bn   en   gu   hi   kn   ml   mr   or   ta   te   ol
as    92    3    0    1    2    0    0    0    2    0    0    0
bn     0  100    0    0    0    0    0    0    0    0    0    0
en     0    0  100    0    0    0    0    0    0    0    0    0
gu     0    0    0  100    0    0    0    0    0    0    0    0
hi     0    0    0    0  100    0    0    0    0    0    0    0
kn     1    0    0    0    0   99    0    0    0    0    0    0
ml     0    0    0    0    0    0   99    0    0    1    0    0
mr     0    0    0    0    0    0    0  100    0    0    0    0
or     0    0    0    1    2    0    0    0   96    1    0    0
ta     0    0    0    0    0    0    0    0    0  100    0    0
te     0    0    0    0    0    0    0    0    0    0  100    0
ol     0    0    0    0    0    0    0    0    0    0    0    0

Table 9: Confusion matrix of the performance evaluation of Fend-to-end on 1,100 annotated comments; ol denotes other languages.

9.3 Classification Framework

The classification framework we use (Joulin et al., 2017) contains a variety of optimizations focused on text classification: an architecture that enables parameter sharing, and efficient techniques for including token n-grams. The inference phase can process and label over 10 million documents in under five minutes (wall-clock time).

9.4 Test data set details

100 comments are randomly sampled for each of the 10 languages (bn, en, gu, hi, kn, ml, mr, or, ta, te). The average number of tokens in the comments is 22.6 ± 18.7. A language-wise breakdown is presented in Table 10.

Language | Comment length (tokens)
as | 17.03 ± 19.95
bn | 25.19 ± 18.33
en | 31.52 ± 27.04
gu | 23.82 ± 20.91
hi | 24.72 ± 15.33
kn | 17.85 ± 10.57
ml | 19.91 ± 11.92
mr | 25.73 ± 21.88
or | 13.34 ± 11.72
ta | 21.98 ± 14.59
te | 22.05 ± 20.95

Table 10: Statistics of the test data set.

9.5 Language pairs used by bilinguals

Figure 2 summarizes the relative distribution of language pairs in our bilingualism estimation experiment. Results show that Hindi-English bilingualism is the most dominant.

Figure 2: The top language pairs used by bilinguals in our corpus. Hindi and English feature prominently in all the language pairs.

9.6 List of YouTube channels

The YouTube channels considered in Palakodety et al. (2020b) are listed in Tables 8 and 11.

State | Channels
Andhra Pradesh | TV9 Telugu Live
Bihar | News18 Bihar Jharkhand, ZeeBiharJharkhand
Gujarat | TV9 Gujarati, ABP Asmita
Karnataka | TV9 Kannada
Kerala | Asianetnews
Madhya Pradesh | News18 MP Chhattisgarh, Zee Madhya Pradesh Chhattisgarh
Maharashtra | ABP Majha
Odisha | OTV
Rajasthan | ZeeRajasthanNews
Tamil Nadu | Puthiyathalaimurai TV, Polimer News
Uttar Pradesh | News18 UP Uttarakhand, Zee Uttarpradesh Uttarakhand
West Bengal | ABP ANANDA

Table 8: Regional channels.

IndiaTV, NDTV India, Republic World, The Times of India, Zee News, ABP NEWS, CNN-News18, NDTV, TIMES NOW, India Today, Hindustan Times

Table 11: National channels.
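The per-language precision figures reported in this paper can be read off a confusion matrix such as Table 9: the precision of a predicted class is its diagonal entry divided by its column sum. A minimal sketch with a toy three-language matrix (illustrative numbers, not the actual Table 9 values):

```python
def per_class_precision(confusion, labels):
    """Precision per predicted class from a square confusion matrix
    (rows = true labels, columns = predicted labels)."""
    precisions = {}
    for j, label in enumerate(labels):
        predicted_as_j = sum(row[j] for row in confusion)  # column sum
        precisions[label] = (confusion[j][j] / predicted_as_j
                             if predicted_as_j else 0.0)
    return precisions

# Toy matrix in the spirit of Table 9 (made-up counts).
labels = ["bn", "en", "te"]
confusion = [
    [97, 2, 1],   # true bn
    [0, 100, 0],  # true en
    [3, 0, 97],   # true te
]
print(per_class_precision(confusion, labels))
```

The same column-wise computation over Table 9 yields the per-language figures discussed in the evaluation.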
