Towards Offensive Language Identification for Dravidian Languages
Siva Sai and Yashvardhan Sharma
Birla Institute of Technology and Science, Pilani, India
f20170779@pilani.bits-pilani.ac.in, yash@pilani.bits-pilani.ac.in

Abstract

Offensive speech identification in countries like India poses several challenges due to the usage of code-mixed and romanized variants of multiple languages by users in their posts on social media. The challenge of offensive language identification on social media is even harder for Dravidian languages, considering the low resources available for them. In this paper, we explored the zero-shot learning and few-shot learning paradigms based on multilingual language models for offensive speech detection in code-mixed and romanized variants of three Dravidian languages: Malayalam, Tamil, and Kannada. We propose a novel and flexible approach, selective translation and transliteration, to reap better results from fine-tuning and ensembling multilingual transformer networks like XLM-RoBERTa and mBERT. We implemented pre-trained, fine-tuned, and ensembled versions of XLM-RoBERTa for offensive speech classification. Further, we experimented with inter-language, inter-task, and multi-task transfer learning techniques to leverage the rich resources available for offensive speech identification in English and to enrich the models with knowledge transfer from related tasks. The proposed models yielded good results and are promising for effective offensive speech identification in low-resource settings (code: https://github.com/SivaAndMe/TOLIDL_DravidianLangTech_EACL_2021).

1 Introduction

Offensive speech is defined as speech that causes a person to feel upset, resentful, annoyed, or insulted. In recent years, social media platforms such as Twitter, Facebook, and Reddit have been increasingly used for the propagation of offensive speech and the organization of hate- and offense-based activities (Mandl et al., 2020; Chakravarthi et al., 2020e).

In a country like India with multiple native languages, users prefer to use their regional language in their social media interactions (Thavareesan and Mahesan, 2019, 2020a,b). It has also been observed that users tend to use roman characters for texting instead of the native script. This poses a severe challenge for the identification of offensive speech, considering the under-developed methodologies for handling code-mixed and romanized text (Jose et al., 2020; Priyadharshini et al., 2020).

Until a few years ago, hate and offensive speech were identified manually, which is now an impossible task given the enormous amounts of data generated daily on social media platforms. The need for scalable, automated methods of hate speech detection has attracted significant research from the domains of natural language processing and machine learning. A variety of techniques and tools, such as bag-of-words models, N-grams, dictionary-based approaches, and word sense disambiguation, have been developed and experimented with by researchers. Recent developments in multilingual text classification are led by Transformer architectures like mBERT (Devlin et al., 2018) and XLM-RoBERTa (Conneau et al., 2019). An additional advantage of these architectures, particularly XLM-RoBERTa, is that they yield good results even for lower-resource languages, which is especially beneficial for Indian languages that lack well-established datasets. In our work, we focused on using these architectures in multiple ways. However, there is a caveat in directly applying these models to romanized or code-mixed text: the transformer models are trained on languages in their native script, not in the romanized script in which users prefer to write online. We solve this problem with a novel way of converting romanized sentences into their native script while preserving their semantic meaning: selective translation and transliteration.
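Since the classifiers in this paper are built by fine-tuning XLM-RoBERTa on the converted text, the following is a minimal, illustrative sketch (not the authors' exact configuration) of fine-tuning XLM-RoBERTa as a binary offensive / not-offensive classifier with the Hugging Face transformers library. The hyperparameters, the dataset wrapper, and the placeholder training data are assumptions made for illustration.

```python
# Minimal sketch (assumed setup, not the authors' exact configuration):
# fine-tune XLM-RoBERTa as a binary offensive / not-offensive classifier.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class CommentDataset(Dataset):
    """Wraps (text, label) pairs as tokenized tensors for the Trainer."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # 0 = not offensive, 1 = offensive

# Placeholder data: in practice these would be the comments after selective
# translation and transliteration, with their binary labels.
train_texts = ["example comment in native script"]
train_labels = [0]
train_ds = CommentDataset(train_texts, train_labels, tokenizer)

args = TrainingArguments(output_dir="xlmr-offensive",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```

The same recipe applies to mBERT by swapping the pre-trained model name.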
This work is an extension of (Sai and Sharma, 2020). Our main contributions are as follows: 1) we propose selective translation and transliteration for text conversion in romanized and code-mixed settings, which can be extended to other romanized and code-mixed contexts in any language; 2) we experiment with and analyze the effectiveness of fine-tuning and ensembling XLM-RoBERTa models for offensive speech identification in code-mixed and romanized scripts; 3) we investigate the efficacy of inter-language, inter-task, and multi-task transfer learning techniques and illustrate the results with t-SNE (Maaten and Hinton, 2008) visualizations.

The rest of the paper is organized as follows: in section 2 we discuss related work, followed by a description of the datasets in section 3. Section 4 describes the methodology and section 5 discusses the results. Finally, we conclude the paper in section 6.

2 Related works

Two major shared tasks organized on offensive language identification are OffensEval 2019 (Zampieri et al., 2019) and GermEval (Struß et al., 2019). Other related shared tasks include HASOC-19 (Mandl et al., 2019), which dealt with hate speech and offensive content identification in Indo-European languages, and TRAC-2020 (Kumar et al., 2020), which dealt with aggression identification in Bangla, Hindi, and English. While HASOC-19 and TRAC-2020 dealt with offensive speech identification in the Indian languages Bangla and Hindi, HASOC-Dravidian-CodeMix - FIRE 2020 (Chakravarthi et al., 2020b) is the first shared task on offensive speech identification in Dravidian languages.

Researchers have used a wide variety of techniques for the identification of offensive language. Recently, (Ranasinghe and Zampieri, 2020) (work published after the first version of this paper was submitted) used the XLM-RoBERTa model for offensive language identification in Bengali, Hindi, and Spanish; they show that the XLM-R model beats all previous approaches for offensive language detection. (Saha et al., 2019) used an LGBM classifier on top of a combination of multilingual BERT and LASER pre-trained embeddings in HASOC-19. (Mishra and Mishra, 2019) fine-tuned monolingual and multilingual BERT-based models to achieve good results in identifying hate speech. (Risch and Krestel, 2020) used an ensemble of BERT models trained with different random seeds for aggression detection on social media text, which is the inspiration behind our ensembling strategy with XLM-RoBERTa models (a minimal sketch is given below). There has been little research on text classification in Dravidian languages and, so far, no research on offensive speech identification for Dravidian languages; (Thomas and Latha, 2020) use a simple LSTM model for sentiment analysis in Malayalam. Our work addresses this gap in offensive speech identification methods for Dravidian languages, and the proposed systems can be extended to other Indian and foreign languages as well.
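The sketch below illustrates how such a seed ensemble could be combined at inference time by averaging class probabilities across XLM-RoBERTa classifiers fine-tuned with different random seeds. The checkpoint directory names are hypothetical stand-ins for locally saved fine-tuned models, and the averaging scheme is a common choice rather than a detail confirmed by the paper.

```python
# Hypothetical sketch: probability-averaging ensemble of XLM-RoBERTa
# classifiers fine-tuned with different random seeds.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed local checkpoints, one per random seed.
CHECKPOINTS = ["xlmr-offensive-seed1", "xlmr-offensive-seed2",
               "xlmr-offensive-seed3"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def ensemble_predict(texts):
    """Return softmax probabilities averaged over all ensemble members."""
    batch = tokenizer(texts, truncation=True, padding=True,
                      return_tensors="pt")
    probs = []
    for path in CHECKPOINTS:
        model = AutoModelForSequenceClassification.from_pretrained(path)
        model.eval()
        with torch.no_grad():
            logits = model(**batch).logits
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)  # shape: (batch_size, 2)

# Example: predictions = ensemble_predict(["comment 1", "comment 2"]).argmax(-1)
```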
3 Datasets

The datasets used in this work were taken from three shared tasks: HASOC-Dravidian-CodeMix - FIRE 2020 (Chakravarthi et al., 2020a,c,b,d; Chakravarthi, 2020), henceforth HDCM; Sentiment Analysis for Dravidian Languages in Code-Mixed Text (Chakravarthi et al., 2020a,d), henceforth SADL; and Offensive Language Identification in Dravidian Languages (Chakravarthi et al., 2021; Hande et al., 2020; Chakravarthi et al., 2020d,a), henceforth ODL. As part of HDCM, there were two binary classification tasks, with the second task having two subtasks. The objective of all of the tasks was the same: given a YouTube comment, classify it as offensive or not offensive. However, the format and language of the data provided for the tasks differed: code-mixed Malayalam for Task 1 (henceforth Mal-CM), Tanglish for Task 2a (henceforth Tanglish), and Manglish for Task 2b (henceforth Manglish). From ODL, we used the Kanglish dataset provided. Originally, the task proposed in ODL is a multi-class classification problem with multiple offensive categories. However, to maintain uniformity among the datasets, we divided the classes into two: offensive and not offensive. The Offensive-Targeted-Insult-Individual, Offensive-Targeted-Insult-Group, Offensive-Untargeted, and Offensive-Targeted-Insult-Other classes in the dataset are relabeled as offensive, and all non-Kannada posts are removed from the dataset. As far as SADL is concerned, we used the given sentiment analysis datasets as helper datasets in the inter-task transfer learning technique (4.5). OLID (Zampieri et al., 2019) is a dataset for offensive language identification in English, and we used it as a helper dataset for the inter-language transfer learning technique (4.4). It consists of 14,100 tweets; each tweet is annotated as offensive or not offensive (sub-task A). The statistics of the offensive speech datasets used in this work are presented in Table 1.

Task      Train (OFF)  Train (NOT)  Dev  Test
Mal-CM    567          2633         400  400
Tanglish  1980         2020         -    940
Manglish  1952         2048         -    951
Kanglish  1151         3544         586  778
OLID      4400         8840         -    860

Table 1: Statistics of the offensive speech datasets used in this work.

4 Methodology

[...] tweets where users tend to write informally using multiple languages. Moreover, translating romanized non-English words into that particular language does not make sense in our context. For example, translating the word "Maram" (which means tree in Tamil) directly into Tamil would seriously affect the semantics of the entire sentence in which the word is present, because "Maram" would be treated as an English word. In many cases, valid translations from English into a non-English language would not even be available for such words. So, as a solution to this problem, we propose selective transliteration and translation of the text. In effect, this process of converting romanized or code-mixed text (for example, Tanglish) transliterates the romanized native-language (Tamil) words in the text into Tamil script and selectively translates the English words in the text into Tamil.
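As a rough illustration of the selective conversion just described, the sketch below tokenizes a romanized or code-mixed sentence, translates tokens recognized as English words, and transliterates the remaining tokens into the native script. The English-vocabulary lookup and the dictionary-backed stand-ins for the translation and transliteration back-ends are simplifying assumptions for illustration; the back-ends actually used in our experiments may differ.

```python
# Minimal sketch of selective translation and transliteration.
# Assumptions (for illustration only, not necessarily the exact implementation):
#  - a token is treated as English if it appears in an English vocabulary set;
#  - translate_en and transliterate_roman stand in for whatever translation
#    and transliteration back-ends are available.
from typing import Callable, Set

def selective_convert(text: str,
                      english_vocab: Set[str],
                      translate_en: Callable[[str], str],
                      transliterate_roman: Callable[[str], str]) -> str:
    """Convert romanized / code-mixed text into the native script.

    English words are translated into the target language; all other tokens
    are assumed to be romanized native-language words and are transliterated.
    """
    converted = []
    for token in text.split():
        if token.lower() in english_vocab:
            converted.append(translate_en(token))         # e.g. "tree" -> native word
        else:
            converted.append(transliterate_roman(token))  # e.g. "maram" -> Tamil script
    return " ".join(converted)

# Toy usage with dictionary-backed stand-ins for the real back-ends:
if __name__ == "__main__":
    vocab = {"tree"}                    # tiny illustrative English vocabulary
    toy_translate = {"tree": "மரம்"}     # English -> Tamil
    toy_translit = {"nalla": "நல்ல"}     # romanized Tamil -> Tamil script
    print(selective_convert("nalla tree", vocab,
                            lambda w: toy_translate.get(w.lower(), w),
                            lambda w: toy_translit.get(w.lower(), w)))
    # -> "நல்ல மரம்"
```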