Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
Total Page:16
File Type:pdf, Size:1020Kb
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada Bharathi Raja Chakravarthi1, Ruba Priyadharshini2, Navya Jose3 Anand Kumar M 4,Thomas Mandl 5, Prasanna Kumar Kumaresan3, Rahul Ponnusamy3, Hariharan R L 4, John Philip McCrae 1 and Elizabeth Sherly3 1National University of Ireland Galway, 2Madurai Kamaraj University, 3Indian Institute of Information Technology and Management-Kerala, 4National Institute of Technology Karnataka Surathkal, 5University of Hildesheim Germany [email protected] Abstract Research in hate speech detection (Kumar et al., Detecting offensive language in social me- 2018) or offensive language detection (Zampieri dia in local languages is critical for mod- et al., 2020; Mandl et al., 2020) using Natural Lan- erating user-generated content. Thus, the guage Processing (NLP) has significantly improved field of offensive language identification for in recent years. However, the work on under- under-resourced languages like Tamil, Malay- resourced languages is still limited (Chakravarthi, alam and Kannada is of essential impor- 2020). For example, under-resourced languages tance. As user-generated content is often such as Tamil, Malayalam, and Kannada lack tools code-mixed and not well studied for under- resourced languages, it is imperative to cre- and datasets (Chakravarthi et al., 2020a,c; Thava- ate resources and conduct benchmark stud- reesan and Mahesan, 2019, 2020a,b). Recently, ies to encourage research in under-resourced shared sentiment analysis for Tamil and Malay- Dravidian languages. We created a shared alam by Chakravarthi et al.(2020d) and offensive task on offensive language detection in Dra- language identification in Tamil and Malayalam vidian languages. We summarize the dataset by Chakravarthi et al.(2020b) paved the wave for this challenge which are openly avail- for more research on Dravidian languages. Tamil, able at https://competitions.codalab. Malayalam and Kannada belong to the Dravidian org/competitions/27654, and present an overview of the methods and the results of the language family and are spoken by some 220 mil- competing systems. lion people in the Indian subcontinent, Singapore and Sri Lanka (Krishnamurti, 2003). It is essen- 1 Introduction tial to create NLP systems such as hate speech The dawn of social media helped to bridge the gap detection or offensive language detection in local between the borders and paved the way for the peo- languages since most user-generated content is in ple to communicate or express their opinions more such languages. easily than any other time in human history (Edo- The Dravidian language family is reported as somwan et al., 2011). But with this advent of social being approximately 4500 years old by Bayesian media, the platforms like YouTube, Facebook, and phylogenetic analysis of cognate-coded lexical data Twitter not only helped information sharing and (Kolipakam et al., 2018). The Dravidian languages networking but also became the place that target, scripts are first attested in the 580 BCE as Tamili1 defame and marginalize people merely on the basis script inscribed on the pottery of Keezhadi, Siva- of their physical appearance, religion or sexual ori- gangai and Madurai district of Tamil Nadu, India entation (Keipi et al., 2016; Benikova et al., 2018; (Sivanantham and Seran, 2019)2 by Tamil Nadu Pamungkas et al., 2020). Social media has moulded State Department of Archaeology and Archaeo- itself into its specialised tool with which the people logical Survey of India. The modern Dravidian are verbally threatened and cornered not for what languages have its own scripts evolved from Tamili they have done but for who they are (Maitra and scripts. The scripts are alpha-syllabic, belonging McGowan, 2012; Patton et al., 2013). Undoubt- to a family of the abugida writing systems that edly, the infiltration depth and width of this ‘digital are partially alphabetic and partially syllable-based wonder’ equipped the erstwhile ‘invisible and so- cially paralysed’ communities to play cards, if not 1also called Damili or Dramili or Tamil-Brahmi extravagantly noticeable (Barnidge et al., 2019). 2Keeladi-Book-English-18-09-2019.pdf 133 Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 133–145 April 20, 2021 ©2021 Association for Computational Linguistics (Albert et al., 1985; Rajendran, 2004). The writing 2 Task Description system of Tamili was explained in the old grammar Offensive language identification for Dravidian lan- text Tolkappiyam which dates are various proposed guages at different levels of complexity were devel- between 9th century BCE to 6nd century BCE (Pil- oped following the work of (Zampieri et al., 2019). lai, 1904; Swamy, 1975; Zvelebil, 1991; Takahashi, It was customized to our annotation method from 1995) and in the Jaina work Samavayanga Sutta three-level hierarchical annotation schema. A new and Pannavana Sutta, these two Jain works date to category Not in intended language was added to 3rd-4th century BCE (Salomon, 1998). include comments written in a language other than Up until around the 15th-century CE, Malay- the Dravidian languages. Annotations decision for alam was the west coast dialect of Tamil (Black- offensive language categories were split into six burn, 2006). Due to geographical isolation by the labels to simplify the annotation process. steep Western Ghats from the main speech commu- • Not Offensive: Comment/post does not have nity, the dialect eventually evolved into a distinct offence, obscenity, swearing, or profanity. language by the 15th century (Sekhar, 1951). The first literary work in Malayalam was Ramacari- • Offensive Untargeted: Comment/post have tam (”Deeds of Rama”) was written using Tamil offence, obscenity, swearing, or profanity not Grantha script which was used to write Pali, Prakrit directed towards any target. These are the and foreign words in Tamil (Andronov, 1996). The comments/posts which have inadmissible lan- first printed work in an Indian language and script guage without targeting anyone. was a Roman Catholic catechism translated by Hen- rique which is Thambiran Vanakkam (Doctrina • Offensive Targeted Individual: Com- Christam en Lingua Malauar Tamul in Portuguese) ment/post have offence, obscenity, swearing, on 20 October 1,578 CE at Quilon, Venad (present or profanity which targets an individual. day Kerala) (George, 1972). Similarly, Kannada • Offensive Targeted Group: Comment/post also split from Tamil by 9th century CE; for exam- have offence, obscenity, swearing, or profan- ple, Taayviru is a term of Kannada origin and is ity which targets a group or a community. found in a Tamil inscription from the 4th-century CE. The Kappe Arabhatta record of 700 CE is the • Offensive Targeted Other: Comment/post oldest extant form of Kannada poetry in the Tripadi have offence, obscenity, swearing, or profan- metre, it is based in part on Kavyadarsha (Rawlin- ity which does not belong to any of the previ- son, 1930; Hande et al., 2020). ous two classes. However, user-generated content is often • Not in indented language: If the comment adopted to Roman script for typing due to histor- is not in the intended language. For example, ical reason and modern computer keyboard lay- in the Malayalam task, if the sentence does outs. Hence, the majority of user-generated data for not contain Malayalam written in Malayalam these Dravidian languages are code-mixed (Priyad- script or Latin script, then it is not Malayalam. harshini et al., 2020; Jose et al., 2020). Due to Sample of not-offensive comments in our dataset the growth of social media platforms across the provided for the participants is given below with world and the possibility of writing content without the corresponding English glosses.. any moderation, users write content in multilingual code-switching without grammatical restrictions • Tamil-English: Paka thana poro movie la and using non-native scripts (Chakravarthi et al., Enna irukunu baki ellam 2019). In linguistics, code-switching is switching English: “You will see what is the movie” between two or more language in the same utter- ance. This explosion of code-mixed user-generated Sample offensive language text from the dataset content in the social media platforms aroused in- terest in NLP. In this paper, we dicuss about the • Tamil-English: Ghibran p—a vitta ungalukku offensive language identification shared task for vera aale kedaika laya da. Tamil, Malayalam and Kannada languages and the English: “Dont you get anyone else other than participant’s submissions. Ghibran V—a” 134 Language Tamil Malayalam Kannada Number of words 511,734 202,134 65,702 Vocabulary size 94,772 40,729 20,796 Number of comments 43,919 20,010 7,772 Number of sentences 52,617 23,652 8,586 Average number of words per sentence 11 10 8 Average number of sentences per comment 1 1 1 Table 1: Corpus statistics for Offensive Language Identification Class Tamil Malayalam Kannada Not Offensive 31,808 17,697 4,336 Offensive Untargeted 3,630 240 278 Offensive Targeted Individual 2,965 290 628 Offensive Targeted Group 3,140 176 418 Offensive Targeted Others 590 - 153 Not in indented language 1,786 1,607 1,898 Total 43,919 20,010 7,772 Table 2: Offensive language Identification Dataset Distribution. This is an example of Offensive Targeted Indi- Comment Scraper tool3 was used to collect com- vidual. Mr. Ghibran is a popular music composer ments from YouTube social media. These com- in Tamil films, and the author tries to insult him. ments were utilized to make the datasets for of- fensive language identification shared task. We • Malayalam-English: Verupikkal manju ul- collected comments that contain code-mixing at lathu ozhichal baki ellam super aanu ee various levels of the text, with enough representa- padam. tion for each sentiment in all the three languages. 4 English: “Except the presence of disgusting Langdetect library was used to detect different Manju, everything is super in this movie” languages apart and eliminate the unintended lan- guages.