IIITK@DravidianLangTech-EACL2021: Offensive Language Identification and Meme Classification in Tamil, Malayalam and Kannada

Nikhil Kumar Ghanghor (Indian Institute of Information Technology Kottayam, India), Parameswari Krishnamurthy (University of Hyderabad, India), Sajeetha Thavareesan (Eastern University, Sri Lanka), Ruba Priyadharshini (ULTRA Arts and Science College, Tamil Nadu, India), Bharathi Raja Chakravarthi (National University of Ireland Galway)
[email protected]

Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 222–229, April 20, 2021. ©2021 Association for Computational Linguistics.

Abstract

This paper describes the IIITK team's submissions to the offensive language identification and troll meme classification shared tasks for Dravidian languages at the DravidianLangTech 2021 workshop at EACL 2021. We used transformer-based pretrained models, along with customized versions of them with custom loss functions. State-of-the-art pretrained CNN models were also used for the image-related tasks. Our best configuration for Tamil troll meme classification achieved a weighted average F1 score of 0.55, and for offensive language identification our systems achieved weighted F1 scores of 0.75 for Tamil, 0.95 for Malayalam and 0.71 for Kannada. Our rank for Tamil troll meme classification is 2, and our ranks for offensive language identification in Tamil, Malayalam and Kannada are 3, 3 and 4 respectively. We have open-sourced our code for all models across both tasks on GitHub (https://github.com/nikhil6041/OLI-and-Meme-Classification).

1 Introduction

With the rise of deep learning in the past decade, we have seen some groundbreaking milestones achieved by various machine learning (ML) models. The AlexNet architecture (Krizhevsky et al., 2012) achieved the first significant milestone on the ILSVRC challenge, showing state-of-the-art results at the time on image classification tasks. Although transfer learning had been proposed much earlier, it saw little use due to the lack of computational resources. The success of AlexNet reignited the use of transfer learning (Bozinovski, 2020), which leverages the knowledge gained by models with millions to billions of parameters trained on enormously large datasets: such models can then be fine-tuned on other computer vision tasks to achieve promising results. Another breakthrough in the community was the release of BERT (Devlin et al., 2019), built upon the transformer architecture (Vaswani et al., 2017), which with several innovative training strategies topped the GLUE benchmark (Wang et al., 2019) at the time of its release. The introduction of BERT paved the way for transfer learning on Natural Language Processing (NLP) tasks. However, one particular issue with these huge pretrained models is that they are often not available for under-resourced languages.

The Offensive Language Identification task aimed to detect the presence of offensive content in texts of such under-resourced languages (Mandl et al., 2020; Chakravarthi et al., 2020b; Ghanghor et al., 2021; Yasaswini et al., 2021; Puranik et al., 2021). The texts were in the native language scripts as well as of a code-mixed nature (Chakravarthi, 2020a). Code-mixing means the mixing of two or more languages or language varieties in speech (Priyadharshini et al., 2020; Jose et al., 2020). The Meme Classification task focused on classifying memes into troll and not-troll. Along with the memes, their extracted captions were also provided, in Tamil native script and in code-mixed form (Hegde et al., 2021). All three languages for which datasets were provided across the two tasks (Kannada, Malayalam and Tamil) are under-resourced and belong to the Dravidian language family (Chakravarthi, 2020b). The Dravidian language family is dated to approximately 4,500 years old (Kolipakam et al., 2018). In Jain and Buddhist texts written in Pali and Prakrit in ancient times, Tamil was referred to as Damil; Dramil then became Dravida and Dravidian. The Tamil language family includes Kannada, Telugu, Tulu and Malayalam. Tamil scripts are first attested in 580 BCE as the Tamili script (also called Damili, Dramili or Tamil-Brahmi) inscribed on pottery from Keezhadi, in the Sivagangai and Madurai districts of Tamil Nadu, India (Sivanantham and Seran, 2019). The scripts of the Dravidian languages evolved from the Tamili script.

In this paper, we showcase our efforts on the two shared tasks on Dravidian languages. We participated in the shared task on Offensive Language Identification (Chakravarthi et al., 2021) for Tamil, Malayalam and Kannada, and in the shared task on Tamil Troll Meme Classification (Suryawanshi and Chakravarthi, 2021). We used Transformer and BERT based models in our work. This paper summarizes our work on the two shared tasks. Each of the following sections is subdivided into two subsections to detail every aspect of our procedure for each task.

2 Related Work

Offensive content detection from texts has been part of several conferences as a challenge task. Some of the most popular tasks have been part of SemEval (https://semeval.github.io/), organized by the International Workshop on Semantic Evaluation. The tasks covered in SemEval have varied from offensive content detection in monolingual corpora to code-mixed corpora collected from social media comments. However, most of these competitions ran on datasets in English and other high-resourced languages from the West. HASOC-Dravidian-CodeMix-FIRE2020 (https://competitions.codalab.org/competitions/25295) was one such competition organized for under-resourced languages such as Tamil and Malayalam.

Ranasinghe et al. (2020) at HASOC-Dravidian-CodeMix-FIRE2020 used traditional Machine Learning (ML) methods such as the Naive Bayes Classifier (NBC), Support Vector Machines (SVMs) and Random Forests, along with pretrained transformer models such as XLM-RoBERTa (XLMR) (Conneau et al., 2020) and BERT, for offensive content identification in the code-mixed datasets (Tamil-English and Malayalam-English). Arora (2020) at HASOC-Dravidian-CodeMix-FIRE2020 used ULMFiT (Howard and Ruder, 2018) to pre-train on a synthetically generated code-mixed dataset and then fine-tuned it on the downstream text classification tasks.

Meme classification has been one of the major problems that various social media platforms have aimed to solve in the past decade. Various methods have been proposed to classify memes. However, it is still an unsolved problem because of its multimodal nature, which makes it extremely hard for a system to classify memes correctly (Suryawanshi et al., 2020a). Classification of memes requires both a visual and a linguistic approach. Hu and Flaxman (2018) did work similar to meme analysis: they used a Tumblr dataset of memes, consisting of images with several associated tags. They dealt with textual and visual features independently and built a model to predict the sentiment behind those pictures with their corresponding tags, obtaining good results. Using this idea of handling the different modalities independently, different researchers came forward with different implementations for the problem of meme analysis. This problem became a flagship task of SemEval-2020. The best performance for sentiment analysis was obtained by Keswani et al. (2020), who used several different models, considering both modalities and several combinations of them. However, their best result came from considering the textual features only, using BERT for the sentiment analysis of the memes. The overview paper on memotion analysis (Sharma et al., 2020) summarizes the participants' work and gives a clearer picture of the status of the memotion analysis task.

Recently, Facebook AI conducted the hateful memes challenge (https://ai.facebook.com/blog/hateful-memes-challenge-and-data-set/) in 2020, whose winner achieved the best results using several visual-linguistic transformer models. Zhu (2020) used an ensemble of four different visual-linguistic models: VL-BERT (Su et al., 2020), UNITER (Chen et al., 2020), VILLA (Gan et al., 2020) and ERNIE-ViL (Yu et al., 2020).

3 Dataset Description

The competition organizers released datasets for three different languages for the Offensive Language Identification shared task, namely a Kannada dataset (Hande et al., 2020), a Malayalam dataset (Chakravarthi et al., 2020a) and a Tamil dataset (Chakravarthi et al., 2020c). However, for the Meme Classification shared task, a dataset (Suryawanshi et al., 2020b) was released only for the Tamil language. The train, dev and test distributions for both tasks are as follows:

Distribution   Kannada   Malayalam   Tamil
Train          6,217     16,010      35,139
Dev            777       1,999       4,388
Test           778       2,001       4,392

Table 1: Offensive Language Identification dataset distribution.

For the Meme Classification task, the competition organizers provided us with the images and the texts extracted from them.

The models we used are built upon the popular transformer architecture. We also have word embeddings (Bengio et al., 2003) available for representing words in a high-dimensional vector space, which serve as a starting component of different neural network models.

We used several pretrained multilingual models in our work and fine-tuned them on the offensive language identification task. Most of the work was done using the huggingface transformers library. One key issue with the dataset was its skewness. To address the class imbalance, we used the Negative Log Likelihood (NLL) loss with class weights and the Self-Adjusting Dice (SAdice) loss (Li et al., 2020).
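To illustrate the class-weighting scheme, the following minimal sketch shows how inverse-frequency class weights scale a per-example NLL loss. The class counts and the weighting formula here are illustrative assumptions, not the shared-task statistics or our exact configuration:

```python
import math

# Illustrative class counts for a skewed training set (made-up numbers):
# one dominant majority class and two rare classes.
counts = [9000, 700, 300]
total = sum(counts)
k = len(counts)

# Inverse-frequency weights, w_c = N / (K * n_c): rare classes receive
# proportionally larger weights, so errors on them cost more.
weights = [total / (k * n) for n in counts]

def weighted_nll(log_probs, target):
    """Class-weighted negative log-likelihood for a single example,
    given log class probabilities and the gold class index."""
    return -weights[target] * log_probs[target]

# The same predicted probability (0.7) on the gold class is penalized
# far more when the gold class is rare (index 2) than when it is the
# majority class (index 0).
rare_loss = weighted_nll([math.log(0.2), math.log(0.1), math.log(0.7)], 2)
majority_loss = weighted_nll([math.log(0.7), math.log(0.1), math.log(0.2)], 0)
```

In a PyTorch pipeline the same effect is obtained batch-wise by passing a weight tensor to `torch.nn.NLLLoss(weight=...)`.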
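The Self-Adjusting Dice loss can be sketched in its per-example binary form, after Li et al. (2020); the default alpha and gamma values below are illustrative, not the tuned values from our runs:

```python
def self_adjusting_dice(p, y, alpha=1.0, gamma=1.0):
    """Per-example self-adjusting Dice loss (binary form, after
    Li et al., 2020). p is the predicted probability of the positive
    class and y the gold label in {0, 1}. The factor (1 - p)^alpha
    decays the contribution of examples the model already classifies
    confidently, and gamma smooths numerator and denominator."""
    a = ((1.0 - p) ** alpha) * p
    return 1.0 - (2.0 * a * y + gamma) / (a + y + gamma)

# A confident wrong prediction on a positive example is penalized more
# than an uncertain one; confident correct predictions are deliberately
# down-weighted by the (1 - p)^alpha factor.
uncertain = self_adjusting_dice(0.5, 1)
confident_wrong = self_adjusting_dice(0.01, 1)
```

In practice the loss is computed per class and example and averaged over the batch, with alpha and gamma tuned on the dev set.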
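Since all of our results are reported as weighted-average F1, the following sketch (with toy labels, purely for illustration) shows how that metric aggregates per-class F1 scores, weighting each class by its share of the gold labels:

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Weighted-average F1: per-class F1 scores averaged using each
    class's support (its share of the gold labels) as the weight."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1
    return score

# Toy example: the majority "not-troll" class dominates the weighting.
gold = ["not", "not", "not", "troll"]
pred = ["not", "not", "troll", "troll"]
```

This matches what `sklearn.metrics.f1_score(..., average="weighted")` computes, which is how skewed label distributions like those in Table 1 are usually scored.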