
Offensive language identification in Dravidian code-mixed social media text

Sunil Saumya1, Abhinav Kumar2 and Jyoti Prakash Singh2
1Indian Institute of Information Technology Dharwad, Karnataka, India
2National Institute of Technology Patna, Bihar, India
[email protected], [email protected], [email protected]

Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 36–45, April 20, 2021. ©2021 Association for Computational Linguistics

Abstract

Hate speech and offensive language recognition on social media platforms has been an active field of research in recent years. In countries where English is not the native language, social media texts are mostly in code-mixed or script-mixed/switched form. The current study presents extensive experiments using multiple machine learning, deep learning, and transfer learning models to detect offensive content on Twitter. The datasets used for this study are in Tanglish (Tamil and English) and Manglish (Malayalam and English) code-mixed form, and in Malayalam script-mixed form. The experimental results showed that 1- to 6-gram character TF-IDF features are better suited for this task. The best performing models were Naive Bayes, logistic regression, and a vanilla neural network for the Tamil code-mixed, Malayalam code-mixed, and Malayalam script-mixed datasets, respectively, outperforming more popular transfer learning models such as BERT and ULMFiT as well as hybrid deep models.

1 Introduction

Hate speech is generally defined as any communication that humiliates or denigrates an individual or a group based on characteristics such as colour, ethnicity, sexual orientation, nationality, race, and religion. Due to the huge volume of user-generated content on the web, particularly on social networks such as Twitter and Facebook, the problem of detecting and possibly restricting hate speech on these platforms has become a very critical issue (Del Vigna et al., 2017). Hate speech lasts forever on these social platforms compared to physical abuse, and it severely affects the individual's mental state, causing depression, sleeplessness, and even suicide (Ullmann and Tomalin, 2020).

Owing to the high frequency of posts, detecting hate speech on social media manually is almost impossible. Recent research has indicated that automated hate speech detection is a more reliable solution. Davidson et al. (2017) extracted n-gram TF-IDF features from tweets and used logistic regression to classify each tweet into hate, offensive, and non-offensive classes. Another model, for the detection of cyberbullying instances, was presented by Kumari and Singh (2020), using a genetic algorithm to optimize the distinguishing features of multimodal posts. Agarwal and Sureka (2017) used linguistic, semantic, and sentiment features to detect racial content. LSTM- and CNN-based models for recognising hate speech in social media posts were explored by Kapil et al. (2020). Badjatiya et al. (2017) exploited semantic word embeddings to classify each tweet into racist, sexist, and neither classes. Another deep learning model for the detection of hate speech was proposed by Paul et al. (2020). However, most of the works on hate speech detection were validated with English datasets only.

In a country such as India, the majority of people on social media use at least two languages, primarily English and Hindi. Such texts are considered bilingual. In a bilingual setting, the script of the entire post may be the same while words come from both languages; this is termed code-mixed (or mixed-code) text. A few popular code-mixed combinations in India are English and Hindi (Hinglish), Tanglish (Tamil and English) (Chakravarthi et al., 2020c), Manglish (Malayalam and English) (Chakravarthi et al., 2020a), Kanglish (Kannada and English) (Hande et al., 2020), and so on. The Tamil language is one of the world's longest-enduring classical languages, with a history tracing back to 600 BCE. Tamil writing is dominated by verse, particularly Sangam literature, which is composed of poems written between 600 BCE and 300 CE. The foremost Tamil author was the poet and philosopher Thiruvalluvar, who composed the Tirukkural, a collection of compositions on ethics, politics, and love widely considered the finest work of Tamil literature. Tamil has the oldest extant literature among the Dravidian languages, and all Dravidian languages evolved from classical Tamil (Thavareesan and Mahesan, 2019, 2020a,b). Even though these languages have their own scripts, code-mixed comments in them can still be found on the Internet (Chakravarthi, 2020b). Identifying hate content in such bilingual or code-mixed language is a very challenging task (Jose et al., 2020; Priyadharshini et al., 2020; Chakravarthi, 2020a). An automatic model trained in a monolingual context to detect hate posts may not yield the same results when tested on bilingual or code-mixed data (Puranik et al., 2021; Hegde et al., 2021; Yasaswini et al., 2021; Ghanghor et al., 2021b,a). This is because each system learns and recognises the words in its vocabulary; when a new word outside the vocabulary is encountered, it is marked as an unknown token that contributes nothing to the model's estimate. Therefore, when evaluated on a language in another script, the model's performance decreases.

The current study identifies hate content in Tanglish, Manglish, and Malayalam script-mixed tweets, validated on the dataset provided in the HASOC-Dravidian-CodeMix-FIRE2020 challenge (Chakravarthi et al., 2020b). The dataset proposed in the challenge was collected from Twitter. A variety of deep learning models are examined in the current paper to distinguish offensive posts from script-mixed posts. Along with these, we also examined transfer learning models such as BERT (Devlin et al., 2018a) and ULMFiT (Howard and Ruder, 2018) for the classification task.

The rest of the article is organized as follows: Section 2 presents an overview of articles proposed in the domain of hate or offensive speech. The task and dataset are described in Section 3. This is followed by the explanation of the proposed methodology in Section 4. The experimental results and discussion are presented in Sections 5 and 6. The paper concludes by highlighting the main findings in Section 7.

2 Related works

Hate speech identification in social media texts suffers from many challenges, such as code-mixed and script-mixed social media content. This section sheds light on a few state-of-the-art techniques presented to handle such issues.

Most of the approaches proposed for the detection of hate content were validated on monolingual datasets. It is relatively easy to build a monolingual model, since (i) data is readily accessible, (ii) the model learns a single-language vocabulary, and (iii) the unknown-token frequency is lower in the test data. Davidson et al. (2017) worked on 25,000 tweets in English and reported that tweets containing racist and homophobic contexts were hate speech, while tweets containing sexist contexts were offensive content. Other work on English data was proposed by Waseem and Hovy (2016), where n-gram features were extracted for identifying the sexist, racism, and none classes.

Apart from this, some works have been reported on multilingual datasets where scripts of two or more languages are mixed. Kumar et al. (2018) proposed a model for multilingual datasets containing aggressive and non-aggressive comments in English as well as Hindi from Facebook and Twitter. Samghabadi et al. (2018) used ensemble learning based on various machine learning classifiers such as logistic regression and SVM, with word n-grams, character n-grams, word embeddings, and sentiment as the feature set. They found that combined word and character n-gram features performed better than any individual feature. Srivastava et al. (2018) identified online social aggression in Facebook comments in a multilingual scenario and in Wikipedia toxic comments using stacked LSTM units followed by a convolution layer, with fastText as the word representation. They achieved 0.98 AUC for Wikipedia toxic comment classification, a weighted F1 score of 0.63 for the Facebook test set, and 0.59 for the Twitter test set. Mandl et al. (2020) and Chakravarthi et al. (2020d) presented several models and their results for English, Hindi, and German datasets, reporting that the best model was a long short-term memory-based network that could better capture the multilingual context. Bohra et al. (2018) extended the earlier research on hate speech detection to code-mixed tweets of Hindi and English. Kumari et al. (2021) presented a Convolutional Neural Network (CNN) and Binary Particle Swarm Optimization (BPSO) based model to classify multimodal posts with images and text into non-aggressive, medium-aggressive, and high-aggressive classes. Another multilingual context is code-mixing, where two languages are written in a single script. For example, Chakravarthi et al. (2021b), Chakravarthi and Muralidaran (2021), Chakravarthi et al. (2021a), and Suryawanshi and Chakravarthi (2021) proposed code-mixed Dravidian data in Tamil, Malayalam, and Kannada. Bohra et al. (2018) developed a Hinglish dataset from Twitter. They reported preliminary experimental results of Support Vector Machine (SVM) and Random Forest (RF) classifiers with n-gram and lexicon-based features, with an accuracy of 0.71.

The extracted features were fed to classifiers such as Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), and Random Forest (RF). The detailed performance report for word n-grams and character n-grams is shown in Section 5.

4.2 Neural learning-based models

Initially, the character n-gram TF-IDF features (1–6 grams) extracted in Section 4.1 were used as input to a vanilla neural network (VNN).
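The feature-extraction and classical-classifier setup described above can be sketched with scikit-learn. This is a minimal illustration, not the paper's exact configuration: the toy code-mixed tweets and labels below are invented placeholders, and only the character 1- to 6-gram TF-IDF setting is taken from the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy code-mixed examples; placeholders, not the HASOC-Dravidian-CodeMix data.
train_texts = [
    "intha padam semma mokka da",
    "super movie nalla irukku",
    "idhu romba worst padam",
    "vera level performance bro",
]
train_labels = ["offensive", "not-offensive", "offensive", "not-offensive"]

# Character 1- to 6-gram TF-IDF features, as used in Section 4.1.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 6))
X_train = vectorizer.fit_transform(train_texts)

# Two of the classical classifiers mentioned above; Naive Bayes and
# logistic regression were among the best performers reported.
for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    clf.fit(X_train, train_labels)
    pred = clf.predict(vectorizer.transform(["semma mokka padam"]))
    print(type(clf).__name__, pred[0])
```

Character n-grams are a natural fit for code-mixed text, since they remain informative even when romanized spellings vary and whole-word vocabularies break down.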
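The vanilla neural network over the same TF-IDF input can be approximated with scikit-learn's MLPClassifier. The hidden-layer size and training settings here are illustrative assumptions, not the architecture reported in the paper, and the data is again an invented placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Placeholder code-mixed examples, not the actual dataset.
texts = [
    "intha padam semma mokka da",
    "super movie nalla irukku",
    "idhu romba worst padam",
    "vera level performance bro",
]
labels = ["offensive", "not-offensive", "offensive", "not-offensive"]

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 6))
X = vec.fit_transform(texts)

# A simple feed-forward network on TF-IDF input; the single 64-unit
# hidden layer is an assumed configuration for illustration only.
vnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
vnn.fit(X, labels)
pred = vnn.predict(vec.transform(["nalla padam bro"]))
print(pred[0])
```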