Code-Switching Patterns Can Be an Effective Route to Improve Performance of Downstream NLP Applications: a Case Study of Humour, Sarcasm and Hate Speech Detection
Total Page:16
File Type:pdf, Size:1020Kb
Code-Switching Patterns Can Be an Effective Route to Improve Performance of Downstream NLP Applications: A Case Study of Humour, Sarcasm and Hate Speech Detection Srijan Bansal1, G Vishal2, Ayush Suhane3, Jasabanta Patro4, Animesh Mukherjee5 Indian Institute of Technology, Kharagpur, West Bengal, India - 721302 f1srijanbansal97, 2vishal g, 3ayushsuhane99 [email protected], [email protected], [email protected] Abstract 1.1 Motivation The primary motivation for the current work is de- In this paper we demonstrate how code- rived from (Vizca´ıno, 2011) where the author notes switching patterns can be utilised to improve – “The switch itself may be the object of humour”. various downstream NLP applications. In par- In fact, (Siegel, 1995) has studied humour in the ticular, we encode different switching features Fijian language and notes that when trying to be to improve humour, sarcasm and hate speech detection tasks. We believe that this simple comical, or convey humour, speakers switch from linguistic observation can also be potentially Fijian to Hindi. Therefore, humour here is pro- helpful in improving other similar NLP appli- duced by the change of code rather than by the cations. referential meaning or content of the message. The paper also talks about similar phenomena observed 1 Introduction in Spanish-English cases. In a study of English-Hindi code-switching and Code-mixing/switching in social media has become swearing patterns on social networks (Agarwal commonplace. Over the past few years, the NLP et al., 2017), the authors show that when people research community has in fact started to vigor- code-switch, there is a strong preference for swear- ously investigate various properties of such code- ing in the dominant language. These studies to- switched posts to build downstream applications. gether lead us to hypothesize that the patterns of The author in (Hidayat, 2012) demonstrated that switching might be useful in building various NLP inter-sentential switching is preferred more than applications. intra-sentential switching by Facebook users. Fur- ther while 45% of the switching was done for real 1.2 The present work lexical needs, 40% was for discussing a particular To corroborate our hypothesis, in this paper, we topic and 5% for content classification. In another consider three downstream applications – (i) hu- study (Dey and Fung, 2014) interviewed Hindi- mour detection (Khandelwal et al., 2018), (ii) sar- English bilingual students and reported that 67% of casm detection (Swami et al., 2018) and (iii) hate the words were in Hindi and 33% in English. Re- speech detection (Bohra et al., 2018) for Hindi- cently, many down stream applications have been English code-switched data. We first provide em- designed for code-mixed text. (Han et al., 2012) pirical evidence that the switching patterns between attempted to construct a normalisation dictionary native (Hindi) and foreign (English) words distin- offline using the distributional similarity of tokens guish the two classes of the post, i.e., humour vs plus their string edit distance. (Vyas et al., 2014) non-humour or sarcastic vs non-sarcastic or hateful developed a POS tagging framework for Hindi- vs non-hateful. We then featurise these patterns English data. and pump them in the state-of-the-art classification More nuanced applications like humour detec- models to show the benefits. We obtain a macro-F1 tion (Khandelwal et al., 2018), sarcasm detec- improvement of 2.62%, 1.85% and 3.36% over the tion (Swami et al., 2018) and hate speech detec- baselines on the tasks of humour detection, sarcasm tion (Bohra et al., 2018) have been targeted for detection and hate speech detection respectively. code-switched data in the last two to three years. As a next step, we introduce a modern deep neu- 1018 Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1018–1023 July 5 - 10, 2020. c 2020 Association for Computational Linguistics ral model (HAN - Hierarchical Attention Network Sarcasm: Sarcasm dataset released by (Swami (Yang et al., 2016)) to improve the performance et al., 2018) contains tweets that have hashtags of the models further. Finally, we concatenate the #sarcasm and #irony. Authors used other keywords switching features in the last hidden layer of the such as ‘bollywood’, ‘cricket’ and ‘politics’ to col- HAN and pass it to the softmax layer for classifi- lect sarcastic tweets from these domains. In this cation. This final architecture allows us to obtain a case, the dataset is heavily unbalanced (see Ta- macro-F1 improvement of 4.9%, 4.7% and 17.7% ble1). Here the positive class refers to sarcastic over the original baselines on the tasks of humour tweets and the negative class means non-sarcastic detection, sarcasm detection and hate speech detec- tweets. Some representative examples from our tion respectively. data showing the point where the sarcasm starts and ends. 2 Dataset • said aib filthy pandit ji, sarcasmstart aap jo bol rahe ho woh kya shuddh sanskrit hai We consider three datasets consisting of Hindi (hi) sarcasm ? irony shameonyou4 - English (en) code-mixed tweets scraped from end • irony bappi lahiri sings sarcasm sona Twitter for our experiments - Humour, Sarcasm start nahi chandi nahi yaar toh mila arre pyaar kar and Hate. We discuss the details of each of these le sarcasm 5 datasets below. end Hate speech:(Bohra et al., 2018) created the cor- + - Tweets Tokens Switching* pus using the tweets posted online in the last five Humour 1755 1698 3453 9851 2.20 years which have a good propensity to contain hate Sarcasm 504 4746 5250 14930 2.13 speech (see Table1). Authors mined tweets by Hate 1661 2914 4575 10453 4.34 selecting certain hashtags and keywords from ‘pol- Table 1: Dataset description (* denotes average/tweet). itics’, ‘public protests’, ‘riots’ etc. The positive class refers to a hateful tweets while the negative 6 Humour: Humour dataset was released by (Khan- class means non-hateful tweets . An example of delwal et al., 2018) and has Hindi-English code- hate tweet showing the point of switch correspond- mixed tweets from domains like ‘sports’, ‘politics’, ing to the start and the end of the hate component. • I hate my university, hatestart koi us jagah ko ‘entertainment’ etc. The dataset has uniform dis- 7 tribution of tweets in each category to yield better aag laga dey hateend . supervised classification results (see Table1) as 3 Switching features described by (Du et al., 2014). Here the positive class refers to humorous tweets while the negative In this section, we outline the key contribution of class corresponds to non-humorous tweet. Some this work. In particular, we identify how patterns representative examples from the data showing the of switching correlate with the tweet text being hu- point of switch corresponding to the start and the morous, sarcastic or hateful. We outline a synopsis end of the humour component. of our investigation below. • women can crib on things like humour start 3.1 Switching and NLP tasks bhaiyya ye shakkar bahot zyada meethi hai 1 In this section, we identify how switching be- humourend, koi aur quality dikhao • shashi kapoor trending on mothersday havior is related to the three NLP tasks at our 4 how apt, humourstart mere paas ma hai Gloss: said aib filthy pandit ji, whatever you are telling is 2 it pure sanskrit? irony shameonyou. humourend 5irony bappi lahiri sings Gloss: doesn’t matter you do not • political journey of kejriwal, from get gold or silver, you have got a friend to love. 6 humourstart mujhe chahiye swaraj The dataset released by this paper only had the hate/non- hate tags for each tweet. However, the language tag for each humourend to humourstart mujhe chahiye 3 word required for our experiments was not available. Two laluraj humourend of the authors independently language tagged the data and obtained an agreement of 98.1%. While language tagging, we 1Gloss: women can crib on things like brother the sugar is noted that the dataset is a mixed bag including hate speech, a little more sweet, show a different quality. offensive and abusive tweets which have already been shown 2Gloss: shashi kapoor trending on mothersday how apt, I to be different in earlier works (Waseem et al., 2017). How- have my mother with me. ever, this was the only Hindi-English code-mixed hate speech 3Gloss: political journey of kejriwal, from I want swaraj dataset available. to I want laluraj. 7Gloss: I hate my university. Someone burn that place. 1019 hand. Let Q be the property that a sentence has Feature name Description en hi switches The number of en to hi switches in a sentence en words which are surrounded by hi words, that hi en switches The number of hi to en switches in a sentence is there exists an English word in a Hindi con- V The total number of switches in a sentence fraction en Fraction of English words in a sentence text. For instance, the tweet koi hi to hi pray en fraction hi Fraction of Hindi words in a sentence karo hi mere hi liye hi bhi hi satisfies the property mean hi en Mean of hi en vector stddev hi en Standard deviation of hi en vector Q. However, bumrah hi dono hi wicketo hi ke hi mean en hi Mean of en hi vector beech hi gumrah hi ho hi gaya hi does not satisfy stddev en hi Standard deviation of en hi vector Q. Table 3: Description of the switching features. We performed a statistical analysis to determine the correlation between the switching patterns and a classification task at hand (represented by T ). Let While we have tested on one language pair (Hindi- us denote the probability that a tweet belongs to English), our hypothesis is generic and has been a positive class for a task T given that it satisfies already noted by linguists earlier (Vizca´ıno, 2011).