Code-Switching Patterns Can Be an Effective Route to Improve Performance of Downstream NLP Applications: A Case Study of Humour, Sarcasm and Hate Speech Detection

Srijan Bansal, Vishal, Ayush Suhane, Jasabanta Patro, Animesh Mukherjee
Indian Institute of Technology, Kharagpur, West Bengal, India - 721302
{srijanbansal97, vishal g, ayushsuhane99}@iitkgp.ac.in, [email protected], [email protected]

Abstract

In this paper we demonstrate how code-switching patterns can be utilised to improve various downstream NLP applications. In particular, we encode different switching features to improve humour, sarcasm and hate speech detection tasks. We believe that this simple linguistic observation can also be potentially helpful in improving other similar NLP applications.

1 Introduction

Code-mixing/switching in social media has become commonplace. Over the past few years, the NLP research community has in fact started to vigorously investigate various properties of such code-switched posts to build downstream applications. The author in (Hidayat, 2012) demonstrated that inter-sentential switching is preferred more than intra-sentential switching by Facebook users. Further, while 45% of the switching was done for real lexical needs, 40% was for discussing a particular topic and 5% for content clarification. In another study, (Dey and Fung, 2014) interviewed Hindi-English bilingual students and reported that 67% of the words were in Hindi and 33% in English. Recently, many downstream applications have been designed for code-mixed text. (Han et al., 2012) attempted to construct a normalisation dictionary offline using the distributional similarity of tokens plus their string edit distance. (Vyas et al., 2014) developed a POS tagging framework for Hindi-English data. More nuanced applications like humour detection (Khandelwal et al., 2018), sarcasm detection (Swami et al., 2018) and hate speech detection (Bohra et al., 2018) have been targeted for code-switched data in the last two to three years.

1.1 Motivation

The primary motivation for the current work is derived from (Vizcaíno, 2011), where the author notes – "The switch itself may be the object of humour". In fact, (Siegel, 1995) studied humour in the Fijian language and notes that when trying to be comical, or to convey humour, speakers switch from Fijian to Hindi. Humour here is therefore produced by the change of code rather than by the referential meaning or content of the message. The paper also discusses similar phenomena observed in Spanish-English cases. In a study of English-Hindi code-switching and swearing patterns on social networks (Agarwal et al., 2017), the authors show that when people code-switch, there is a strong preference for swearing in the dominant language. These studies together lead us to hypothesize that the patterns of switching might be useful in building various NLP applications.

1.2 The present work

To corroborate our hypothesis, in this paper we consider three downstream applications – (i) humour detection (Khandelwal et al., 2018), (ii) sarcasm detection (Swami et al., 2018) and (iii) hate speech detection (Bohra et al., 2018) – for Hindi-English code-switched data. We first provide empirical evidence that the switching patterns between native (Hindi) and foreign (English) words distinguish the two classes of a post, i.e., humour vs non-humour, sarcastic vs non-sarcastic, or hateful vs non-hateful. We then featurise these patterns and pump them into the state-of-the-art classification models to show the benefits. We obtain a macro-F1 improvement of 2.62%, 1.85% and 3.36% over the baselines on the tasks of humour detection, sarcasm detection and hate speech detection respectively. As a next step, we introduce a modern deep neural model (HAN – Hierarchical Attention Network (Yang et al., 2016)) to improve the performance of the models further. Finally, we concatenate the switching features in the last hidden layer of the HAN and pass the result to the softmax layer for classification. This final architecture allows us to obtain a macro-F1 improvement of 4.9%, 4.7% and 17.7% over the original baselines on the three tasks respectively.

2 Dataset

We consider three datasets consisting of Hindi (hi) - English (en) code-mixed tweets scraped from Twitter for our experiments – Humour, Sarcasm and Hate. We discuss the details of each of these datasets below.

           +      -      Tweets   Tokens   Switching*
Humour     1755   1698   3453     9851     2.20
Sarcasm    504    4746   5250     14930    2.13
Hate       1661   2914   4575     10453    4.34

Table 1: Dataset description (* denotes average/tweet).

Humour: The humour dataset was released by (Khandelwal et al., 2018) and has Hindi-English code-mixed tweets from domains like 'sports', 'politics', 'entertainment' etc. The dataset has a uniform distribution of tweets in each category to yield better supervised classification results (see Table 1), as described by (Du et al., 2014). Here the positive class refers to humorous tweets while the negative class corresponds to non-humorous tweets. Some representative examples from the data, showing the point of switch corresponding to the start and the end of the humour component:

• women can crib on things like [humour-start] bhaiyya ye shakkar bahot zyada meethi hai [humour-end], koi aur quality dikhao (Gloss: women can crib on things like "brother the sugar is a little more sweet, show a different quality".)
• shashi kapoor trending on mothersday how apt, [humour-start] mere paas ma hai [humour-end] (Gloss: shashi kapoor trending on mothersday how apt, "I have my mother with me".)
• political journey of kejriwal, from [humour-start] mujhe chahiye swaraj [humour-end] to [humour-start] mujhe chahiye laluraj [humour-end] (Gloss: political journey of kejriwal, from "I want swaraj" to "I want laluraj".)

Sarcasm: The sarcasm dataset released by (Swami et al., 2018) contains tweets that carry the hashtags #sarcasm and #irony. The authors used other keywords such as 'bollywood', 'cricket' and 'politics' to collect sarcastic tweets from these domains. In this case, the dataset is heavily unbalanced (see Table 1). Here the positive class refers to sarcastic tweets and the negative class means non-sarcastic tweets. Some representative examples from our data, showing the points where the sarcasm starts and ends:

• said aib filthy pandit ji, [sarcasm-start] aap jo bol rahe ho woh kya shuddh sanskrit hai ? [sarcasm-end] irony shameonyou (Gloss: said aib filthy pandit ji, "whatever you are telling, is it pure sanskrit?" irony shameonyou.)
• irony bappi lahiri sings [sarcasm-start] sona nahi chandi nahi yaar toh mila arre pyaar kar le [sarcasm-end] (Gloss: doesn't matter you do not get gold or silver, you have got a friend to love.)

Hate speech: (Bohra et al., 2018) created the corpus using tweets posted online in the last five years which have a good propensity to contain hate speech (see Table 1). The authors mined tweets by selecting certain hashtags and keywords from 'politics', 'public protests', 'riots' etc. The positive class refers to hateful tweets while the negative class means non-hateful tweets. (Note: the dataset released by this paper only had the hate/non-hate tags for each tweet; the language tag for each word, required for our experiments, was not available. Two of the authors independently language tagged the data and obtained an agreement of 98.1%. While language tagging, we noted that the dataset is a mixed bag including hate speech, offensive and abusive tweets, which have already been shown to be different in earlier works (Waseem et al., 2017). However, this was the only Hindi-English code-mixed hate speech dataset available.) An example of a hate tweet showing the points of switch corresponding to the start and the end of the hate component:

• I hate my university, [hate-start] koi us jagah ko aag laga dey [hate-end] (Gloss: I hate my university. Someone burn that place.)
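For concreteness, the code sketches in the later sections assume tweets in a simple word-level language-tagged form like the one below. This layout is our own illustrative choice; the released datasets provide hi/en word tags in their own formats.

```python
# A code-mixed tweet as a list of (token, language-tag) pairs.
# Illustrative layout only; adapt to the actual dataset format.
tagged_tweet = [
    ('I', 'en'), ('hate', 'en'), ('my', 'en'), ('university', 'en'),
    ('koi', 'hi'), ('us', 'hi'), ('jagah', 'hi'), ('ko', 'hi'),
    ('aag', 'hi'), ('laga', 'hi'), ('dey', 'hi'),
]
```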
3 Switching features

In this section, we outline the key contribution of this work. In particular, we identify how patterns of switching correlate with the tweet text being humorous, sarcastic or hateful. We outline a synopsis of our investigation below.

3.1 Switching and NLP tasks

In this section, we identify how switching behavior is related to the three NLP tasks at hand.

Let Q be the property that a sentence has en words which are surrounded by hi words, i.e., there exists an English word in a Hindi context. For instance, the tweet koi_hi to_hi pray_en karo_hi mere_hi liye_hi bhi_hi satisfies the property Q. However, bumrah_hi dono_hi wicketo_hi ke_hi beech_hi gumrah_hi ho_hi gaya_hi does not satisfy Q.

We performed a statistical analysis to determine the correlation between the switching patterns and the classification task at hand (represented by T). Let us denote the probability that a tweet belongs to the positive class for a task T given that it satisfies property Q by p(T|Q). Similarly, let p(T|∼Q) be the probability that the tweet belongs to the positive class for task T given that it does not satisfy the property Q. Further, let avg(S|T) be the average switching in positive samples for the task T and avg(S|∼T) the average switching in negative samples for the task T.

            T: Humour   T: Sarcasm   T: Hate
p(T|Q)      0.56        0.28         0.36
p(T|∼Q)     0.50        0.42         0.41
avg(S|T)    7.84        0.60         1.49
avg(S|∼T)   6.50        0.89         1.54

Table 2: Correlation of switching with different classification tasks.

The main observations from this analysis for the three tasks – humour, sarcasm and hate – are noted in Table 2. For the humour task, p(humour|Q) dominates over p(humour|∼Q). Further, the average number of switches for the positive samples in the humour task is larger than that for the negative samples. Finally, we observe a positive Pearson's correlation coefficient of 0.04 between a text being humorous and the text having the property Q. Together, these indicate that the switching behavior has a positive connection with a tweet being humorous.

On the other hand, p(sarcasm|∼Q) and p(hate|∼Q) respectively dominate over p(sarcasm|Q) and p(hate|Q). Moreover, the average number of switches for the negative samples of both these tasks is larger than that for the positive samples. The Pearson's correlation between a text being sarcastic (hateful) and the text having the property Q is negative: -0.17 (-0.04). This shows there is an overall negative connection between the switching behavior and the sarcasm/hate speech detection tasks. While we have tested on one language pair (Hindi-English), our hypothesis is generic and has already been noted by linguists earlier (Vizcaíno, 2011).

3.2 Construction of the feature vector

Motivated by the observations in the previous section, we construct a vector hi_en[i] that denotes the number of Hindi (hi) words before the i-th English (en) word and a vector en_hi[i] that denotes the number of English (en) words before the i-th Hindi (hi) word. This can also be interpreted as the run-lengths of the Hindi and the English words in the code-mixed tweets. Based on these vectors we define nine different features that capture the switching patterns in the code-mixed tweets. (We tried several other variants but empirically observed that these nine features already subsume all the necessary distinguishing qualities.)

Feature name      Description
en_hi_switches    The number of en to hi switches in a sentence
hi_en_switches    The number of hi to en switches in a sentence
switches          The total number of switches in a sentence
fraction_en       Fraction of English words in a sentence
fraction_hi       Fraction of Hindi words in a sentence
mean_hi_en        Mean of the hi_en vector
stddev_hi_en      Standard deviation of the hi_en vector
mean_en_hi        Mean of the en_hi vector
stddev_en_hi      Standard deviation of the en_hi vector

Table 3: Description of the switching features.

An example feature vector computation: Consider the sentence koi_hi to_hi pray_en karo_hi mere_hi liye_hi bhi_hi.

hi_en: [0, 0, 2, 0, 0, 0, 0]
en_hi: [0, 0, 0, 1, 1, 1, 1]
Feature vector: [1, 1, 2, 1/7, 6/7, 2/7, 0.69, 4/7, 0.49]
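The nine features can be computed in a few lines. Below is a minimal Python sketch that reproduces the worked example above, assuming the (token, language-tag) representation shown in Section 2; the feature order follows Table 3, the population standard deviation matches the example values, and the function name is ours.

```python
import statistics

def switching_features(tagged_tweet):
    """Compute the nine switching features of Table 3 from a
    language-tagged tweet given as (token, lang) pairs, lang in {'hi', 'en'}."""
    langs = [lang for _, lang in tagged_tweet if lang in ('hi', 'en')]
    n = len(langs)

    # Switch counts over adjacent word pairs.
    hi_en_switches = sum(1 for a, b in zip(langs, langs[1:]) if (a, b) == ('hi', 'en'))
    en_hi_switches = sum(1 for a, b in zip(langs, langs[1:]) if (a, b) == ('en', 'hi'))

    # hi_en[i]: number of hi words preceding position i when word i is en,
    # zero elsewhere (and symmetrically for en_hi), as in the worked example.
    hi_seen = en_seen = 0
    hi_en, en_hi = [], []
    for lang in langs:
        hi_en.append(hi_seen if lang == 'en' else 0)
        en_hi.append(en_seen if lang == 'hi' else 0)
        hi_seen += lang == 'hi'
        en_seen += lang == 'en'

    return [
        en_hi_switches,
        hi_en_switches,
        en_hi_switches + hi_en_switches,   # total switches
        langs.count('en') / n,             # fraction_en
        langs.count('hi') / n,             # fraction_hi
        statistics.fmean(hi_en),           # mean_hi_en
        statistics.pstdev(hi_en),          # stddev_hi_en
        statistics.fmean(en_hi),           # mean_en_hi
        statistics.pstdev(en_hi),          # stddev_en_hi
    ]

tweet = [('koi', 'hi'), ('to', 'hi'), ('pray', 'en'), ('karo', 'hi'),
         ('mere', 'hi'), ('liye', 'hi'), ('bhi', 'hi')]
print(switching_features(tweet))
# [1, 1, 2, 0.1428..., 0.8571..., 0.2857..., 0.6998..., 0.5714..., 0.4948...]
```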
4 Experiments

4.1 Pre-processing

Tweets are tokenized and punctuation marks are removed. All the hashtags, mentions and urls are stored and converted to the strings 'hashtag', 'mention' and 'url' to capture the general semantics of the tweet. Camel-case hashtags were segmented and their parts included in the tokenized tweets (see (Belainine et al., 2016), (Khandelwal et al., 2017)). For example, #AadabArzHai can be decomposed into three distinct words: Aadab, Arz and Hai. We use the same pre-processing for all the results presented in this paper.
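One way to implement the camel-case hashtag segmentation described above is with a short regular expression. The paper does not specify its exact segmentation rule, so this is an illustrative sketch.

```python
import re

def split_camel_case_hashtag(hashtag):
    """Segment a camel-case hashtag into component words,
    e.g. '#AadabArzHai' -> ['Aadab', 'Arz', 'Hai']."""
    return re.findall(r'[A-Z][a-z]*|[a-z]+|\d+', hashtag.lstrip('#'))

print(split_camel_case_hashtag('#AadabArzHai'))  # ['Aadab', 'Arz', 'Hai']
```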

4.2 Machine learning baselines

Humour baseline (Khandelwal et al., 2018): Uses features such as n-grams, bag-of-words, common words and hashtags to train standard machine learning models such as SVM and Random Forest. The authors used character n-grams, as previous work shows this feature to be very effective in classifying text: character n-grams do not require expensive pre-processing techniques like tokenization, stemming and stop word removal, and they are also language independent and can be used on code-mixed texts. In their paper, the authors report results for tri-grams.

Sarcasm baseline (Swami et al., 2018): This model uses a combination of word n-grams, character n-grams, presence or absence of certain emoticons, and sarcasm-indicative tokens as features. A sarcasm-indicative score is computed and chi-squared feature reduction is used to take the top 500 most relevant words. These were incorporated into the features used for classification. Standard off-the-shelf machine learning models like SVM and Random Forest were used.

Hate baseline (Bohra et al., 2018): The hate speech detection baseline also consists of similar features such as character n-grams, word n-grams, negation words (see Christopher Potts' sentiment tutorial: http://sentiment.christopherpotts.net/lingstruc.html) and a lexicon of hate-indicative tokens. The chi-squared feature reduction method was used to decrease the dimensionality of the features. Once again, SVM and Random Forest based classifiers were used for this task.

Switching features: We plug the nine switching features introduced in the previous section into the three baseline models for humour, sarcasm and hate speech detection, as sketched below.
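As an illustration of how the switching features can be appended to such baselines, the sketch below combines them with a character tri-gram representation and trains an SVM, reusing switching_features from the Section 3.2 sketch. The baselines' full feature sets (emoticons, indicative-token scores, chi-squared selection) are omitted, and train_texts, train_tagged and train_labels are placeholders.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def fit_baseline_with_switching(train_texts, train_tagged, train_labels):
    """Character tri-grams + the nine switching features -> linear SVM."""
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
    X_ngrams = vectorizer.fit_transform(train_texts)

    # Nine switching features per tweet, stacked alongside the n-grams.
    X_switch = csr_matrix(np.array([switching_features(t) for t in train_tagged]))

    clf = LinearSVC()
    clf.fit(hstack([X_ngrams, X_switch]), train_labels)
    return vectorizer, clf
```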

4.3 Deep learning architecture

In order to draw on the benefits of the modern deep learning machinery, we build an end-to-end model for the three tasks at hand. We use the Hierarchical Attention Network (HAN) (Yang et al., 2016), which is one of the state-of-the-art models for text and document classification. It can represent sentences at different levels of granularity by stacking recurrent neural networks at the character, word and sentence level, attending over the words which are informative. We use the GRU implementation of HAN to encode the text representation for all the three tasks.

Handling data imbalance by sub-sampling: Since the sarcasm dataset is heavily unbalanced, we sub-sampled the data to balance the classes. To this purpose, we categorise the negative samples into those that are easy or hard to classify, hypothesizing that if a model can predict the hard samples reliably it can do the same with the easy samples. We trained a classifier model on the training dataset and obtained the softmax score, which represents p(sarcastic|text), for the test samples. Those test samples which have a score below a very low confidence threshold (say 0.001) are removed, treating them as easy samples. The dataset thus becomes smaller and more balanced. It is important to note that positive samples are never removed. We validated this hypothesis through the test set: our trained HAN model achieves an accuracy of 94.4% in classifying the easy (thrown out) samples as non-sarcastic, thus justifying the sub-sampling.

Switching features: We include the switching features in the pre-final fully-connected layer of HAN to observe whether this harnesses additional benefits (see Figure 1 and the sketch below).

Figure 1: The overall HAN architecture along with the switching features in the final layer.
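A minimal Keras sketch of the concatenation step follows, using a single bidirectional GRU over word ids as a simplified stand-in for the full word- and sentence-level attention encoder of HAN (Yang et al., 2016). The vocabulary size, sequence length and sigmoid output (the binary equivalent of the softmax layer described above) are our illustrative choices.

```python
from tensorflow.keras import layers, models

MAX_LEN, VOCAB = 100, 30000                               # illustrative values

words = layers.Input(shape=(MAX_LEN,), dtype='int32')     # token ids
switch = layers.Input(shape=(9,), dtype='float32')        # nine switching features

x = layers.Embedding(VOCAB, 50)(words)                    # 50-d embeddings, as in Section 4.4
x = layers.Bidirectional(layers.GRU(20))(x)               # simplified HAN encoder
x = layers.Dense(15, activation='relu')(x)                # dense output before concatenation

x = layers.Concatenate()([x, switch])                     # inject switching features pre-final
out = layers.Dense(1, activation='sigmoid')(x)

model = models.Model(inputs=[words, switch], outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy')
```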

4.4 Experimental Setup

Train-test split: For all datasets, we maintain a train-test split of 0.8 - 0.2 and perform 10-fold cross validation.

Parameters of the HAN: BiLSTMs: no dropout; early stopping patience: 15; optimizer = 'adam' (learning rate = 0.001, beta_1 = 0.9); loss = binary cross entropy; epochs = 200; batch size = 32; pre-trained word-embedding size = 50; hidden size: [20, 60]; dense output size (before concatenation): [15, 30].

Pre-trained embeddings: We obtained pre-trained embeddings by training GloVe from scratch on the large code-mixed dataset (725,173 tweets) released by (Patro et al., 2017) plus all the tweets (13,278) in our three datasets.

5 Results

We compare the baseline models with (i) the baseline + switching feature-based models and (ii) the HAN models. We use the macro-F1 score for comparison throughout. The main results are summarized in Table 4.

Model                      Humour   Sarcasm   Hate
Baseline (B)               69.34    78.4      33.60
Baseline + Feature (BF)    71.16    79.85     34.73
HAN (H)                    72.04    81.36     38.78
HAN + Feature (HF)         72.71    82.07     39.54

Table 4: Summary of the results from different models in terms of macro-F1 scores. A Mann-Whitney U test shows all improvements of HF over B are significant.

The interesting observations one can make from these results are – (i) inclusion of the switching features always improves the overall performance of any model (machine learning or deep learning) for all the three tasks, and (ii) the deep learning models are always better than the machine learning models. Inclusion of the switching features in the machine learning models (indicated as BF in Table 4) yields a macro-F1 improvement of 2.62%, 1.85% and 3.36% over the baselines (indicated as B in Table 4) on the tasks of humour detection, sarcasm detection and hate speech detection respectively. Inclusion of the switching features in the HAN model (indicated as HF in Table 4) yields a macro-F1 improvement of 4.9%, 4.7% and 17.7% over the original baselines on the three tasks respectively.
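The significance claim in the Table 4 caption can be checked with SciPy's Mann-Whitney U implementation. The per-fold macro-F1 scores below are hypothetical placeholders, since the paper reports only the aggregate numbers.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-fold macro-F1 scores for baseline (B) and HAN + features
# (HF) from the 10-fold cross validation; real fold scores are not reported.
f1_b  = [0.68, 0.70, 0.69, 0.71, 0.69, 0.68, 0.70, 0.69, 0.70, 0.69]
f1_hf = [0.72, 0.73, 0.72, 0.74, 0.73, 0.72, 0.73, 0.72, 0.73, 0.72]

stat, p = mannwhitneyu(f1_hf, f1_b, alternative='greater')
print(f'U = {stat}, p = {p:.4f}')   # small p -> improvement is significant
```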
Success of our model: The success of our approach is evident from the following examples. As we demonstrated earlier, humour is positively correlated with switching; a tweet with a switching pattern like anurag_hi kashyap_hi can_en never_en join_en aap_hi because_en ministers_en took_en oath_en, "main_hi kisi_hi anurag_hi aur_hi dwesh_hi ke_hi bina_hi kaam_hi karunga_hi" was not detected as humorous by the baseline (B) but was detected as such by our models (BF and HF). Note that the author of the above tweet seems to have categorically switched to Hindi to express the humour; such observations have also been made in (Rudra et al., 2016), where opinion expression was cited as a reason for switching.

Sarcasm being negatively correlated with switching, a tweet without switching is more likely to be sarcastic. For instance, the tweet naadaan_hi baalak_hi kalyug_hi ka_hi vardaan_hi hai_hi ye_hi, which bears no switching, was labeled non-sarcastic by the baseline. Our models (BF and HF) rectified this and correctly detected it as sarcastic.

Similarly, hate being negatively correlated with switching, a tweet with no switching – shilpa_hi ji_hi aap_hi ravidubey_hi jaise_hi tuchho_hi ko_hi jawab_hi mat_hi dijiye_hi ye_hi log_hi aap_hi ke_hi sath_hi kabhi_hi nahi_hi – which was labeled as non-hateful by the baseline, was detected as hateful by our methods (BF and HF).

6 Conclusion

In this paper, we identified how switching patterns can be effective in improving three different NLP applications. We presented a set of nine features that improve upon the state-of-the-art baselines. In addition, we exploited the modern deep learning machinery to improve the performance further. Finally, this model can be improved even more by pumping the switching features into the final layer of the deep network.

In future, we would like to extend this work to other language pairs. For instance, we have seen examples of such switching in English-Spanish ("Don't forget [humour-start] la toalla cuando go to le playa [humour-end]"; Gloss: Don't forget the towel when you go to the beach.) and English-Telugu ("Votes kosam quality leni liquor bottles supply chesina politicians [sarcasm-start] ee roju quality gurinchi matladuthunnaru [sarcasm-end]"; Gloss: Politicians who supplied low quality liquor bottles for votes are talking about quality today.) pairs also. Further, we plan to investigate other NLP applications that can benefit from the simple linguistic features introduced here.

References

P. Agarwal, A. Sharma, J. Grover, M. Sikka, K. Rudra, and M. Choudhury. 2017. I may talk in english but gaali toh hindi mein hi denge: A study of english-hindi code-switching and swearing pattern on social networks. In 2017 9th International Conference on Communication Systems and Networks (COMSNETS), pages 554–557.

Billal Belainine, Alexsandro Fonseca, and Fatiha Sadat. 2016. Named entity recognition and hashtag decomposition to improve the classification of tweets. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT), pages 102–111, Osaka, Japan. The COLING 2016 Organizing Committee.

Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A dataset of Hindi-English code-mixed social media text for hate speech detection. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 36–41, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Anik Dey and Pascale Fung. 2014. A Hindi-English code-switching corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14).

Mian Du, Matthew Pierce, Lidia Pivovarova, and Roman Yangarber. 2014. Supervised classification using balanced training. pages 147–158.

Bo Han, Paul Cook, and Timothy Baldwin. 2012. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 421–432, Jeju Island, Korea. Association for Computational Linguistics.

Taofik Hidayat. 2012. An analysis of code switching used by facebookers (a case study in a social network site).

Ankush Khandelwal, Sahil Swami, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2017. Classification of Spanish election tweets (COSET) 2017: Classifying tweets using character and word level features. In Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017) co-located with the 33rd Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), Murcia, Spain, September 19, 2017, pages 49–54.

Ankush Khandelwal, Sahil Swami, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. Humor detection in English-Hindi code-mixed social media content: Corpus and baseline system. CoRR, abs/1806.05513.

Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, and Animesh Mukherjee. 2017. All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2264–2274, Copenhagen, Denmark. Association for Computational Linguistics.

Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. 2016. Understanding language preference for expression of opinion and sentiment: What do Hindi-English speakers do on Twitter? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1131–1141.

Jeff Siegel. 1995. How to get a laugh in Fijian: Code-switching and humor. Language in Society, 24(1):95–110.

Sahil Swami, Ankush Khandelwal, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A corpus of English-Hindi code-mixed tweets for sarcasm detection. CoRR, abs/1805.11869.

María José García Vizcaíno. 2011. Humor in code-mixed airline advertising.

Yogarshi Vyas, Spandana Gella, Kalika Bali, and Monojit Choudhury. 2014. POS tagging of English-Hindi code-mixed social media content. pages 974–979.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78–84, Vancouver, BC, Canada. Association for Computational Linguistics.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.
