Hope Speech Detection in Under-Resourced Kannada Language
Advances in Computational Intelligence manuscript No. (will be inserted by the editor)

Adeep Hande · Ruba Priyadharshini · Anbukkarasi Sampath · Kingston Pal Thamburaj · Prabakaran Chandran · Bharathi Raja Chakravarthi

Received: date / Accepted: date

Adeep Hande
Indian Institute of Information Technology Tiruchirappalli, Tamil Nadu, India
[email protected]

Ruba Priyadharshini
ULTRA Arts and Science College, Madurai Kamaraj University, Madurai, Tamil Nadu, India
[email protected]

Anbukkarasi Sampath
Kongu Engineering College, Erode, Tamil Nadu, India
[email protected]

Kingston Pal Thamburaj
Sultan Idris Education University, Tanjong Malim, Perak, Malaysia
[email protected]

Prabakaran Chandran
Mu Sigma Inc., Bengaluru, Karnataka, India
[email protected]

Bharathi Raja Chakravarthi*
Insight SFI Research Centre for Data Analytics, Data Science Institute, National University of Ireland Galway, Galway, Ireland
[email protected]

Abstract Numerous methods have been developed in recent years to monitor the spread of negativity by eliminating vulgar, offensive, and fierce comments from social media platforms. However, comparatively few studies focus on embracing positivity and reinforcing supportive and reassuring content in online forums. Consequently, we propose creating an English-Kannada hope speech dataset, KanHope, and comparing several experiments to benchmark the dataset. The dataset consists of 6,176 user-generated comments in code-mixed Kannada, scraped from YouTube and manually annotated as bearing hope speech or not-hope speech. In addition, we introduce DC-BERT4HOPE, a dual-channel model that uses the English translation of KanHope for additional training to promote hope speech detection. The approach achieves a weighted F1-score of 0.756, bettering other models. Henceforth, KanHope aims to instigate research in Kannada while broadly encouraging researchers to take a pragmatic approach towards online content that is encouraging, positive, and supportive. We have published the data¹ and the corresponding code² to support our claims.

Keywords Hope Speech · Code-mixing · Under-resourced languages

¹ Kannada Hope Speech dataset
² KanHope

1 Introduction

The past decade has witnessed tremendous growth in social media users, mainly because the modernization of countries worldwide has made internet access easier [1]. The surge has also resulted in several minority groups seeking support and reassurance on social media. The ongoing pandemic has led people to spend more of their time on social media, socializing despite social distancing norms [2,3]. However, this poses a severe threat to adolescents and young adults who are ardent internet users. Social media applications such as Facebook, Twitter, and YouTube have become an integral part of their daily lives [4]. While these platforms are a boon for youngsters, who can socialize more with others, they can also be a bane and a significant factor in mental health problems [5], primarily because of the absence of content moderation on social media, where content is often offensive, abusive, or misleading towards a particular group, usually a minority [6]. Certain ethnic groups or individuals fall victim to the manipulation of social media to foster destructive or disruptive behaviour, a common scenario in cyberbullying [7,8]. There have been several recent developments in hate speech and offensive language detection [9]. However, these systems disregard the potential biases present in the datasets they are trained on and could hurt a specific group of social media users, often leading to gender or racial discrimination among users [10,11,12].

Consequently, there is a need to detect hope speech on social media. Equality, Diversity, and Inclusion is an important topic here, as it emphasizes the inclusion of a wide variety of people. We define hope as a form of reliance that the existing circumstances will change for the better [13]. Several marginalized groups seek comfort and aid from content on social media that they can relate to and that lets them empathize with others' conditions [14]. These groups usually belong to marginalized communities, such as the Lesbian, Gay, Bisexual, Transgender, Intersex, and Queer/Questioning (LGBTIQ) communities and racial and gender minorities. They perceive social media as one of the sources of counselling services, thus improving their emotional states [15,14]. This form of speech is vital to everyone, as it encourages people to improve their quality of life by taking action towards it. Hope speech aims to inspire people battling depression, loneliness, and stress by offering promise, reassurance, suggestions, and support [16]. As most social media in multilingual communities still revolves around English, code-mixing is prevalent there. Studies have shown that code-mixing is an integral part of social media in multilingual countries [17,18]. Code-mixing is the alternation between two or more languages during a conversation [19]. We observe code-mixing in our corpus, in the form of intrasentential modifications of codes. However, owing to the limited resources available for Kannada-English code-mixed text, our primary focus remains on constructing the corpus and conducting experiments to serve as a benchmark. Our dataset is distinct from HopeEDI [14]: that dataset spans three languages, namely English, Tamil, and Malayalam, while ours focuses on dataset construction in code-mixed Kannada-English. While HopeEDI consists of three classes (Hope, Not-hope, and Other language), our dataset consists of two classes: Hope and Not-Hope.

Hence, we introduce KanHope, an English-Kannada code-mixed hope speech dataset aiming to reduce the scarcity of data available for detecting hope speech in Kannada.

The principal contributions of the paper are listed below:

1. We have created the first dataset in Kannada to detect hope speech in code-mixed Kannada, to alleviate mental health problems on social media.
2. We provide a strong benchmark for the Kannada-English hope speech dataset.
3. We propose DC-BERT4HOPE, a dual-channel language model based on the architecture of BERT that uses the translation of the dataset as additional input for training and performs better than the typical fine-tuned multilingual BERT.
4. We perform a comprehensive analysis of our models on the dataset, along with a thorough error analysis of their predictions.

2 Related Work

There has been significant research on extracting data from social media, especially exploiting user comments on YouTube, Facebook, and Twitter [20,21,22]. Most of the information extracted from social media does not follow any grammatical rules and tends to be written in code-mixed or non-native scripts, which is generally observed among users from multilingual countries [18,19,23]. As people use social media platforms to educate themselves about current affairs, users' comments are highly correlated with the events taking place throughout the world. For other under-resourced languages, researchers constructed corpora that were manually annotated for two tasks, namely sentiment analysis and offensive language detection, in Tamil and Malayalam, consisting of 6,739 and 15,744 comments [20,24]. To improve the research in this domain, shared tasks were conducted for sentiment analysis [25] and offensive language detection in Dravidian languages [26]. Many researchers have made efforts to detect offensive language. People can communicate without face-to-face interaction on social media, and they are susceptible to misunderstandings as they do not consider others' perspectives. Offensive speech is often used in social media forums to dictate to others [27,28]. Several deep learning frameworks were developed to classify hate speech as racist, sexist, or neither [29]. For code-mixed languages, researchers created datasets for detecting hate speech in code-mixed Hindi [30,31]. However, there is a scarcity of data for hope speech detection. Previously, there have been very few works on hope speech detection, with the only dataset contribution being a sizeable multilingual corpus manually annotated for English, Tamil, and Malayalam, consisting of around 28K, 20K, and 10K comments, respectively [14]. Several other methods to alleviate gender or racial bias in natural language processing have been extensively studied for English [32], and in neural machine translation for French [33], for equality and diversity.

To encourage more research into hope speech for English, Malayalam, and Tamil, the authors conducted a shared task on hope speech detection for comments scraped from YouTube in these languages [34]. The organizers of the shared task used the multilingual hope speech dataset, HopeEDI [14]. The corpus consists of 28,451 sentences in English, 20,198 sentences in Tamil, and 10,705 sentences in Malayalam. The authors of HopeEDI set the baselines using preliminary machine learning algorithms, yielding weighted F1-scores of 0.90, 0.56, and 0.73 for English, Tamil, and Malayalam, respectively, on their test sets in the shared task.

... Sri Lanka [45,46]. Despite their abundance in terms of speakers, Dravidian languages are low-resource with respect to language technology [47]. The language is primarily spoken by people in Karnataka, India, and is also recognized as an official language of the state [38]. The Kannada script, also called Canarese, is an alphasyllabary of the Brahmic family that evolved from the Kadamba script [48]. While Kannada is an under-resourced Dravidian language, its script is also used to write other under-resourced languages such as Tulu, Konkani, and Sankethi [49]. The Kannada script has 13 vowels (14 if the obsolete vowel is included), 34 consonants, and 2 yogavahakas (semiconsonants: part-vowel, part-consonant) [50,49]. The Kannada language has over 43 million³ speakers. However, as stated earlier, the