Towards Offensive Language Identification for Dravidian Languages

Total Page:16

File Type:pdf, Size:1020Kb

Towards Offensive Language Identification for Dravidian Languages Towards Offensive Language Identification for Dravidian Languages Siva Sai Yashvardhan Sharma Birla Institute of Technology and Birla Institute of Technology and Science Pilani, India. Science Pilani, India. f20170779@pilani. yash@pilani. bits-pilani.ac.in bits-pilani.ac.in Abstract In a country like India with multiple native lan- guages, users prefer to use their regional language Offensive speech identification in countries like India poses several challenges due to the in their social media interactions (Thavareesan and usage of code-mixed and romanized variants Mahesan, 2019, 2020a,b). It has also been iden- of multiple languages by the users in their tified that users tend to use roman characters for posts on social media. The challenge of of- texting instead of the native script. This poses a fensive language identification on social me- severe challenge for the identification of offensive dia for Dravidian languages is harder, con- speech, considering the under-developed method- sidering the low resources available for the ologies for handling code-mixed and romanized same. In this paper, we explored the zero- text (Jose et al., 2020; Priyadharshini et al., 2020). shot learning and few-shot learning paradigms based on multilingual language models for of- Until a few years ago, hate and offensive speech fensive speech detection in code-mixed and ro- were identified manually which is now an impos- manized variants of three Dravidian languages sible task due to the enormous amounts of data - Malayalam, Tamil, and Kannada. We pro- being generated daily on social media platforms. pose a novel and flexible approach of selec- The need for scalable, automated methods of hate tive translation and transliteration to reap bet- speech detection has attracted significant research ter results from fine-tuning and ensembling multilingual transformer networks like XLM- from the domains of natural language processing RoBERTa and mBERT. We implemented pre- and machine learning. A variety of techniques trained, fine-tuned, and ensembled versions of and tools like bag of words models, N-grams, XLM-RoBERTa for offensive speech classifi- dictionary-based approaches, word sense disam- cation. Further, we experimented with inter- biguation techniques are developed and experi- language, inter-task, and multi-task transfer mented with by researchers. Recent developments learning techniques to leverage the rich re- in multilingual text classification are led by Trans- sources available for offensive speech identi- former architectures like mBERT (Devlin et al., fication in the English language and to enrich the models with knowledge transfer from re- 2018) and XLM-RoBERTa (Conneau et al., 2019). lated tasks. The proposed models yielded good An additional advantage of these architectures, par- results and are promising for effective offen- ticularly XLM-RoBERTa, is that it yields good sive speech identification in low resource set- results even with lower resource languages and this tings.1 particular aspect is beneficial to Indian languages 1 Introduction which do not have properly established datasets. In our work, we focused on using these architectures Offensive speech is defined as speech that causes in multiple ways. However, there is a caveat in a person to feel upset, resentful, annoyed, or in- directly using the models on the romanized or code- sulted. In recent years, social media such as Twit- mixed text: the transformer models are trained on ter, Facebook and Reddit have been increasingly languages in their native script, not in the roman- used for the propagation of offensive speech and ized script in which users prefer to write online. We the organization of hate and offense-based activi- solve this problem by using a novel way to convert ties (Mandl et al., 2020; Chakravarthi et al., 2020e). the romanized sentences into their native language 1https://github.com/SivaAndMe/TOLIDL_ while preserving their semantic meaning - selective DravidianLangTech_EACL_2021 translation and transliteration. 18 Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, pages 18–27 April 20, 2021 ©2021 Association for Computational Linguistics This work is an extension of (Sai and Sharma, used an ensemble of BERT models with different 2020). Our important contributions are as follows random seeds for aggression detection on social 1) Proposed selective translation and transliteration media text, which is the inspiration behind our en- for text conversion in romanized and code-mixed sembling strategy with XLM-RoBERTa models. settings which can be extended to other romanized There has been less research on text classification and code-mixed contexts in any language. 2) Exper- in Dravidian languages and there is no research in imented and analyzed the effectiveness of finetun- offensive speech identification for Dravidian lan- ing and ensembling of XLM-RoBERTa models for guages so far. (Thomas and Latha, 2020) uses a offensive speech identification in code-mixed and simple LSTM model for sentiment analysis in the romanized scripts. 3) Investigated the efficacy of Malayalam language. Our work adresses this gap inter-language, inter-task, and multi-task transfer of less research in offensive speech identification learning techniques and demonstrated the results methods for Dravidian languages and the systems with t-SNE (Maaten and Hinton, 2008) visualiza- proposed can be extended to other Indian and for- tions. eign languages as well. The rest of the paper is organized as follows: In section 2, we discuss related work followed by 3 Datasets datasets’ description in section 3. Section 4 de- scribes methodology and section 5 discusses about The datasets used in this work were taken results. Finally, we conclude the paper in section 6. from three competitions - HASOC-Dravidian- CodeMix - FIRE 2020 (Chakravarthi et al., 2 Related works 2020a,c,b,d; Chakravarthi, 2020) (henceforth HDCM), Sentiment Analysis for Dravidian Lan- Two major shared tasks organized in offensive lan- guages in Code-Mixed Text (Chakravarthi et al., guage identification are OffensEval 2019 (Zampieri 2020a,d)(henceforth SADL) and Offensive Lan- et al., 2019) and GermEval (Struß et al., 2019). guage Identification in Dravidian Languages Other tasks related to offensive speech identifica- (Chakravarthi et al., 2021; Hande et al., 2020; tion include HASOC-19 (Mandl et al., 2019) which Chakravarthi et al., 2020d,a)(henceforth ODL). As dealt with hate speech and offensive content identi- a part of HDCM, there were two binary classi- fication in Indo-European languages, TRAC-2020 fication tasks, with the second task having two (Kumar et al., 2020), which dealt with aggression subtasks. The objective of all of the tasks was identification in Bangla, Hindi and English. While the same: given a Youtube comment, classify HASOC-19 and TRAC 2020 dealt with offensive it as offensive or not offensive. But the format speech identification in Indian languages of Bangla and language of data provided to different tasks and Hindi, HASOC-Dravidian-CodeMix - FIRE were different: Code-mixed Malayalam for Task- 2020 (Chakravarthi et al., 2020b) is the first shared 1(henceforth referred to as Mal-CM), Tanglish for task to conduct offensive speech identification task Task- 2a(henceforth referred to as Tanglish), and in Dravidian languages. Manglish for Task-2b(henceforth referred to as Researchers used a wide variety of techniques Manglish). From ODL, we used the Kanglish for the identification of offensive language. Re- dataset provided. Originally, the task proposed cently, (Ranasinghe and Zampieri, 2020)(work pub- in ODL is a multi-class classification problem with lished after the first version of this paper was sub- multiple offensive categories. However, to main- mitted) used the XLM-RoBERTa model for of- tain uniformity among the datasets, we divided fensive language identification in Bengali, Hindi, the classes into two - offensive and not offensive. and Spanish. They show that the XLM-R model Offensive-Targeted-Insult-Individual, Offensive- beats all other previous approaches for offensive Targeted-Insult-Group, Offensive-Untargeted, and language detection. (Saha et al., 2019), used the Offensive-Targeted-Insult-Other classes in the LGBM classifier on top of the combination of mul- dataset are renamed as offensive, and all the non- tilingual BERT and LASER pre-trained embed- Kannada posts are removed from the dataset. As far dings in HASOC-19. (Mishra and Mishra, 2019) as SADL is concerned, we used the given sentiment fine-tuned monolingual and multilingual BERT analysis datasets as helper datasets in inter-task based network models to achieve good results in transfer learning technique(4.5). OLID (Zampieri identifying hate speech. (Risch and Krestel, 2020) et al., 2019) is a dataset for offensive language iden- 19 Task Train Dev Test tweets where users tend to write informally using OFF NOT multiple languages. Also, translating romanized Mal-CM 567 2633 400 400 non-English language words into that particular language does not make any sense in our context. Tanglish 1980 2020 - 940 For example, translating the word ”Maram”(which Manglish 1952 2048 - 951 means tree in Tamil) directly into Tamil would se- Kanglish 1151 3544 586 778 riously affect the entire sentence’s semantics in OLID 4400 8840 - 860 which the word is present because ”Maram” will be treated as an English word. In many cases, Table 1: Statistics of the offensive speech datasets used valid translations from English to a non-English in this work. language would not be available for words. So, as a solution to this problem, we propose tification in the English language and we used it as selective transliteration and translation of the text. a helper dataset for inter-language transfer learn- In effect, this process of conversion of romanized ing technique (4.4). It consists of 14,100 tweets or code-mixed text(for example, Tanglish) is to as stated before; each tweet is annotated as offen- transliterate the romanized native language(Tamil) sive or not-offensive(subtask-A). The statistics of words in the text into Tamil and translate the En- the offensive speech datasets used in this work are glish words in the text into Tamil selectively.
Recommended publications
  • Tamil-English Mixed Language Used in Tamilnadu
    The International Journal of Language Society and Culture Editors: Thao Lê and Quynh Lê URL: www.educ.utas.edu.au/users/tle/JOURNAL/ ISSN 1327-774X Tamil-English Mixed Language Used in Tamilnadu K. Kanthimathi SDNB Vaishnav College Chennai Abstract In multilingual societies where several languages are used by different speech communities speakers tend to know two or more than two languages. People who live in a bilingual or multilingual communi- cation environment usually have the tendency to use two or more codes while communicating with each other. People not only speak different languages/codes but also mix the languages/codes known to them. Code mixing is used as a linguistic device in informal styles of speaking. This paper looks into the features of code mixed Tamil and English used by many people in Tamilnadu. Introduction Code mixing of the mother tongue and English is a common speech behaviour used by bilingual people in India. This field of research has been investigated by several linguists (Verma, 1969, 1976; Kachru, 1994; Annamalai, 1978, 2001; Sridhar, 1978; Pandharipande, 1983; Vaid, 1980; Singh, 1985, 1995). Language experts have studied the causes, functions, characteristics and effects code mixing. Such investigations have revealed the sociolinguistic, structural and psycholinguistic aspects of these language contact phenomena. The yearning to use English is becoming a universal phenomenon. English is associated with the ide- ology of modernity and progress, and the native languages with the ideology of tradition and cultural values. In the local Tamil media, such as Television and Radio, we find the use of many English words.
    [Show full text]
  • Learn Thai Language in Malaysia
    Learn thai language in malaysia Continue Learning in Japan - Shinjuku Japan Language Research Institute in Japan Briefing Workshop is back. This time we are with Shinjuku of the Japanese Language Institute (SNG) to give a briefing for our students, on learning Japanese in Japan.You will not only learn the language, but you will ... Or nearby, the Thailand- Malaysia border. Almost one million Thai Muslims live in this subregion, which is a belief, and learn how, to grow other (besides rice) crops for which there is a good market; Thai, this term literally means visitor, ASEAN identity, are we there yet? Poll by Thai Tertiary Students ' Sociolinguistic. Views on the ASEAN community. Nussara Waddsorn. The Assumption University usually introduces and offers as a mandatory optional or free optional foreign language course in the state-higher Japanese, German, Spanish and Thai languages of Malaysia. In what part students find it easy or difficult to learn, taking Mandarin READING HABITS AND ATTITUDES OF THAI L2 STUDENTS from MICHAEL JOHN STRAUSS, presented partly to meet the requirements for the degree MASTER OF ARTS (TESOL) I was able to learn Thai with Sukothai, where you can learn a lot about the deep history of Thailand and culture. Be sure to read the guide and learn a little about the story before you go. Also consider visiting neighboring countries like Cambodia, Vietnam and Malaysia. Air LANGUAGE: Thai, English, Bangkok TYPE OF GOVERNMENT: Constitutional Monarchy CURRENCY: Bath (THB) TIME ZONE: GMT No 7 Thailand invites you to escape into a world of exotic enchantment and excitement, from the Malaysian peninsula.
    [Show full text]
  • Researcher 2015;7(8)
    Researcher 2015;7(8) http://www.sciencepub.net/researcher “JANGLISH” IS CHEMMOZHI?...(“RAMANUJAM LANGUAGE”) M. Arulmani, B.E.; V.R. Hema Latha, M.A., M.Sc., M. Phil. M.Arulmani, B.E. V.R.Hema Latha, M.A., M.Sc., M.Phil. (Engineer) (Biologist) [email protected] [email protected] Abstract: Presently there are thousands of languages exist across the world. “ENGLISH” is considered as dominant language of International business and global communication through influence of global media. If so who is the “linguistics Ancestor” of “ENGLISH?”...This scientific research focus that “ANGLISH” (universal language) shall be considered as the Divine and universal language originated from single origin. ANGLISH shall also be considered as Ethical language of “Devas populations” (Angel race) who lived in MARS PLANET (also called by author as EZHEM) in the early universe say 5,00,000 years ago. Janglish shall be considered as the SOUL (mother nature) of ANGLISH. [M. Arulmani, B.E.; V.R. Hema Latha, M.A., M.Sc., M. Phil. “JANGLISH” IS CHEMMOZHI?...(“RAMANUJAM LANGUAGE”). Researcher 2015;7(8):32-37]. (ISSN: 1553-9865). http://www.sciencepub.net/researcher. 7 Keywords: ENGLISH; dominant language; international business; global communication; global media; linguistics Ancestor; ANGLISH” (universal language) Presently there are thousands of languages exist and universal language originated from single origin. across the world. “ENGLISH” is considered as ANGLISH shall also be considered as Ethical dominant language of International business and global language of “Devas populations” (Angel race) who communication through influence of global media. If lived in MARS PLANET (also called by author as so who is the “linguistics Ancestor” of EZHEM) in the early universe say 5,00,000 years ago.
    [Show full text]
  • Arxiv:2009.12534V2 [Cs.CL] 10 Oct 2020 ( Tasks Many Across Progress Exciting Seen Have We Ilo Paes Akpetandde Language Deep Pre-Trained Lack Speakers, Billion a ( Al
    iNLTK: Natural Language Toolkit for Indic Languages Gaurav Arora Jio Haptik [email protected] Abstract models, trained on a large corpus, which can pro- We present iNLTK, an open-source NLP li- vide a headstart for downstream tasks using trans- brary consisting of pre-trained language mod- fer learning. Availability of such models is criti- els and out-of-the-box support for Data Aug- cal to build a system that can achieve good results mentation, Textual Similarity, Sentence Em- in “low-resource” settings - where labeled data is beddings, Word Embeddings, Tokenization scarce and computation is expensive, which is the and Text Generation in 13 Indic Languages. biggest challenge for working on NLP in Indic By using pre-trained models from iNLTK Languages. Additionally, there’s lack of Indic lan- for text classification on publicly available 1 2 datasets, we significantly outperform previ- guages support in NLP libraries like spacy , nltk ously reported results. On these datasets, - creating a barrier to entry for working with Indic we also show that by using pre-trained mod- languages. els and data augmentation from iNLTK, we iNLTK, an open-source natural language toolkit can achieve more than 95% of the previ- for Indic languages, is designed to address these ous best performance by using less than 10% problems and to significantly lower barriers to do- of the training data. iNLTK is already be- ing NLP in Indic Languages by ing widely used by the community and has 40,000+ downloads, 600+ stars and 100+ • sharing pre-trained deep language models, forks on GitHub.
    [Show full text]
  • Relationship Between Learner Variables and Malaysian Students' Willingness to Communicate in the Esl Classroom Upm
    UNIVERSITI PUTRA MALAYSIA RELATIONSHIP BETWEEN LEARNER VARIABLES AND MALAYSIAN STUDENTS' WILLINGNESS TO COMMUNICATE IN THE ESL CLASSROOM UPM KOGGILA CHANDRA SEGAR COPYRIGHT © FBMK 2018 23 RELATIONSHIP BETWEEN LEARNER VARIABLES AND MALAYSIAN STUDENTS' WILLINGNESS TO COMMUNICATE IN THE ESL CLASSROOM UPM By KOGGILA CHANDRA SEGAR COPYRIGHT Thesis Submitted to the School of Graduate Studies, Universiti Putra Malaysia, in Fulfilment of the Requirements for the Degree of Master of Arts © June 2018 All material contained within the thesis, including without limitation text, logos, icons, photographs and all other artwork, is copyright material of Universiti Putra Malaysia unless otherwise stated. Use may be made of any material contained within the thesis for non-commercial purposes from the copyright holder. Commercial use of material may only be made with the express, prior, written permission of Universiti Putra Malaysia. Copyright © Universiti Putra Malaysia UPM COPYRIGHT © Abstract of thesis is presented to the Senate of Universiti Putra Malaysia in fulfilment of the requirement for the degree of Master of Arts RELATIONSHIP BETWEEN LEARNER VARIABLES AND MALAYSIAN STUDENTS’ WILLINGNESS TO COMMUNICATE IN THE ESL CLASSROOM By KOGGILA CHANDRA SEGAR June 2018 UPM Chairman : Ramiza Binti Darmi, PhD Faculty : Modern Languages and Communication The four language skills of listening, speaking, reading and writing are all interconnected. Proficiency in each skill is needed to become an efficient communicator. Furthermore, the ability to communicate fluently provides speaker with various benefits. This dissertation evaluates the willingness to communicate (WTC) in English Language among Malaysian students. WTC is the most basic orientation towards communication. Almost anyone is likely to respond to a direct question, but many will not continue or initiate interaction.
    [Show full text]
  • Fenomena Munculnya Interlanguage (Inglish) Di Indonesia
    FENOMENA MUNCULNYA INTERLANGUAGE (INGLISH) DI INDONESIA Rosita Ambarwati FPBS IKIP PGRI Madiun Abstrak The process of learning a new language is difficult. Even so, when the second language is finally formed, the language would have a continuous effect on the person’s mother tongue ability (Association for Psychological Science, 2009). On the other side, someone who is learning a new language, would also have trouble to understand the grammar in translation. In the translation skill, they move from the original language to the literal gloss before it reaches the new language (Saygin, 2001). Both sides show the same symptom, the birth of new terms that are actually combinations from both language elements. Some nations, suffer some sort of desperation where it is so difficult to learn English that leads them to a compromise. The compromise gave birth to numerous and vary new vocabularies, and almost can be recognizable as a language. Key words : Interlanguage, Inglish Pendahuluan Belajar bahasa baru itu sulit. Semakin sulit seiring meningkatnya usia. Walau demikian, saat bahasa kedua telah terwujud, bahasa tersebut akan berpengaruh sinambung pada kemampuan seseorang berbahasa asli (Association for Psychological Science, 2009). Di sisi lain, seorang yang mempelajari bahasa baru, akan mengalami kesulitan memahami grammar dan menterjemahkan. Dalam ilmu penerjemahan, mereka berangkat dari bahasa asli menuju ke literal gloss sebelum sampai ke bahasa baru tersebut (Saygin, 2001). Kedua sisi menunjukkan gejala yang sama, munculnya sekumpulan istilah yang merupakan perpaduan dari unsur-unsur kedua bahasa. Sebagian bangsa, mengalami sebuah keputusasaan, begitu sulitnya mempelajari bahasa Inggris sehingga membawa mereka pada kompromi. Kompromi ini memunculkan kosakata yang luar biasa banyak dan beragam, yang hampir dapat diakui sebagai bahasa.
    [Show full text]
  • ECFG-Malaysia-Mar-19.Pdf
    About this Guide This guide is designed to help prepare you for deployment to culturally complex environments and successfully achieve mission objectives. The fundamental information it contains will help you understand the decisive cultural dimension of your assigned location and gain necessary skills to achieve Malaysia mission success. The guide consists of two parts: Part 1: Introduces “Culture General,” the foundational knowledge you need to operate effectively in any global environment – Southeast Asia in particular (Photo: Malaysian, Royal Thai, and US soldiers during Cobra Gold 2014 exercise). Culture Guide Part 2: Presents “Culture Specific” information on Malaysia, focusing on unique cultural features of Malaysian society. This section is designed to complement other pre-deployment training. It applies culture- general concepts to help increase your knowledge of your assigned deployment location (Photo: US sailor signs autographs for Malaysian school children). For further information, visit the Air Force Culture and Language Center (AFCLC) website at www.airuniversity.af.edu/AFCLC/ or contact the AFCLC Region Team at [email protected]. Disclaimer: All text is the property of the AFCLC and may not be modified by a change in title, content, or labeling. It may be reproduced in its current format with the expressed permission of AFCLC. All photography is provided as a courtesy of the US government, Wikimedia, and other sources as indicated. GENERAL CULTURE PART 1 – CULTURE GENERAL What is Culture? Fundamental to all aspects of human existence, culture shapes the way humans view life and functions as a tool we use to adapt to our social and physical environments.
    [Show full text]
  • Malaysian Undergraduates' Attitudes Towards Localised English
    GEMA Online® Journal of Language Studies 80 Volume 18(2), May 2018 http://doi.org/10.17576/gema-2018-1802-06 Like That Lah: Malaysian Undergraduates’ Attitudes Towards Localised English Debbita Tan Ai Lin [email protected] School of Languages, Literacies & Translation Universiti Sains Malaysia Lee Bee Choo [email protected] School of Languages, Literacies & Translation Universiti Sains Malaysia Shaidatul Akma Adi Kasuma [email protected] School of Languages, Literacies & Translation Universiti Sains Malaysia Malini Ganapathy (Corresponding Author) [email protected] School of Languages, Literacies & Translation Universiti Sains Malaysia ABSTRACT Native-like English use is often considered the standard to be achieved, in contrast to non- native English use. Nonetheless, localised English varieties abound in many societies and the growth or decline of any language variety commonly depends on how it is perceived; for instance, as a mere tool for functionality or as a prized cultural badge, and only its users can offer us insights into this. The thrust of the present study falls in line with the concept of language vitality, which is basically concerned with the sustainability of non-global languages. This paper first explores the subject of localisation and English varieties, and then examines the attitudes of Malaysian undergraduates towards their English pronunciation and accent, as well as their perceptions of Malaysian English. A 26-item questionnaire created by the researchers was utilised to collect data. It was also tested for reliability, with returned values indicating good internal consistency for all constructs, making the instrument a reliable option for use in future studies. A total of 253 undergraduates from a public university responded to the questionnaire and results revealed that overall, the participants valued their local-accented English and the functionality of Malaysian English, but regarded this form of the language as substandard.
    [Show full text]
  • 1St INTERNATIONAL CONFERENCE
    1st INTERNATIONAL CONFERENCE Institute of Linguistics, Russian Academy of Sciences Moscow, April 9-10, 2018 URBAN LINGUISTIC DIVERSITY 1st INTERNATIONAL CONFERENCE Institute of Linguistics, Russian Academy of Sciences April 9-10, 2018, Moscow Urban Linguistic Diversity: Materials of the 1st International conference / Reviewed by Vladimir M.Alpatov, Natalya V.Vasilyeva. – Moscow: Institute of Linguistic RAS, 2018. – 52 pp. ISBN 978-5-6041117-0-3 Design: Jury Koryakov, Marina Raskladkina Copyright © 2018 by authors. CONTENTS INTRODUCTION ............................................................................................................................................................ 5 Urban variants of Russian ............................................................................................................................................... 7 Vladimir Belikov The civic university and urban language diversity: Multilingual Manchester as a model for participatory research ............................................................................................................................................................................. 8 Yaron Matras Why cities matter in Sociolinguistics .............................................................................................................................. 9 Dick Smakman Ethnolinguistic problems in Russian republics: Social and educational aspects ..................................................... 11 Ekaterina Arutyunova Unwelcomed and invisible: migrants’
    [Show full text]
  • Chapter 1 Introduction 1.0 Background of the Study Malaysia
    Chapter 1 Introduction 1.0 Background of the Study Malaysia opens her doors to students from various parts of the world with the mushrooming of many private institutions of higher learning and the move to internationalize her many public universities. As such, there is a steady flow of foreign students who made their way to the Malaysian institutions of higher learning not only because of her reputable standard in education but also due to the fact that its education fees are cheaper compared to that in the United Kingdom and United States and that in terms of culture, Malaysia provides a more conducive environment especially to students from Muslim countries. The aftermath of September 11, 2001 has great percussions on many decisions that pertain to Islamic issues and the Muslims. Receiving education from a Muslim country like Malaysia is one of the decisions brought about by the devastating episode. Iranians have also turned to Malaysia to obtain higher education and many of them are enrolled in the masters and Phd programs offered by both the public and the private universities in Malaysia. These programs are offered in the English language and as most Iranians learn English as a foreign language, they encounter problems coping with their studies. As such, they have to prepare themselves to master the language in order to overcome language problems in their studies. How can we master a language? How can we conform to the language conventions in a foreign context? Why do only a few learners become near-native speakers despite 1 making many efforts? Prior to answer these questions, we should pay attention to what make a language foreign to us.
    [Show full text]
  • Offensive Language Identification in Dravidian Code Mixed Social Media
    Offensive language identification in Dravidian code mixed social media text Sunil Saumya1, Abhinav Kumar2 and Jyoti Prakash Singh2 1Indian Institute of Information Technology Dharwad, Karnataka, India 2National Institute of Technology Patna, Bihar, India [email protected], [email protected], [email protected] Abstract that the automation of hate speech detection is a more reliable solution. (Davidson et al., 2017) ex- Hate speech and offensive language recog- nition in social media platforms have been tracted N-gram TF-IDF features from tweets us- an active field of research over recent years. ing logistical regression to classify each tweet in In non-native English spoken countries, so- hate, offensive and non-offensive classes. Another cial media texts are mostly in code mixed or model for the detection of the cyberbullying in- script mixed/switched form. The current study stances was presented by (Kumari and Singh, 2020) presents extensive experiments using multiple with a genetic algorithm to optimize the distinguish- machine learning, deep learning, and trans- ing features of multimodal posts. (Agarwal and fer learning models to detect offensive content Sureka, 2017) used the linguistic, semantic and on Twitter. The data set used for this study are in Tanglish (Tamil and English), Man- sentimental feature to detect racial content. The glish (Malayalam and English) code-mixed, LSTM and CNN based model for recognising hate and Malayalam script-mixed. The experimen- speech in the social media posts were explored by tal results showed that 1 to 6-gram character (Kapil et al., 2020). (Badjatiya et al., 2017) ex- TF-IDF features are better for the said task.
    [Show full text]
  • Globalização E Expansão Cons- Cienciológica Através Dos Idiomas
    302 Temas da Conscienciologia Globalização e Expansão Cons- cienciológica Através dos Idiomas Globalization and Conscientiological Expansion through Languages Globalización y Expansión Concienciológica a través de los Idiomas Luis Minero* * Graduado em Química. Pesquisador e Resumo: Diretor Administrativo da IAC. Este artigo apresenta um estudo da relação entre a globalização e a expan- [email protected] são e integração da Conscienciologia através de vários idiomas. Diferentes ca- ......................................................... racterísticas da globalização são apresentadas, fazendo-se a correlação com a Conscienciologia. Esta pesquisa sugere que, com a globalização, alguns valo- res regionais e idiomas menores ficam marginalizados e desaparecem, enquanto Palavras-chave outros se expandem. Uma análise básica da estrutura e característica dos idio- Conscienciologia mas é apresentada, evidenciando como as pessoas entendem o mundo em que Cultura vivem e, ao mesmo tempo, como as linguagens as condicionam. Através dos Globalização paralelos estudados entre a globalização e a análise dos idiomas, este trabalho Idiomas pretende contribuir para a criação de novas associações de idéias, enfatizando Poliglotismo a importância de se investir na prática do poliglotismo e na conscientização dos seus efeitos, concluindo que neste momento da globalização, e dentro do con- Keywords texto do Estado Mundial, as conscins mais lúcidas não podem se permitir estar Conscientiology restritas por nenhuma limitação idiomática. Culture Abstract: Globalization This article presents a study in regards to the relation between globalization Languages and the expansion and integration of conscientiology through different languages. Polyglotism Different characteristics of globalization are presented and correlated with conscientiology. This research suggests that, with globalization, some regional Palabras-clave values and less expressive languages become marginalized and eventually Concienciología disappear, while other languages expand.
    [Show full text]