Comparative Performance of Machine Learning Algorithms in Cyberbullying Detection: Using Turkish Language Preprocessing Techniques
Emre Cihan ATES (a,*), Erkan BOSTANCI (b), Mehmet Serdar GÜZEL (c)

(a) Department of Security Science, Gendarmerie and Coast Guard Academy (JSGA), Ankara, Turkey.
(b) Department of Computer Engineering, Ankara University, Ankara, Turkey.
(c) Department of Computer Engineering, Ankara University, Ankara, Turkey.

Abstract

With the increasing use of the internet and social media, cyberbullying has become a major problem. The most basic protection against its dangerous consequences is to actively detect and control content containing cyberbullying. Given today's internet and social media usage statistics, it is impossible to detect such content by human effort alone. Effective cyberbullying detection methods are therefore necessary to make social media a safe communication space. Current research efforts focus on using machine learning to detect and eliminate cyberbullying. Most studies on cyberbullying detection have been conducted on English texts, and there are few studies on Turkish. The studies conducted on Turkish have also used a limited range of methods and algorithms. In addition, the algorithms used to classify texts containing cyberbullying differ in scope and performance, which shows the importance of choosing an appropriate algorithm. The aim of this study is to compare the performance of different machine learning algorithms in detecting Turkish messages containing cyberbullying. Nineteen different classification algorithms were used to identify texts containing cyberbullying, using Turkish natural language processing techniques. Precision, recall, accuracy and F1 score were used to evaluate the performance of the classifiers.
The Light Gradient Boosting Machine (LGBM) algorithm showed the best performance, with 90.788% accuracy and a 90.949% F1 score.

Keywords: Cyberbullying, machine learning, social media, crime, natural language processing

*Corresponding author. E-mail addresses: [email protected] (E.C. Ates), [email protected] (E. Bostanci), [email protected] (M.S. Guzel).

1. Introduction

With the rapid development of social networks, the scope of internet use has expanded significantly. We are in a period in which people create a virtual personality on social media for purposes such as expressing their emotions and sharing things with other people (Mulyadi and Fitriana, 2018). After this virtual personality is formed, which usually starts with signing up to social media sites, data such as text, sound, images and video can be shared. Beyond data transfer, social media is a communication environment in which people can also send each other messages, status updates or comments. The rapid increase in the use of social media in recent years has brought both positive and negative developments (Akdemir and Lawless, 2020; Siddiqui and Singh, 2016). One of the most common negative effects is cyberbullying. Cyberbullying can be defined as the act of intentionally harming another person through technological communication tools (Englander et al., 2017; Peter and Petermann, 2018). Harassment or aggression through e-mail, mobile phones, online games and social media can be considered cyberbullying.

Most studies on detecting cyberbullying are English-based, and studies based on the Turkish language remain scarce. Turkish, which is spoken primarily in Turkey but also in many other parts of the world, is a head-final language, and this is one of the main factors that make it difficult to conduct research on it (Çakir and Güldamlasioğlu, 2016; Çöltekin, 2020).
For example, the 2020 usage figures for the social media platforms on which cyberbullying takes place (WeAreSocial, 2020) show that Turkey, where Turkish is spoken as a mother tongue, ranks 6th on Twitter with 11,800,000 users, 5th on Instagram with 38,000,000 users and 10th on Facebook with 37,000,000 users, and ranks 15th in time spent on social media platforms, with an average of 2 hours 51 minutes a day. These statistics show how widespread social media use is in Turkey. It has therefore become essential to detect cyberbullying in Turkish texts. This study aims to compare the performance of different machine learning algorithms in cyberbullying detection using natural language processing techniques specific to the Turkish language.

Following this introduction, the background section defines the concepts of cyberbullying, machine learning, natural language processing and the Turkish language, in that order. The related works section presents studies in the literature on cyberbullying detection. The methodology section describes the data used for detecting cyberbullying, the pre-processing steps and the machine learning models used. The results and discussion section reports the results obtained after machine learning modelling and compares them with other studies in the literature. The conclusion and future works section presents the gains of the study and future work plans to increase cyberbullying detection performance in Turkish texts.

2. Background

2.1. Cyberbullying

Cyberbullying is aggressive and repetitive negative behaviour carried out with technological devices against another person or group in a situation of power imbalance (Brailovskaia et al., 2018; Englander et al., 2017).
The harm caused by cyberbullying is more serious than that caused by physical bullying. Content used for bullying can spread very quickly on the internet and social media. Moreover, in parallel with the rapid development of technology, perpetrators can hide their identities on virtual platforms by various methods, and the resulting difficulty of identifying them encourages such acts. On social media today, cyberbullying behaviours include sending threatening and vulgar messages, disclosing private personal information, creating online gossip, sending repeated swearing and insulting messages, and humiliating a person or group. Cyberbullying can affect all genders and age groups, and it can be related to physical, cultural, racial and even religious prejudices (Ferreira et al., 2016; Muralidharan and La Ferle, 2018). Many people are known to experience physical and mental victimization after cyberbullying. These behaviours also gradually reduce people's tendency to trust social media (Marzano, 2019). For this reason, detecting cyberbullying on social media is a necessity rather than an option.

Taking Twitter in 2020 as an example, an average of 194,444 tweets were sent in different languages every minute (Lewis and Callahan, 2020). Over a 24-hour period this amounts to an average of 279,999,360 tweets (194,444 tweets per minute × 1,440 minutes), and it is almost impossible to examine such a large amount of data in different languages manually. For this reason, current studies focus primarily on normalizing text with natural language processing techniques and then classifying it with machine learning.

2.2. Machine Learning

Machine learning is the ability of a computer to learn to make decisions on its own, based on existing data and experience (Hutter et al., 2019).
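To make this definition concrete, the following minimal Python sketch shows a program learning a bullying/normal decision from a few labelled examples instead of from hand-written rules. It is an illustration under stated assumptions and not the pipeline used in this study: the training sentences are invented toy data, the function names (normalize, train, predict) are hypothetical, the normalization is limited to a Turkish-aware lowercase and tokenization, and a multinomial Naive Bayes scorer with Laplace smoothing is used only because it fits in a few lines.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase with Turkish dotted/dotless i handling, then split into word tokens."""
    # Python's str.lower() maps "I" to "i"; Turkish expects "ı", so map it first.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return re.findall(r"\w+", text)


def train(samples):
    """Collect per-class document and token counts from (text, label) pairs."""
    class_docs = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for tok in normalize(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, token_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_docs, token_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # class prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in normalize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hand-written training data (hypothetical, not from the study).
TRAIN = [
    ("sen tam bir aptalsın", "bullying"),
    ("aptal ve salak birisin", "bullying"),
    ("bugün hava çok güzel", "normal"),
    ("harika bir gün geçirdik", "normal"),
]
model = train(TRAIN)

print(predict(model, "Aptalsın!"))
print(predict(model, "Hava çok güzel"))
```

With only four training sentences the sketch can do no more than show the mechanics; a real system would need a large labelled corpus and the Turkish-specific pre-processing that this study describes in its methodology.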
Decisions made with machine learning are generally for prediction or classification purposes (Sarkar, 2019). In this study, machine learning was used to determine whether Twitter posts contain cyberbullying. In this context, the available data is known as training data. The computer trains models on it with learning algorithms and then classifies test data. There are different learning methods depending on the labelling of the training data (Aggarwal, 2018). Labelling means the classification of data by humans before machine learning. If the learning model depends on labelled data, it is supervised learning. Unsupervised learning occurs when the training data are not labelled; in this case the machine learns to group the data by itself according to the similarities and differences between data points. Learning applied to a combination of labelled and unlabelled data is semi-supervised learning. Machine learning algorithms are widely used to detect messages containing cyberbullying in social networks (Rosa et al., 2019). They are generally used in the form of supervised learning, as in this study. The algorithms used in this study are explained below.

Naive Bayes: This algorithm assumes that the inputs are independent of each other and of equal importance. Although the way it works seems quite simple, it gives successful and fast results in many real-life applications (Liu et al., 2017; Rogel-Salazar, 2018). Independence of the inputs means that each attribute contributes to the result on its own, with no relationship among the attributes. In real data this is almost never the case. In general,