machine learning & knowledge extraction Article Monitoring Users’ Behavior: Anti-Immigration Speech Detection on Twitter Nikolaos Pitropakis 1,* , Kamil Kokot 1, Dimitra Gkatzia 1 , Robert Ludwiniak 1, Alexios Mylonas 2 and Miltiadis Kandias 3 1 School of Computing, Edinburgh Napier University, Edinburgh EH10 5DT, UK;
[email protected] (K.K.);
[email protected] (D.G.);
[email protected] (R.L.) 2 School of Computing and Informatics, Bournemouth University, Poole BH12 5BB, UK;
[email protected] 3 Department of Informatics, Athens University of Economics and Business, 104 34 Athina, Greece;
[email protected] * Correspondence:
[email protected] Received: 25 June 2020; Accepted: 30 July 2020; Published: 3 August 2020 Abstract: The proliferation of social media platforms changed the way people interact online. However, engagement with social media comes with a price, the users’ privacy. Breaches of users’ privacy, such as the Cambridge Analytica scandal, can reveal how the users’ data can be weaponized in political campaigns, which many times trigger hate speech and anti-immigration views. Hate speech detection is a challenging task due to the different sources of hate that can have an impact on the language used, as well as the lack of relevant annotated data. To tackle this, we collected and manually annotated an immigration-related dataset of publicly available Tweets in UK, US, and Canadian English. In an empirical study, we explored anti-immigration speech detection utilizing various language features (word n-grams, character n-grams) and measured their impact on a number of trained classifiers. Our work demonstrates that using word n-grams results in higher precision, recall, and f-score as compared to character n-grams.