Canterbury Christ Church University’s repository of research outputs
http://create.canterbury.ac.uk

Copyright © and Moral Rights for this thesis are retained by the author and/or other copyright owners. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the copyright holder/s. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the copyright holders.

When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given e.g. Day, Ed (2018) The application of machine learning, big data techniques, and criminology to the analysis of racist tweets. Ph.D. thesis, Canterbury Christ Church University.

Contact: [email protected]

THE APPLICATION OF MACHINE LEARNING, BIG DATA TECHNIQUES, AND CRIMINOLOGY TO THE ANALYSIS OF RACIST TWEETS.

by Ed Day

Canterbury Christ Church University

Thesis submitted for the degree of Doctor of Philosophy

2018

I, Ed Day, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work.

Word Count: 85,994

Abstract

Racist tweets are ubiquitous on Twitter. This thesis aims to explore the creation of an automated system to identify tweets and tweeters, and at the same time gain a theoretical understanding of the tweets. To do this a mixed methods approach was employed: machine learning was utilised to identify racist tweets and tweeters, and grounded theory and other qualitative techniques were used to gain an understanding of the tweets’ content. 84 million tweets that all contained racist words were collected from Twitter. 84,000 of these were hand annotated as racist or not.
The machine learning was performed in a Hadoop cluster, utilising Spark and Hive. To identify racist tweets, a systematic comparison of seven different algorithms and a large number of textual, user-derived and geographical features was performed. Two new features, time of day and day of week, were also evaluated. The 84,000 hand-annotated tweets were used as input to the supervised machine learning classification processes. It was found that the combination of support vector machines with hour of day as an additional feature was optimal for accuracy (0.93) and AUPRC (0.86). A qualitative exploration of tweets was also performed, including a grounded theory analysis. A novel machine learning system to identify racist accounts was created using metrics from the racist tweets, concepts from the grounded theory and a combination of the two as feature inputs. All three sets of features gave an accuracy of at least 0.82.

The ambiguity of the tweets made it difficult, for both humans and machines, to classify whether the tweeter’s intentions were racist or not, the word ‘nigga’ being particularly problematic. Grounded theory analysis of the tweets showed an extremely narrow rhetoric that could be summarised in a single theoretical concept: the defence of the in-group.

Acknowledgements

Thanks to Abhaya Induruwa and Robin Bryant for their excellent and patient supervision and input. Thanks to Sarah for all her help, and for putting up with this for all these years. Thanks to Harry, Ruby and Mackie for being great kids. Thanks to those who helped with the onerous task of reading thousands of racist tweets: Mike Hewitt, Annalise Ralph, Ant Murray, Jazz Budds and Gayle Lennox.

Contents

List of Figures
List of Tables
List of Abbreviations
List of Code
1 Introduction
    1.1 Research Aim
    1.2 Research Questions
    1.3 Structure of the Thesis
2 Literature Review
    2.1 Machine Learning and Hate Speech Detection
    2.2 Chapter Summary
3 Racism and Twitter
    3.1 Social Networks and Social Media and Twitter
    3.2 Race and Racism
    3.3 Ethnicity
    3.4 Hate Speech
        3.4.1 Hate Speech and the Law of England and Wales
        3.4.2 CPS and Racism
        3.4.3 Hate speech and Other Countries
        3.4.4 Definitions of Hate Speech
        3.4.5 Difficulties in Interpreting Tweets
        3.4.6 Personal Racism
    3.5 Policing the Internet
    3.6 Chapter Summary
4 Theoretical Background
    4.1 Cybercrime
    4.2 Crime and deviance
    4.3 Criminological Perspectives on Crime and Deviance
        4.3.1 Dispositional and Situational Theories of Crime and Deviance
        4.3.2 Control Theory
        4.3.3 Lifestyle Exposure Theory
        4.3.4 Routine Activity Theory and Environmental Criminology
        4.3.5 Routine Activities and Cybercrime
    4.4 Psychological Perspectives on Online Offending
        4.4.1 Disinhibition
        4.4.2 Anonymity
        4.4.3 Motivation
    4.5 Chapter Summary
5 Methodology
    5.1 Mixed Methods
    5.2 Scalability
    5.3 Hadoop
        5.3.1 HDFS
        5.3.2 MapReduce
        5.3.3 YARN
    5.4 Spark
    5.5 Tweets
    5.6 Sampling Tweets
    5.7 Collecting the Data
    5.8 Datasets
    5.9 Storing the Data
    5.10 Annotating and Classifying the Data
    5.11 Imbalanced Data
    5.12 Machine Learning
        5.12.1 Model Tuning
        5.12.2 Evaluation of the Classifiers
    5.13 Qualitative Analysis of the Tweet Data
        5.13.1 Initial Qualitative Analysis
        5.13.2 Grounded Theory Analysis
    5.14 Accounts
        5.14.1 Machine Learning and Accounts
    5.15 Chapter Summary
6 Big Data and Machine Learning
    6.1 Use of Big Data in Criminology
    6.2 This Research and Big Data
        6.2.1 Computer Data Storage
    6.3 Definitions of Big Data
    6.4 Machine Learning
        6.4.1 Machine Learning and Text
        6.4.2 Pre-processing the Data
        6.4.3 Features
        6.4.4 Spark Methods
        6.4.5 removeRegexUDF
        6.4.6 Stanford Core NLP Lemmatization
        6.4.7 Spark Extraction Methods
        6.4.8 Spark Transformation Methods
        6.4.9 Machine Learning Algorithms
    6.5 Chapter Summary
7 Results
    7.1 Machine Learning Results
        7.1.1 Choosing Fraction of Negatives
        7.1.2 Text, User, Geographical and Temporal Features
        7.1.3 Ngrams vs BOW and Hour and Day
        7.1.4 Algorithms
    7.2 Temporal Results
    7.3 Qualitative results
        7.3.1 NLTK Results
        7.3.2 Word Cloud Results
        7.3.3 Word Tree Results
        7.3.4 Discursive Analysis Results
        7.3.5 Analysis of the Popular Retweets Results
        7.3.6 Grounded Theory Results
    7.4 Accounts Results
        7.4.1 Accounts Machine Learning Results
    7.5 Chapter Summary
8 Discussion
    8.1 Automated Racist Tweet Identification
    8.2 Automated Identification of Influential Racist Twitter Accounts
    8.3 Qualitative Analysis, Criminology, Psychology and Racist Tweets
    8.4 Capable Guardianship and SADRTA
    8.5 Limitations
    8.6 Original Contribution
    8.7 Future Work
Appendices
    A Additional Code Listings
    B Record Counts and Metrics Table
References

List of Figures

1.1 An example tweet from March 2018.
4.1 The development of routine activity theory, from Eck and Madensen (2015).
5.1 HDFS architecture shows an HDFS client communicating with the master name node and slave data nodes, from Holmes (2012).
5.2 MapReduce architecture from Kawa (2014).