Apache Spark Based Big Data Analytics for Social Network Cybercrime Forensics
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF NAIROBI SCHOOL OF COMPUTING AND INFORMATICS MASTER OF SCIENCE IN DISTRIBUTED COMPUTING TECHNOLOGY APACHE SPARK BASED BIG DATA ANALYTICS FOR SOCIAL NETWORK CYBERCRIME FORENSICS BY SIMON MULWA KIIO REG NO: P53/78939/2015 SUPERVISOR: DR. ELISHA O. ABADE Project report submitted in partial fulfillment of the Master of Science Degree in Distributed Computing Technology. NOVEMBER 2017 DECLARATION The project, as presented in this document, is my original work and has not been presented for any other university award. Signature: ___________________________ Date: _______________________ Simon Mulwa Kiio P53/78939/2015 The Project has been submitted in partial fulfillment of the requirements for the Master of Science Degree in Distributed Computing Technology at the University of Nairobi with my approval as the University Supervisor. Signature: _____________________________ Date _________________________ Dr. Elisha O. Abade School of Computing and Informatics i ACKNOWLEDGEMENT I wish to convey my appreciation and special thanks to my supervisor Dr. Elisha O. Abade for his dedication and assistance throughout the research process, the members of the panel whose knowledge and experience in this field has been of great help to my research and the whole School of Computing and Informatics for their support that made me deliver in this work. Special thanks to my family for their love, encouragement and support towards delivery. Lastly, I would like to appreciate my colleagues who supported me to deliver this research project. ii ABSTRACT The anonymity of social networks makes its attractive for cyber criminals to mask their criminal activities online posing a challenge to law enforcers in tracking and uncovering the perpetrators as most evidence is hidden within big data. With this ever-increasing volume of data, forensic analyst faces challenges in investigations involving huge data volumes while at the same time limited by computer processor, memory and storage resources of a single computer node. With increased social media data and the high rate of production, it has become difficult to collect, store and analyze such big data using traditional forensic tools. This study involved the application of apache spark and big data analytic in forensic analysis of social network cybercrimes such as hate speech, cyberbullying and demonstrated the application of data analytics in supplementing the challenges of traditional forensic tools in investigations involving Big Data. The study developed an apache spark based forensic tool to stream and analysis social media data for hate speech and cyberbully cybercrimes while diving to investigate relevant artifacts found on Twitter social network and ways to collect, preserve and ensure authenticity of the evidence. The study employed Naïve Bayes algorithm within Spark ML API to automatically classify and categorize hate speech and cyberbullying found within Twitter social media. The study showed that by generating SHA-256 Hash key for each tweet item within DStreams and storing tweet data together with corresponding Hash key in MongoDB can be used in tweet evidence preservation and authentication. Again, by streaming full tweet Account metadata, the study revealed that such metadata can be used in authenticating the creator, source, date and time for a given hate speech tweet. iii Table of Contents DECLARATION ............................................................................................................................. i ACKNOWLEDGEMENT .............................................................................................................. ii ABSTRACT ................................................................................................................................... iii LIST OF TABLES ........................................................................................................................ vii TABLE OF FIGURES ................................................................................................................. viii LIST OF ABBREVIATIONS ......................................................................................................... x CHAPTER ONE: INTRODUCTION ............................................................................................. 1 Background ..................................................................................................................... 1 Problem Statement .......................................................................................................... 4 Objectives of the Study ................................................................................................... 5 1.3.1 General Objectives ...................................................................................................... 5 1.3.2 Specific Objectives ..................................................................................................... 5 Research Questions ......................................................................................................... 6 Significance of the Study ................................................................................................ 7 Scope of the Study .......................................................................................................... 8 Assumptions and Limitations of the Study ..................................................................... 8 CHAPTER TWO: LITERATURE REVIEW ................................................................................. 9 2 Introduction ............................................................................................................................. 9 Digital Forensics ............................................................................................................. 9 2.1.1 The Digital Forensics Process ................................................................................... 10 Big Data Forensics ........................................................................................................ 11 2.2.1 Big Data Attributes ................................................................................................... 13 2.2.2 Big Data Architecture ............................................................................................... 15 Data Mining and Machine Learning ............................................................................. 16 2.3.1 Data Mining Techniques ........................................................................................... 17 2.3.2 Data Mining Algorithms ........................................................................................... 18 Classification Algorithms ............................................................................................. 19 2.4.1 Naive Bayes (Multinomial) Classifier ...................................................................... 19 2.4.2 Support Vector Machines (SVM) ............................................................................. 20 Apache Hadoop ............................................................................................................. 21 2.5.1 Hadoop Core Components ........................................................................................ 22 iv Apache Spark ................................................................................................................ 24 2.6.1 Spark Streaming ........................................................................................................ 25 2.6.2 Use Cases of Spark/Spark Streaming ....................................................................... 26 MongoDB ..................................................................................................................... 27 2.7.1 MongoDB Document Structure ................................................................................ 28 2.7.2 MongoDB Connector for Spark ................................................................................ 30 Social Networks ............................................................................................................ 30 2.8.1 Social Network Structure .......................................................................................... 31 2.8.2 Social Network Analysis (SNA) ............................................................................... 33 2.8.3 Social Network Sites Forensics ................................................................................ 34 2.8.4 Legal Challenges to Social Media Evidence Authentication .................................... 35 2.8.5 Sentiment Analysis ................................................................................................... 37 Proposed Solution ......................................................................................................... 38 Proposed Apache Spark Forensic Tool Conceptual Model .......................................... 38 Literature Summary ...................................................................................................... 39 CHAPTER THREE: RESEARCH METHODOLOGY ............................................................... 41 3 Introduction ........................................................................................................................... 41 Research Design ............................................................................................................ 41 3.1.1 CRISP-DM Overview ..............................................................................................