DATA MINING FILE SHARING METADATA a Comparsion Between Random Forests Classificiation and Bayesian Networks
Total Page:16
File Type:pdf, Size:1020Kb
DATA MINING FILE SHARING METADATA A comparsion between Random Forests Classificiation and Bayesian Networks Bachelor Degree Project in Informatics G2E, 22.5 credits, ECTS Spring term 2015 Andreas Petersson Supervisor: Jonas Mellin Examiner: Joe Steinhauer Abstract In this comparative study based on experimentation it is demonstrated that the two evaluated machine learning techniques, Bayesian networks and random forests, have similar predictive power in the domain of classifying torrents on BitTorrent file sharing networks. This work was performed in two steps. First, a literature analysis was performed to gain insight into how the two techniques work and what types of attacks exist against BitTorrent file sharing networks. After the literature analysis, an experiment was performed to evaluate the accuracy of the two techniques. The results show no significant advantage of using one algorithm over the other when only considering accuracy. However, ease of use lies in Random forests’ favour because the technique requires little pre-processing of the data and still generates accurate results with few false positives. Keywords: Machine learning, random forest, Bayesian network, BitTorrent, file sharing Table of Contents 1. Introduction ............................................................................................................................ 1 2. Background ............................................................................................................................ 2 2.1. Torrent ............................................................................................................................. 2 2.1.1. Fake torrents .............................................................................................................. 2 2.2. Torrent discovery site ...................................................................................................... 3 2.3. Machine Learning ............................................................................................................ 3 2.4. Random Forests ............................................................................................................... 4 2.5. Bayesian network ............................................................................................................. 5 2.6. Measuring accuracy ......................................................................................................... 6 2.7. Related work .................................................................................................................... 7 3. Problem definition ................................................................................................................. 9 3.1. Purpose............................................................................................................................. 9 3.2. Motivation ....................................................................................................................... 9 3.3. Objectives......................................................................................................................... 9 4. Method ................................................................................................................................. 10 4.1. Method choices .............................................................................................................. 10 4.1.1. Experiment .............................................................................................................. 10 4.1.2. Literature analysis ................................................................................................... 10 4.1.3. Method choice ......................................................................................................... 10 4.2. Validity threats ............................................................................................................... 11 4.2.1. Conclusion Validity .................................................................................................. 11 4.2.2. Internal validity ....................................................................................................... 11 4.2.3. Construct validity .....................................................................................................12 4.2.4. External validity .......................................................................................................12 4.3. Research ethics ...............................................................................................................12 5. Implementation .....................................................................................................................14 5.1. Literature analysis ...........................................................................................................14 5.2. Experiment .....................................................................................................................14 5.2.1. Collecting the data .................................................................................................... 15 5.2.2. Feature/Attribute selection ...................................................................................... 15 5.2.3. Creating a Bayesian network ....................................................................................16 5.2.4. Creating the prediction models ................................................................................19 5.2.5. Evaluating the performance .................................................................................... 20 6. Evaluation ............................................................................................................................ 22 6.1. Performance with all supported features included ........................................................ 22 6.2. Performance with “Largest file extension” removed. .................................................... 24 6.2. Discussion and conclusions ........................................................................................... 25 7. Conclusions and Future work ............................................................................................... 27 7.1. Summary ........................................................................................................................ 27 7.2. Future work.................................................................................................................... 27 8. References ............................................................................................................................ 28 Appendix A – Validity threats .................................................................................................. 29 Appendix B – Data acquisition scripts and schemas ............................................................... 32 Appendix C – Bayesian Network validation ............................................................................ 42 1. Introduction Copyright infringement (downloading movies or music illegally) using torrent discovery sites is a widespread phenomenon today. A problem with sending takedown for all content with a matching title (e.g. Game of Thrones) is the chance that decoy content is included. For example, content that has similar a similar title but might be anything from an empty media file or a virus (Santos, da Costa Cordeiro, Gaspary & Barcellos, 2010). Helping organisations such as those protecting intellectual property deal with it in a more efficient manner by filtering out decoys could save time and money that could be used elsewhere. Additionally, decoys on file sharing networks are a source of malware (Santos et al., 2010), machine learning might be an interesting approach for web filters in anti-virus solutions to protect users from malicious content on file sharing networks. This work evaluates two techniques for classification (categorizing something based on other information known about it) called Bayesian Networks and Random forests. These two techniques are compared using information about the contents of torrent files to evaluate their accuracy for classifying torrents as fake. This work uses a literature analysis as well as an experiment to compare and contrast the two previously mentioned techniques. Both techniques showed highly accurate results when applied to data from BitTorrent file sharing communities. But the results did not show any statistically significant advantage to using one technique over the other in terms of predictive performance. However, random forests required a lot less pre-processing of the data and was overall easier to use. This work is aimed at developers and project leaders for the types of companies mentioned above (anti-virus and IP rights organizations) to help them decide if machine learning is worth investing time into to integrate it with their other software or tools. This report is structured into six different parts. This introduction, followed by some background (chapter 2) which is then followed by the problem description (chapter 3). After the problem description potential methods (chapter 4) to solve the problem are discussed (as well as threats to the validity for using them). Chapter 5 contains the actual execution of the project and chapter 6 evaluates the results and discusses them. Chapter 7 contains a summary and discusses possible future work. 1 2. Background This chapter will explain some of the main concepts used in this work that are needed