UCSD Winter 16 CSE 227 • Final Project Report
Predicting Malicious Torrents Online

Jeremy Blackstone, Kareem Kamel, Frederik Nygaard and Tong Jiang

University of California, San Diego
{jeremymccallblackstone, kamel.kareem, frederikny, jiangtong0824}@gmail.com

Abstract

Torrents have become an increasingly popular way for users to share large files. Because a file is broken into pieces when it is distributed as a torrent, malicious content can be hidden behind torrents. However, there is limited research on how to detect potentially malicious torrents. In this project, we create a dataset of torrents carrying verified and malicious labels, and we apply several machine learning models to it to classify torrents automatically. Our best result reaches a recall rate of 95%.

I. Introduction

Torrents have become a popular means for downloading music, open source software, movies and other large files. In fact, according to BitTorrent, the total number of users is estimated at more than a quarter billion. Torrents are especially convenient for transferring large files across the internet. BitTorrent is the communication protocol used to transfer torrent files. More formally, BitTorrent is a peer-to-peer (P2P) file sharing protocol used to distribute large amounts of data over the Internet. Before a file can be sent, it is first segmented into smaller pieces (less than 512 kB), and each piece is protected by a cryptographic hash. Users (peers) provide different pieces of the file to each other.

Since BitTorrent is a decentralized protocol, users can upload malicious files and distribute them across the network. Once downloaded, these malicious scripts will run on the user's local machine and can launch an array of attacks including, but not limited to, DDoS bots, trojans and worms. Anti-piracy organizations have also been known to upload corrupt (decoy) torrents, a technique known as torrent poisoning, to gather the IP addresses of downloaders. This malicious activity poses a risk to the freedom and integrity of the BitTorrent community.

Our project intends to identify malicious torrent files without actually downloading and scanning individual torrents. We make use of a torrent file's metadata and a number of machine learning techniques to identify fake/malicious torrents. This paper makes two main contributions. First, we created a dataset with more than 6000 torrents from the website Kickass Torrents. For each torrent in our dataset, we collected features such as the number of seeds, the number of leeches, the size of the file, the number of files and the creation day of the torrent. Also, based on the information provided by Kickass Torrents, we identified more than 3000 verified torrents, as well as more than 500 malicious torrents. Second, we applied different machine learning models to this dataset, aiming to automatically detect malicious torrents. We obtain a best recall rate of 91% with a linear regression model.

In section 3, we briefly introduce how we created our dataset. In section 4, we describe the models we used to solve the problem of malicious torrent detection. In section 5, we give a discussion and conclusion on the results of our project.
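To make the piece-level integrity check mentioned above concrete, the short sketch below shows how a downloaded piece could be checked against the digest published in a torrent's metadata. This is only an illustration, not part of our system; the helper name and the example data are hypothetical, and we assume SHA-1 digests as used by the BitTorrent protocol.

```python
import hashlib

def verify_piece(piece: bytes, expected_sha1_hex: str) -> bool:
    """Return True if a downloaded piece matches the SHA-1 digest
    announced in the torrent's metadata (hypothetical helper)."""
    return hashlib.sha1(piece).hexdigest() == expected_sha1_hex

# Example: a peer sends us one piece; we only accept it if its hash matches.
piece_data = b"...bytes received from a peer..."
announced_hash = hashlib.sha1(piece_data).hexdigest()  # stand-in for the .torrent hash list
assert verify_piece(piece_data, announced_hash)
```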

II. Related Work

While some work has examined malware propagation, little work has been done on identifying the malicious torrent files themselves. FUNNEL was able to reduce the propagation of fake torrents by automatically adjusting the number of concurrent downloads according to the differences between the positive and negative votes torrents receive from users[4]. Hatahet, Bouabdallah and Challal identify the vulnerabilities an active worm can exploit and develop a mathematical model for how worms can propagate throughout BitTorrent, but provide no means for actually detecting or containing them[1]. There is a tool called TorrentGuard[2] which identifies uploaders of fake/malicious torrents by their IP addresses, but it does not actually identify any features of the malicious torrents themselves. This allows malicious uploaders to react by hiding their IP addresses or simply using multiple IP addresses. Creating a classifier that recognizes malicious torrent files allows detection of fake/malicious torrents from any uploader, irrespective of their IP address.

III. Threat Model

Let us look at an example of how a torrent on a site (e.g. Kickass Torrents) is downloaded by a user, Alice. Using a BitTorrent client, Alice is able to download torrents listed on the torrent site. An attacker, Eve, can compromise the integrity, confidentiality and availability of the BitTorrent system by launching active and passive attacks:

Active Attacks:

• Eve uploads a fake/malicious torrent containing a virus, malware, or files that do not fit the description.
• Eve can intercept and modify a torrent using a Man in the Middle (MITM) attack.
• Eve can launch a denial of service (DoS) attack on the source or destination nodes in the BitTorrent network.

Passive Attacks:

• Eve can intercept pieces of the file, compromising the confidentiality of the BitTorrent system.

Each piece of the file contains a cryptographic hash to verify its integrity. In this paper, we only consider the attack in which Eve uploads a fake/malicious torrent to the site.

As previously mentioned, torrent sites like Pirate Bay and Kickass Torrents contain fake or malicious torrents that have been uploaded by attackers. An attacker, Eve, is able to upload a fake/malicious torrent to a torrent site undetected by the site moderators. The torrent site requires at least T hours to determine if a torrent is suspect. Since MITM attacks are outside the scope of this paper, we assume the node receives an authentic file, i.e. the integrity of the file has not been compromised. Since Eve is able to upload torrents, Eve has complete control over the file contents of the torrents she uploads. This means that Eve has control over the following features: the number of files, the size, and the contents of the torrent. We assume that Eve cannot manipulate the number of leeches, the number of seeds or the age of the torrent.

IV. Problem Statement

With this threat model in mind, we define the problem we wish to tackle. We want to determine with reasonable accuracy which torrents are fake/malicious without downloading the file and scanning it with an anti-virus. Each torrent listing contains the following metadata: the torrent name, the URL of the page containing the download link for the torrent, the number of uploaders (seeds) available, the number of hosts downloading the torrent (leeches), the size of the torrent, the number of files to be downloaded by the torrent, and how long the torrent has been up (age). More formally, given a torrent file f and its metadata

M_f = [name, torrentLink, seed, leech, size, files, age],

determine which files f may be fake/malicious. The classifier will use two labels, fake/malicious and malignant.
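As an illustration of the metadata tuple M_f defined above, the sketch below shows one way the per-torrent record and its label could be represented. The field names mirror the features listed in the problem statement; the class name and the example values are hypothetical rather than taken from our dataset.

```python
from dataclasses import dataclass

@dataclass
class TorrentMetadata:
    """One observation M_f = [name, torrentLink, seed, leech, size, files, age]."""
    name: str
    torrent_link: str
    seed: int       # number of uploaders currently sharing the torrent
    leech: int      # number of hosts currently downloading
    size: int       # total size in bytes
    files: int      # number of files in the torrent
    age: float      # time since upload, e.g. in days

# Hypothetical example record with its binary label.
example = TorrentMetadata("some.torrent", "https://example.org/t/1",
                          seed=120, leech=30, size=700_000_000, files=2, age=14.0)
label = "malignant"  # or "fake/malicious"
```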

V. Dataset

1. Verified torrents

Verified torrents are torrents that we assume to be legitimate, having content that matches the description and not containing malware or viruses. A torrent is considered verified if the moderators manually inspect and approve the file or if it is uploaded by a verified uploader. Verified uploaders are users on the site who have undergone a stringent process, must follow additional rules, and are recognized by the site community for consistently uploading acceptable torrents. Additionally, if verified uploaders violate any of the rules, they are in danger of losing their verified status and possibly their entire account on the site. We classified torrents as verified if either the verified_torrent or verified_author features were set to true. This gave us a result of 58,600 verified torrents.

2. Deleted torrents

We used deleted torrents to determine whether or not a torrent was fake or malicious. To get the dataset of deleted torrents, we used the first dataset we collected (as mentioned in the next subsection), waited a week, and checked whether moderators had deleted the torrents.

3. Kickass API

In order to establish ground truth for our models, we needed to collect a dataset of both verified and deleted torrents. We used an unofficial KickAss API [6] to get features of the torrents. Therefore, we only collected torrents from KickassTorrents. We also considered using torrents from ThePirateBay, but their API was out of date and did not work properly. The features we chose to use were: Torrentname, verified_author, size, files, age, verified_torrent and torrentlink. We collected 10,000 torrents from each of 9 categories (Games, Porn, Movies, TV-Shows, Books, Applications, Anime, Music and Others), allowing us to collect features for 90,000 torrents.

4. Searching for names

The first method we used to find deleted torrents was searching the site using the torrent-name feature we acquired via the API. We looped through all 90,000 torrents and used the torrent name to search for the torrent. If there were no results, then we assumed the torrent was deleted. Unfortunately, this method was unsuccessful because the torrent-name feature is not unique for each torrent and one torrent's name might be the start of another torrent's name. In order to get more conclusive results, we looped through the 90,000 torrents again, doing a name search with an additional test using the torrent-link feature. If the torrent name search was either empty, or the name search resulted in a torrent with a different link, we considered the torrent to be deleted. However, this still resulted in many false positives. For many torrents, a search query for the torrent's name on KickassTorrents returned no results, yet clicking on the same torrent's link directly showed that the torrent's download page was still present. Given these difficulties, we chose to find another method for collecting deleted torrents.

5. Parsing the HTML

After further research, we discovered that the HTML title of a deleted torrent's page contained "Torrent was deleted". Using this information, we iterated through the 90,000 torrent links, parsed the HTML, and checked whether the HTML title contained the string "Torrent was deleted". One potential problem with this method could be that an actual torrent name contains the string "Torrent was deleted". However, this is easy to escape and we do not consider it to be a prevailing issue. A sketch of this check appears below.
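The following sketch illustrates the title check described in subsection 5. It is a simplified reconstruction rather than our exact crawler: the list of torrent links is hypothetical, and we assume a plain HTTP fetch of each page followed by a search for the marker string in the page's title element.

```python
import re
import urllib.request

DELETED_MARKER = "Torrent was deleted"

def page_title(url: str) -> str:
    """Fetch a page and return the contents of its <title> tag (empty string if none)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def is_deleted(torrent_link: str) -> bool:
    """Treat a torrent as deleted if the marker string appears in its page title."""
    return DELETED_MARKER in page_title(torrent_link)

# Hypothetical usage over the collected torrent links.
torrent_links = ["https://example.org/torrent/1", "https://example.org/torrent/2"]
deleted = [link for link in torrent_links if is_deleted(link)]
```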

VI. Classification

1. Unsupervised Learning

Unsupervised learning is an efficient way to group data into different clusters based on their features, or to find hidden patterns behind these features. It is called "unsupervised" learning because we are able to learn the clusters and patterns without knowing which label a piece of data belongs to.

Among the approaches to unsupervised learning, K-means has proved to be a simple but efficient technique[3]. Given an input of quantified feature vectors, the K-means algorithm divides all the observations into k clusters. In each cluster, the center point is a prototype of the cluster, and each observation is assigned to the cluster whose prototype is nearest to it. In other words, the algorithm gives us a solution of

\arg\min_{C} \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - u_i \rVert^2,

where k is the number of clusters, C is the set of clusters, u_i is the center of cluster c_i, and x ranges over the set of observations X.

In the problem of malicious torrent detection, each torrent is an observation. Based on the dataset we introduced in the previous section, we have a quantified vector for each observation, including the number of seeds, the number of leeches, the size of the file, the number of files, and the age of the torrent.

To apply K-means to this dataset, our assumption is that a large number of normal torrents share several similar feature patterns, while the malicious torrents, which are far fewer, exhibit different feature patterns. For example, some kinds of normal torrents may have a similar number of seeds and leeches, so K-means groups these torrents around their prototype behavior, forming a large cluster. However, since the number of malicious torrents is much smaller than the number of normal torrents, they will be assigned to small clusters if their feature patterns differ from those of normal torrents.
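A minimal sketch of this clustering step is shown below, assuming the five metadata features are already assembled into a numeric matrix and using scikit-learn's KMeans. The number of clusters and the small-cluster cutoff are hypothetical choices for illustration, not values from our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# X: one row per torrent, columns = [seeds, leeches, size, files, age] (placeholder data).
X = np.random.rand(1000, 5)

# Scale features so that file size (in bytes) does not dominate the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_scaled)

# Flag torrents that fall into unusually small clusters as suspicious.
cluster_sizes = np.bincount(kmeans.labels_, minlength=kmeans.n_clusters)
SMALL_CLUSTER = 10  # hypothetical threshold
suspicious = np.where(cluster_sizes[kmeans.labels_] < SMALL_CLUSTER)[0]
```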
2. Semi-supervised Learning

Semi-supervised learning is usually used when only limited labeled data is available and unlabeled data has to be used during training[5]. It is called "semi-supervised" because it uses both labeled and unlabeled data for training, usually a smaller proportion of labeled data together with a larger proportion of unlabeled data.

The idea of semi-supervised learning particularly fits our problem since, as stated in the previous section, a significant amount of the data in the dataset we created is unlabeled. We seek to make full use of the labeled data as well as the unlabeled data in the same model.

Our approach is to modify the K-means algorithm stated above. In the K-means algorithm, we use fully unsupervised learning to find the center of each torrent cluster, and we identify malicious torrents by the size of the torrent cluster. However, this approach will fail to find popular malicious patterns: if there exists a cluster full of malicious torrents, it will be missed. We can improve the algorithm by introducing the verified torrents into the training process. Since we now have information about the verified torrents, we can directly use the features of these torrents as cluster centers and then assign the unlabeled torrents to the closest cluster. If the distance between an unlabeled data point and its assigned cluster center is greater than a threshold, we can say that this torrent "looks" too different from all the normal torrents, and we classify it as a suspected malicious torrent.
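A rough sketch of this verified-center variant appears below. It is an illustration under stated assumptions: cluster centers are derived from the verified torrents (here, by running K-means on the verified set), unlabeled torrents are assigned to the nearest center, and anything farther than a hypothetical distance threshold is flagged.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrices: rows = torrents, columns = [seeds, leeches, size, files, age].
X_verified = np.random.rand(500, 5)
X_unlabeled = np.random.rand(2000, 5)

scaler = StandardScaler().fit(X_verified)
centers = KMeans(n_clusters=20, n_init=10, random_state=0).fit(
    scaler.transform(X_verified)).cluster_centers_

# Assign each unlabeled torrent to its nearest verified-torrent center.
U = scaler.transform(X_unlabeled)
dists = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
nearest_dist = dists.min(axis=1)

THRESHOLD = 3.0  # hypothetical distance cutoff in scaled feature space
suspected_malicious = np.where(nearest_dist > THRESHOLD)[0]
```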


3. Supervised Learning

In supervised learning, a classifier is fed training data and their respective target labels. The classifier builds a model from the training data and predicts a label for each data point. The predicted labels are then compared to the target labels and the classifier is evaluated. In the following subsections, we study a number of supervised learning techniques and evaluate their performance on our training set. To evaluate the performance of the classifiers, we use precision, recall and the F1-score.

3.1 Linear Regression

Linear regression is usually used to find linear relations between a scalar dependent variable y and one or more explanatory variables.

In our problem, we have five features for each torrent: the number of seeds, the number of leeches, the size of the file, the number of files, and the age of the torrent. Among these five features, the first two, the number of seeds and the number of leeches, are not actually fixed features. They describe how many people were sharing and how many people were downloading the torrent at the time we created the dataset, so they reflect the status of the torrent at that particular moment. These statuses change as time passes, which means they change as the age of the torrent grows. Thus, we can treat the statuses as scalar dependent variables and the age as an explanatory variable in a linear regression problem.

However, upon further consideration we realized that the relation between seeds and age, as well as between leeches and age, is quadratic rather than linear. In the normal case, when a torrent is first released, the number of people sharing it may grow over time. However, after some time the torrent might become less popular and people may delete the file from their computers to save space, so the number of seeds will start to go down. Thus, to fit this quadratic relation, we introduce the square of the age as an explanatory variable. The size of the torrent file and the number of files may also affect the shape of these curves, so we add them as explanatory variables as well.

After we have learned these two groups of relations from the verified torrents, we can apply them to all the other, unlabeled torrents. For each torrent, we then have expected seeds and leeches as well as actual seeds and leeches. If the actual seeds and leeches for a torrent are too far from the expected ones, we identify it as malicious.
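As a sketch of this approach, the code below fits the quadratic age relation (with size and file count as extra explanatory variables) on verified torrents and flags torrents whose observed seed count deviates too far from the prediction. The placeholder arrays, the residual threshold and the use of scikit-learn's LinearRegression are assumptions for illustration; the same idea applies to leeches.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder arrays for verified torrents: age (days), size (bytes), file count, observed seeds.
age = np.random.rand(500) * 365
size = np.random.rand(500) * 1e9
files = np.random.randint(1, 50, size=500)
seeds = np.random.randint(0, 500, size=500)

def design(age, size, files):
    """Explanatory variables: age, age squared, size and file count."""
    return np.column_stack([age, age ** 2, size, files])

model = LinearRegression().fit(design(age, size, files), seeds)

# Apply the learned relation to unlabeled torrents and compare expected vs. actual seeds.
u_age, u_size, u_files = np.array([10.0, 300.0]), np.array([7e8, 2e9]), np.array([3, 12])
u_seeds = np.array([450, 2])
expected = model.predict(design(u_age, u_size, u_files))
THRESHOLD = 100.0  # hypothetical allowed deviation
suspected_malicious = np.abs(u_seeds - expected) > THRESHOLD
```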
3.2 Support Vector Machine with an RBF Kernel

The support vector machine (SVM) is a powerful linear classifier. However, since the relationships between the features are non-linear, we use an RBF kernel to transform the data points from our five-dimensional feature space into a higher-dimensional space where a simple hyperplane can separate the data. The intuition here is that the SVM will be able to discover a non-linear boundary separating the data into two classes, fake/malicious and malignant.

Label       Precision   Recall   f1-score
Malicious   0.64        0.12     0.2
Malignant   0.99        1.0      1.0

From the table above, we can see that the SVM does a decent job classifying the samples, with a precision of 0.64. However, the recall, i.e. the true positive rate on the malicious class, is too low, which means that a large number of fake/malicious torrents go undetected. This is reflected in the f1-score, the harmonic mean of precision and recall. The f1-score is a good measure for evaluating the classifier; it lies in the range [0, 1], and ideally we want to maximize it. The f1-score of the SVM on our training set is 0.2, weighed down by the low recall. This may be due to the small number of training samples labeled as malicious; a larger number of training samples would allow the classifier to generalize and build a stronger model.

3.3 K-Nearest Neighbors

Since the class boundary between the data is non-linear, we also tried the K-Nearest Neighbors classifier, a non-parametric classifier that can model arbitrarily complex functions. It works by labeling a point according to the majority vote of its K nearest neighbors. The intuition here is that K-Nearest Neighbors would be able to classify the data well.

Label       Precision   Recall   f1-score
Malicious   0.67        0.19     0.3
Malignant   1.0         1.0      1.0

From the table above, we can see that the K-nearest neighbors classifier does a decent job classifying the samples, with a precision of 0.67. However, the recall is again too low, meaning that a large number of fake/malicious torrents go undetected. This is reflected in the f1-score, the harmonic mean of precision and recall.

The f1-score of the K-nearest neighbors classifier on our training set is 0.3, so it does better than the SVM classifier, but the f1-score is still too low and is weighed down by the low recall. Again, this may be due to the small number of training samples labeled as malicious; a larger number of training samples would allow the classifier to generalize and build a stronger model.

3.4 Decision Trees

Another popular non-parametric classification method that is able to model arbitrarily complex functions is the decision tree classifier. It learns simple decision rules inferred from the data's feature set. However, like most non-parametric methods, decision trees risk overfitting the data.

Label       Precision   Recall   f1-score
Malicious   0.3         0.44     0.35
Malignant   1.0         0.99     0.99

From the table above, we can see that the decision tree classifier has a low precision on our training set. However, its recall on the malicious class is the best among the supervised classifiers we tried, and this is also reflected in its f1-score, the harmonic mean of precision and recall. There is still a large number of fake/malicious torrents that go undetected. Again, this may be due to the small number of training samples labeled as malicious; a larger number of training samples would allow the classifier to generalize and build a stronger model. A brief sketch of how such classifiers can be evaluated on this kind of feature set is given below.
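The sketch below illustrates how the supervised classifiers discussed in sections 3.2 to 3.4 can be trained and scored with precision, recall and F1, assuming the features and binary labels are already assembled as arrays. It uses scikit-learn implementations with synthetic placeholder data rather than our actual dataset, and the hyperparameters are defaults, not tuned values from our experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder data: rows = torrents, columns = [seeds, leeches, size, files, age];
# y = 1 for fake/malicious, 0 for malignant.
X = np.random.rand(3500, 5)
y = (np.random.rand(3500) < 0.15).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Report precision, recall and f1 for each classifier on the held-out split.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, clf.predict(X_test), digits=2))
```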

VII. Discussion

Due to the small number of training samples marked malicious and the limited feature set that we have, the supervised and unsupervised learning techniques failed to classify the data with both high precision and high recall, i.e. a high f1-score. The unsupervised learning methods maximized recall, reaching 0.95. Our torrent dataset was collected over the span of two weeks. However, we noticed while gathering the deleted torrents that we seemed to find several more malicious torrents in the first 1000 (out of 10,000) torrents for each category. One reason for this could be that the administrators had already checked the remaining 9000 torrents before we got our first dataset. This means that those fake/malicious torrents were already deleted before we collected our first dataset, and we would not be able to collect them.

VIII. Future Work

In order to alleviate the problem of torrents already having been checked by the moderators, we could gather torrents periodically rather than collecting all of them at once. We could also check whether torrents are deleted sooner than we did; we chose to wait a week to ensure the moderators would have time to delete the fake/malicious torrents, but this seems unnecessarily long. Finally, we could search other torrent websites for malicious torrents. ThePirateBay and KickassTorrents tend to be more popular because they are generally safer. Collecting torrents from other, less reputable torrent websites may yield a higher number of fake/malicious torrents to help train our classifier.

References

[1] Hatahet, Sinan, Bouabdallah, Abdelmadjid, and Challal, Yacine. "A new worm propagation threat in BitTorrent: modeling and analysis". (2010).

[2] Kryczka, Michal, Cuevas, Ruben, Gonzalez, Roberto, Cuevas, Angel, and Azcorra, Arturo. "TorrentGuard: stopping scam and malware distribution in the BitTorrent ecosystem". (2012).

[3] MacQueen, James. "Some methods for classification and analysis of multivariate observations." Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. No. 14. 1967.


[4] Santos, Flavio Roberto, Cordeiro, Weverton Luis da Costa, Gaspary, Luciano Paschoal, and Barcellos, Marinho Pilla. "Choking Polluters in BitTorrent File Sharing Communities". (2010).

[5] Zhu, Xiaojin. "Semi-supervised learning literature survey." (2005).

[6] https://github.com/fm4d/KickassAPI
