Classifying Illegal Activities on Tor Network Based on Web Textual Contents

Mhd Wesam Al Nabki1,2, Eduardo Fidalgo1,2, Enrique Alegre1,2, and Ivan de Paz1,2

1 Department of Electrical, Systems and Automation, University of León, Spain
2 Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain
{mnab, eduardo.fidalgo, ealeg, ivan.paz.centeno}@unileon.es

Abstract

The freedom of the Darknet offers a safe place where people can express themselves anonymously, but they also can conduct illegal activities. In this paper, we present and make publicly available a new dataset of Darknet active domains, which we call "Darknet Usage Text Addresses" (DUTA)1. We built DUTA by sampling the Tor network during two months and manually labeled each address into 26 classes. Using DUTA, we conducted a comparison between two well-known text representation techniques crossed by three different supervised classifiers to categorize the Tor hidden services. We also fixed the pipeline elements and identified the aspects that have a critical influence on the classification results. We found that the combination of TF-IDF words representation with the Logistic Regression classifier achieves 96.6% of 10-fold cross-validation accuracy and a macro F1 score of 93.7% when classifying a subset of illegal activities from DUTA. The good performance of the classifier might support potential tools to help the authorities in the detection of these activities.

1 Introduction

If we think about the web as an ocean of data, the Surface Web is no more than the slight waves that float on the top, while in the depth there is a lot of sunken information that is not reached by the traditional search engines. The web can be divided into the Surface Web and the Deep Web. The Surface Web is the portion of the web that can be crawled and indexed by the standard search engines, such as Google or Bing. However, despite their existence, there is still an enormous part of the web that remains without indexing due to its vast size and the lack of hyperlinks, i.e. pages that are not referenced by other web pages. This part, which can not be found using a search engine, is known as the Deep Web (Noor et al., 2011; Boswell, 2016). Additionally, the content might be locked and require human interaction to access, e.g. solving a CAPTCHA or entering log-in credentials. This type of web page is referred to as a "database-driven" website. Moreover, the traditional search engines do not examine the underneath layers of the web and, consequently, do not reach the Deep Web. The Darknet, which is also known as the Dark Web, is a subset of the Deep Web. It is not only unindexed and isolated, but it also requires specific software or a dedicated proxy to access it. The Darknet works over a virtual sub-network of the World Wide Web (WWW) that provides an additional layer of anonymity for the network users. The most popular Darknets are "The Onion Router"2, also known as the Tor network, the "Invisible Internet Project" (I2P)3, and Freenet4. The community of Tor refers to Darknet websites as "Hidden Services" (HS), which can be accessed via a special browser called Tor Browser5.

A study by Bergman (2001) stated astonishing statistics about the Deep Web. For example, the Deep Web alone holds more than 550 billion individual documents, compared to only 1 billion on the Surface Web. Furthermore, the study of Rudesill et al. (2015) emphasized the immensity of the Deep Web, which was estimated to be 400 to 500 times wider than the Surface Web.

1 The dataset is available upon request to the first author (email).
2 www.torproject.org
3 www.geti2p.net
4 www.freenetproject.org
5 www.torproject.org/projects/torbrowser.html.en

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 35–43, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

The concepts of the Darknet and the Deep Net have existed since the establishment of the World Wide Web (WWW), but what made them very popular in the recent years was the FBI's arrest of Dread Pirate Roberts, the owner of the Silk Road black market, in October 2013. The FBI estimated the sales on Silk Road to be 1.2 billion dollars by July 2013. The trading network connected approximately 150,000 anonymous customers and approximately 4,000 vendors (Rudesill et al., 2015). The Bitcoin cryptocurrency (Nakamoto, 2008) is a hot topic in the field of the Darknet, since it anonymizes the financial transactions and hides the trading parties' identities (Ron and Shamir, 2014).

The Darknet is often associated with illegal activities. In a study carried out by the Intelliagg group (2015) over 1K samples of hidden services, they claimed that 68% of Darknet contents would be illegal. Moore and Rid (2016) showed, after analyzing 5K onion domains, that the most common usages of Tor HS are criminal and illegal activities, such as drugs, weapons, and all kinds of pornography.

It is worth mentioning the dramatic increase in the proliferation of Darknet domains, which doubled in number from 30K to 60K between August 2015 and August 2016 (Figure 1). However, the publicly reachable domains are no more than 6K to 7K due to the ambiguous nature of the Darknet (Ciancaglini et al., 2016).

DUTA contains 26 categories that cover all the legal and illegal activities monitored on the Darknet during our sampling period. Our objective is to create a precise categorization of the Darknet by classifying the textual content of the HS. In order to achieve our target, we designed and compared different combinations of some of the most well-known text classification techniques, identifying the key stages that have a high influence on the method performance. We set a baseline methodology by fixing the elements of the text classification pipeline, which allows the scientific community to compare their future research with this baseline under the defined pipeline. The fixed methodology we propose might represent a significant contribution to a tool for the authorities who monitor Darknet abuse.

The rest of the paper is organized as follows: Section 2 presents the related work. Next, Section 3 explains the proposed dataset DUTA and its characteristics. After that, Section 4 describes the set of designed classification pipelines. Then, in Section 5 we discuss the experiments performed and the results. In Section 6 we describe the technical implementation details and how we employed the successful classifier in an application. Finally, in Section 7 we present our conclusions with a pointer to our future work.

2 Related Work

Figure 1: The number of unique *.onion addresses in the Tor network between August 2015 and August 2016.

Motivated by the critical buried contents of the Darknet and its high abuse, we focused our research on designing and building a system that classifies the illegitimate practices on the Darknet. In this paper, we present the first publicly available dataset, called "Darknet Usage Text Addresses" (DUTA), that is extracted from the Tor HS Darknet.

In the recent years, many researchers have investigated the classification of the Surface Web (Dumais and Chen, 2000; Sun et al., 2002; Kan, 2004; Kan and Thi, 2005; Kaur, 2014) and the Deep Web (Su et al., 2006; Xu et al., 2007; Barbosa et al., 2007; Lin et al., 2008; Zhao et al., 2008; Xian et al., 2009; Khelghati, 2016). However, the Darknet classification literature is still in its early stages, and specifically the classification of illegal activities (Graczyk and Kinningham, 2015; Moore and Rid, 2016).

Kaur (2014) introduced an interesting survey covering several algorithms to classify web content, paying attention to its importance in the field of data mining. Furthermore, the survey included the pre-processing techniques that might help in feature selection, like eliminating the HTML tags and punctuation marks, and stemming. Kan et al. explored the use of Uniform Resource Locators (URLs) in web classification by extracting features through parsing and segmenting them (Kan, 2004; Kan and Thi, 2005).

These techniques can not be applied to Tor HS, since the onion addresses are constructed with 16 random characters. However, tools like Scallion6 and Shallot7 allow Tor users to create customized .onion addresses based on a brute-force technique; e.g. Shallot needs 2.5 years to build only 9 customized characters out of 16. Sun et al. (2002) employed a Support Vector Machine (SVM) to classify web content by taking advantage of context features, e.g. HTML tags and hyperlinks, in addition to the textual features to build the feature set.

Regarding the Deep Web classification, Noor et al. (2011) discussed the common techniques that are used for content extraction from Deep Web data sources, called "Query Probing", which is commonly used for supervised learning algorithms, and "Visible Form Features" (Xian et al., 2009). Su et al. (2006) proposed a combination of SVM with query probing to classify the structured Deep Web hierarchically. Barbosa et al. (2007) proposed an unsupervised machine learning clustering pipeline, in which Term Frequency-Inverse Document Frequency (TF-IDF) was used for the text representation and the cosine similarity as the distance measurement for the k-means.

With respect to the Darknet, Moore and Rid (2016) presented a new study based on Tor hidden services to analyze and classify the Darknet. Initially, they collected 5K samples of Tor onion pages and classified them into 12 classes using an SVM classifier. Graczyk and Kinningham (2015) proposed a pipeline to classify the products of a famous black market on the Darknet, called Agora, into 12 classes with 79% of accuracy. Their pipeline architecture uses TF-IDF for text feature extraction, PCA for feature selection, and SVM for feature classification.

Several attempts in the literature have been proposed to detect illegal activities, whether on the World Wide Web (WWW) (Biryukov et al., 2014; Graczyk and Kinningham, 2015; Moore and Rid, 2016), on peer-to-peer (P2P) networks (Latapy et al., 2013; Peersman et al., 2014), or in chat messaging systems (Morris and Hirst, 2012). Latapy et al. (2013) investigated P2P systems, e.g. eDonkey, to quantify the paedophile activity by building a tool that detects child-pornography queries by performing a series of lexical text processing steps. They found that 0.25% of entered queries are related to a pedophilia context, which means that 0.2% of eDonkey network users are entering such queries. However, this method is based on a predefined list of keywords, which can not detect new or previously unknown words.

3 The Dataset

3.1 Dataset Building Procedure

To the best of our knowledge, there is no labeled dataset that encompasses the activities on the Darknet web pages. Therefore, we have created the first publicly available Darknet dataset, and we called it the Darknet Usage Text Addresses (DUTA) dataset. Currently, DUTA contains only Tor hidden services (HS). We built a customized crawler that utilizes a Tor socket to fetch onion web pages through port 80 only, i.e. the HTTP protocol. The crawler has 70 worker threads in parallel to download the HTML code behind the HS. Each thread dives into the second level in depth for each HS in order to gather as much text as possible, rather than just the index page as in other works (Biryukov et al., 2014). It searches for HS links on several famous Darknet resources like onion.city8 and ahmia.fi9. We reached more than 250K HS addresses, but only 7K were alive; the others were down or not responding. After that, we concatenated the HTML pages of every HS into a single HTML file, resulting in one HTML file for each HS domain. We collected 7,931 hidden services by running the crawler for two months between May and July 2016. For the time being, we have labeled 6,831 samples.

3.2 Dataset Characteristics

Darknet researchers have analyzed the HS contents and categorized them into different numbers of categories. Biryukov et al. (2014) sampled 1,813 HS and detected 18 categories. The Intelliagg group (2015) analyzed 1K HS samples and classified them into 12 categories. Moore and Rid (2016) studied 5,615 HS examples and categorized them into 12 classes. Based on our objective to build a multipurpose dataset, and for the sake of completeness, we classified DUTA manually into 26 classes. To the best of our knowledge, this classification is the most extensive and complete up to date.
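The crawling procedure of Section 3.1 (parallel workers, a depth of two levels per hidden service, and one concatenated HTML file per domain) can be sketched as follows. This is a simplified illustration of the traversal logic, not the authors' released crawler; the `fetch` callable is an assumption of ours, supplied by the caller and expected to route HTTP requests through a Tor SOCKS proxy, and the link pattern assumes the 16-character base32 onion addresses mentioned in Section 2:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# v2 onion domains are 16 base32 characters (a-z, 2-7)
ONION_LINK = re.compile(r'href="(http://[a-z2-7]{16}\.onion[^"]*)"')

def crawl_hidden_services(seeds, fetch, depth=2, workers=70):
    """Fetch each hidden service's pages up to `depth` levels and
    concatenate all HTML belonging to the same .onion domain.
    `fetch(url)` must return the HTML string, or None on failure."""
    pages = {}                 # domain -> list of HTML documents
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(fetch, frontier))
        next_frontier = []
        for url, html in zip(frontier, results):
            if html is None:
                continue                     # dead or unresponsive HS
            domain = url.split("/")[2]
            pages.setdefault(domain, []).append(html)
            for link in ONION_LINK.findall(html):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    # one concatenated HTML file per hidden service, as in Section 3.1
    return {d: "\n".join(docs) for d, docs in pages.items()}
```

Leaving `fetch` injectable keeps the traversal testable offline; in the real setting it would open a socket through the Tor proxy.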

6 www.github.com/lachesis/scallion
7 www.github.com/katmagic/Shallot
8 www.onion.city
9 www.ahmia.fi
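The Shallot estimate quoted in Section 2 (2.5 years for 9 customized characters) is consistent with a simple brute-force cost model. The key-generation rate below is our own assumption, used only to show the order of magnitude:

```python
BASE32 = 32  # .onion names draw from a-z and 2-7

def expected_attempts(prefix_len):
    """Expected number of random keys to generate before one whose
    address starts with a chosen prefix of `prefix_len` characters,
    assuming each key yields a uniformly random base32 address."""
    return BASE32 ** prefix_len

# At an assumed rate of 1e6 keys per second, pinning 9 of the 16
# characters takes 32**9 (about 3.5e13) attempts -- on the order of
# a year of search, the same ballpark as the Shallot figure.
years_for_9_chars = expected_attempts(9) / 1e6 / (3600 * 24 * 365)
```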

The collected samples were divided among the four authors, and each one labeled their designated part; if an author hesitated, the sample was openly discussed with the rest of the authors. Finally, to check the consistency of the manual labeling, the first author reviewed the final labeling by analyzing random samples of the categorization made by the others.

In addition to labeling the main classes, we dived into labeling the sub-classes of the HS. For example, the class Counterfeit Personal Identification has three sub-classes: Identity Card, Driving License, and Passport. Table 1 enumerates the DUTA classes.

Main Class                          | Sub-Class           | Count
Violence                            | Hate                | 4
                                    | Hitman              | 11
                                    | Weapons             | 47
Counterfeit Personal Identification | Driving-Licence     | 4
                                    | ID                  | 7
                                    | Passport            | 37
Hosting and Software                | File                | 111
                                    | Folders             | 63
                                    | Search-Engine       | 38
                                    | Server              | 95
                                    | Software            | 121
                                    | Directory           | 142
Drugs                               | Illegal             | 230
                                    | Legal               | 9
Marketplace                         | Black               | 63
                                    | White               | 67
Pornography                         | Child-pornography   | 914 (10)
                                    | General-pornography | 83
Social-Network                      | Blog                | 71
                                    | Chat                | 47
                                    | Email               | 56
                                    | News                | 32

Main Class               | Count
Art/Music                | 8
Casino/Gambling          | 26
Services                 | 285
Cryptocurrency           | 586
Down                     | 608
Empty                    | 1649
Forum                    | 104
Hacking                  | 90
Wiki                     | 29
Leaked-Data              | 12
Locked                   | 435
Personal                 | 405
Politics                 | 8
Religion                 | 6
Library/Books            | 27
Fraud                    | 4
Counterfeit Money        | 55
Counterfeit Credit Cards | 240
Human-Trafficking        | 2
The total count          | 6831

Table 1: DUTA dataset classes

Counterfeit is a wide class, so we split it into three main classes: 1) Counterfeit Personal Identification, which is related to the forging of government documents; 2) Counterfeit Money, which includes currency forging; and 3) Counterfeit Credit Cards, which covers cloning credit cards, hacked PayPal accounts, and fake market cards like Amazon and eBay cards. The class Services contains the legal services that are provided by individuals or organizations. The class Down contains the errors that were returned by the down web pages while crawling them, e.g. an SQL error in a website database or a […] error.

We assign the class Empty to a web page when: 1) the text is very short, i.e. less than 5 words; 2) it has only images with no text; 3) it contains unreadable text like special characters, numbers, or unreadable words; or 4) it is an empty Cryptolocker (ransomware) page (Ciancaglini et al., 2016). The class Locked contains the HS that require solving a CAPTCHA or entering log-in credentials. We noticed that some people love to present their works, projects, or even their personal information through an HS page, so we labeled these into the class Personal. The pages that fell under more than one category were labeled based on their main content. For example, we assign the Forum label to multi-topic forums unless the whole forum is related to a single topic; e.g. a hacking forum was assigned to the Hacking class instead of Forum. The class Marketplace was divided into Black, when it contained a group of illegal services like Drugs, Weapons, and Counterfeit services, and White, when the marketplace offered legal shops like mobile phones or clothes.

As we labeled DUTA manually, we realized that some forums on HS contain numerous web pages that are all related to a single class; i.e. we found a forum about child-pornography that has more than 800 pages of textual content, so we split it up into single samples, each representing one forum page, and we added them to the dataset.

4 Methodology

Each classification pipeline is comprised of three main stages: first, text pre-processing; then, feature extraction; and finally, classification. We used two famous text representation techniques across three different supervised classifiers, resulting in six different classification pipelines, and we examined every pipeline to figure out the combination and parameters that achieve the best performance.

4.1 Text Pre-processing

Initially, we eliminated all the HTML tags, and when we detected an image tag, we preserved the image name and removed the extension. Furthermore, we filtered the training set for non-English samples using the Langdetect11 Python library and stemmed the text using the Porter stemmer from the NLTK package12.

10 This class includes 57 unique samples plus 857 samples that are extracted from a single forum (see Section 3.2).
11 https://pypi.python.org/pypi/langdetect
12 https://tartarus.org/martin/PorterStemmer/

Additionally, we removed special characters and stop words using the SMART stop list13 (Salton, 1971). At this stage, we modified the stop word list by adding 100 more words in order to make it compatible with the work domain. Moreover, we mapped all emails, URLs, and currencies into a single common token for each.

4.2 Features Extraction

After pre-processing the text, we used two famous text representation techniques. A) Bag-of-Words (BOW) is a well-known model for text representation that extracts the features from the text corpus by counting the word frequencies. Consequently, every document is represented as a sparse feature vector where every feature corresponds to a single word in the training corpus. B) The Term Frequency-Inverse Document Frequency (TF-IDF) model (Aizawa, 2003) is a statistical model that assigns weights to the vocabulary, emphasizing the words that occur frequently in a given document while at the same time de-emphasizing words that occur frequently in many documents. Even though BOW and TF-IDF do not take the word order into consideration, they are simple, computationally efficient, and compatible with medium dataset sizes.

4.3 Classifier Selection

For each feature representation method, we examined three different supervised machine learning algorithms: Support Vector Machine (SVM) (Suykens and Vandewalle, 1999), Logistic Regression (LR) (Hosmer Jr and Lemeshow, 2004), and Naive Bayes (NB) (McCallum et al., 1998).

5 Empirical Evaluation

5.1 Experimental Setting

Since the purpose of this paper is to classify the Darknet illegal activities, we selected a subset of our DUTA dataset by creating eight categories that try to cover the most representative illegal activities on the Darknet. Another condition that we imposed was that each class in the selected subset should be monotopic (i.e. related to a single category) and contain a sufficient amount of samples (i.e. 40 samples minimum). The rest of the classes are assigned to a 9th category, which we called Others. Since we are working on classifying the illegal activities, we did not consider the class Black-Market in the training set, because its contents are related to more than one class at a single time and we wanted the classifier to learn from pure patterns. Moreover, when a sample contains relevant images but an irrelevant text or no textual information, we excluded it from the dataset. Therefore, we had 5,635 samples distributed over nine classes, i.e. the eight classes plus the Others one (Table 2). After the text pre-processing, we got 5,002 samples, split into a training set that contains 3,501 samples and a testing set of 1,501 samples.

Main Class                                                          | Count
Pornography                                                         | 963
Cryptocurrency                                                      | 578
Counterfeit Credit Cards                                            | 209
Drugs                                                               | 169
Violence                                                            | 60
Hacking                                                             | 57
Counterfeit Money                                                   | 46
Counterfeit Personal Identification (Driving-License, ID, Passport) | 40
Others                                                              | 3513

Table 2: Illegal activities dataset classes (a portion of the DUTA dataset)

The dataset is highly unbalanced, since the largest class has 3,513 samples while the smallest one has only 40 samples. We addressed the skew in the dataset thanks to the class-weight parameter in the Scikit-Learn library14, which assigns each class a weight inversely proportional to the number of samples it has (Hauck, 2014). In addition to adjusting the class weights, we split up forums by discussion page (see Section 3.2).

For the model tuning, we applied a grid search over different combinations of parameters with a 10-fold cross-validation. The successful combination, which corresponds to the selected classification pipeline, is the one that achieves the highest value of the averaged F1 score metric and the highest 10-fold cross-validation accuracy.

We used Python3 with the Scikit-Learn machine learning library for the pipeline implementations. We modified the parameters that have a critical influence on the performance of the models. For the BOW dictionary, we set it to 30,000 words with a minimum word frequency of 3, and we left the rest of the parameters to default. Regarding the TF-IDF, we set the maximum feature vector length to 10,000 and the minimum to 3.

13 http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/
14 http://scikit-learn.org/
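The TF-IDF weighting of Section 4.2 can be made concrete with a toy implementation (ours; scikit-learn's TfidfVectorizer additionally applies smoothing and L2 normalization on top of this basic scheme):

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: a list of tokenized documents.
    weight(t, d) = tf(t, d) * log(N / df(t)): high for terms frequent
    within a document, low for terms spread across many documents."""
    n = len(corpus)
    df = Counter()                      # document frequency per term
    for doc in corpus:
        df.update(set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in corpus]
```

A term that occurs in every document gets weight 0, which is exactly the de-emphasis of corpus-wide words the section describes.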

With respect to the classifier parameters, we kept the default settings for the NB. In contrast, for the LR, we modified only the value of the regularization parameter "C", setting it to 10 with the balanced class-weight flag activated. For the SVM classifier, we set the decision function parameter to one-vs-rest ("ovr"), the kernel to "RBF", the "C" parameter to 10e5, and balanced class weights, and the rest were left to default.

5.2 Results and Discussion

Since we are working on an unbalanced multiclass problem, every class has a precision, a recall, and an F1 score. To combine these values into a single one, we calculated the macro, micro, and weighted averages, as Table 3 shows. We can see that the pipeline of TF-IDF with LR achieves the highest values, with a macro F1 score of 93.7% and the highest cross-validation accuracy of 96.6%. The state-of-the-art paper achieved 94% accuracy on a different dataset that contains 1K samples (Intelliagg, 2015).

Methods   | CV Accuracy     | Metric | Macro | Micro | Weighted
BOW LR    | 0.958 +/- 0.010 | P      | 0.952 | 0.965 | 0.965
          |                 | R      | 0.889 | 0.965 | 0.965
          |                 | F1     | 0.916 | 0.965 | 0.964
TFIDF LR  | 0.966 +/- 0.010 | P      | 0.982 | 0.974 | 0.975
          |                 | R      | 0.902 | 0.974 | 0.974
          |                 | F1     | 0.937 | 0.974 | 0.974
BOW SVM   | 0.932 +/- 0.013 | P      | 0.877 | 0.941 | 0.942
          |                 | R      | 0.875 | 0.941 | 0.941
          |                 | F1     | 0.874 | 0.941 | 0.941
TFIDF SVM | 0.960 +/- 0.011 | P      | 0.983 | 0.971 | 0.972
          |                 | R      | 0.882 | 0.971 | 0.971
          |                 | F1     | 0.924 | 0.971 | 0.970
BOW NB    | 0.924 +/- 0.009 | P      | 0.865 | 0.941 | 0.943
          |                 | R      | 0.790 | 0.941 | 0.941
          |                 | F1     | 0.812 | 0.941 | 0.940
TFIDF NB  | 0.863 +/- 0.012 | P      | 0.530 | 0.885 | 0.855
          |                 | R      | 0.425 | 0.885 | 0.885
          |                 | F1     | 0.460 | 0.885 | 0.860

Table 3: A comparison between the classification pipelines with respect to 10-fold cross-validation accuracy (CV), precision (P), recall (R), and F1 score metrics for macro, micro, and weighted averaging.
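The macro/micro gap visible in Table 3 follows directly from how the averages weight the classes. The sketch below is our own illustration, not the paper's evaluation code: macro-averaging gives every class the same weight, so a badly classified rare class drags the score down, while micro-averaging pools the counts and is dominated by the large classes.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_and_micro(per_class_counts):
    """per_class_counts: one (tp, fp, fn) tuple per class."""
    scores = [prf(*c) for c in per_class_counts]
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    micro = prf(*(sum(c[i] for c in per_class_counts) for i in range(3)))
    return macro, micro

# A frequent class classified well and a rare class with poor recall:
macro, micro = macro_and_micro([(90, 10, 10), (1, 0, 9)])
```

Here the rare class pulls macro recall down to 0.5 while micro recall stays above 0.8, mirroring why the macro columns of Table 3 are consistently the lowest.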
Additionally, we plot the macro-averaged precision-recall curve for four classifiers (Figure 2). The plot indicates that the pipeline of TF-IDF with LR achieves the highest precision-recall.

Figure 2: Macro-averaging precision-recall curve over 4 pipelines, where the area value corresponds to the macro-average precision-recall curve.

Figure 3 shows the F1 score comparison between the six classification pipelines over the nine classes. We can see that the classes Counterfeit Credit Cards and Hacking have a low F1 score over all the pipelines, which is due to several reasons. Firstly, the word interference between the classes: for example, the websites which offer counterfeit credit card services most probably "hack" the credit card system or "attack" the PayPal accounts, using sentences like "We hack credit card" or "Hacked PayPal account for sale". Moreover, those classes intersect with the Counterfeit Personal Identification class due to their similarity from the perspective of forgery. Secondly, the number of samples used for training plays an important role during the learning phase; e.g. the class Violence has only 60 samples.

Nevertheless, the learning curve for the TF-IDF with LR pipeline in Figure 4 proves that the algorithm is learning correctly: the validation accuracy curve rises and the classification accuracy improves as the number of samples increases, while the training accuracy curve starts to decrease slightly. The high accuracy achieved will help to build a solid model that is able to detect illegal activity on the Darknet.

6 Application and Implementation

The work presented in the previous sections has been included into an application that can be accessed and tested through a web interface. The implementation of the methods was developed in Python3, using the Nltk library to stem the document text, the Langdetect library to detect the language of



Figure 3: F1 score comparison for each class for the 6 classification pipelines. When a bar is not shown, it means that its value is zero.

Figure 5: The application has three interfaces. (a) Pipeline selection. (b) The HS content preview. (c) The classification result.

the documents, and the Scikit-learn library to build the classifiers. The web application is made up of three views: one for algorithm selection, a second one for the selection of the data to analyze, and a third one for showing the results of the analysis (Figure 5).

Figure 4: Learning curve for TF-IDF with the LR classifier.

The Docker image is not publicly available, and neither are the applications, but upon email request we will grant temporary access to the web interface.

7 Conclusions and Future Work

In this paper, we have categorized the illegal activities of Tor HS by using two text representation methods, TF-IDF and BOW, combined with three classifiers, SVM, LR, and NB. To support the classification pipelines, we built the dataset DUTA, containing 7K samples labeled manually into 26 categories. We picked out nine classes, including the Others class, that are related only to illegal activities, e.g. drug trading and child pornography, and we used them for training our model. Furthermore, we distinguished the critical aspects that affect the classification pipeline results: in terms of text representation, the dictionary size and the minimum word frequency influence the performance of the representation techniques, and the regularization parameter influences the LR and the SVM classifiers. We found that the combination of the TF-IDF text representation with the Logistic Regression classifier can achieve 96.6% accuracy over 10 folds of cross-validation and a 93.7% macro F1 score. We noticed that our classifier suffers from overfitting due to the difficulty of reaching more samples of onion hidden services for some classes, like counterfeit personal identification or illegal drugs. However, our results are encouraging, and yet there is still a wide margin for future improvements. We are looking forward to enlarging the dataset by digging deeper into the Darknet, by adding more HS sources, even from I2P and Freenet, and by exploring ports other than the HTTP port. Moreover, we plan to take advantage of the HTML tags and the hyperlinks by weighting some tags or parsing the hyperlink text. Also, during the manual labeling of the dataset, we realized that a wide portion of the hidden services

advertise their illegal products graphically, i.e. the service owner uses images instead of text. Therefore, our aim is to build an image classifier to work in parallel with the text classification. The high accuracy we have obtained in this work might represent an opportunity to insert our research into a tool that supports the authorities in monitoring the Darknet.

8 ACKNOWLEDGEMENT

This research was funded by the agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under addendum 22. We also want to thank Francisco J. Rodríguez (INCIBE) for providing us the .onion web pages used to create the dataset.

References

Akiko Aizawa. 2003. An information-theoretic perspective of tf-idf measures. Information Processing & Management, 39(1):45–65.

Luciano Barbosa, Juliana Freire, and Altigran Silva. 2007. Organizing hidden-web databases by clustering visible web documents. In 2007 IEEE 23rd International Conference on Data Engineering, pages 326–335. IEEE.

Michael K. Bergman. 2001. White paper: the deep web: surfacing hidden value. Journal of Electronic Publishing, 7(1).

Alex Biryukov, Ivan Pustogarov, Fabrice Thill, and Ralf-Philipp Weinmann. 2014. Content and popularity analysis of tor hidden services. In 2014 IEEE 34th International Conference on Distributed Computing Systems Workshops (ICDCSW), pages 188–193. IEEE.

Wendy Boswell. 2016. How to mine the invisible web: The ultimate guide.

V. Ciancaglini, M. Balduzzi, R. McArdle, and M. Rösler. 2016. Below the surface: Exploring the deep web. Trend Micro Incorporated.

Susan Dumais and Hao Chen. 2000. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 256–263. ACM.

Michael Graczyk and Kevin Kinningham. 2015. Automatic product categorization for anonymous marketplaces.

Trent Hauck. 2014. scikit-learn Cookbook. Packt Publishing Ltd.

David W. Hosmer Jr. and Stanley Lemeshow. 2004. Applied Logistic Regression. John Wiley & Sons.

Intelliagg. 2015. Deep light: Shining a light on the dark web. Magazine.

Min-Yen Kan and Hoang Oanh Nguyen Thi. 2005. Fast webpage classification using url features. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 325–326. ACM.

Min-Yen Kan. 2004. Web page classification without the web page. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pages 262–263. ACM.

Prabhjot Kaur. 2014. Web content classification: A survey. arXiv preprint arXiv:1405.0580.

S. Mohammadreza Khelghati. 2016. Deep Web Content Monitoring. Ph.D. thesis.

Matthieu Latapy, Clemence Magnien, and Raphael Fournier. 2013. Quantifying paedophile activity in a large p2p system. Information Processing & Management, 49(1):248–263.

Peiguang Lin, Yibing Du, Xiaohua Tan, and Chao Lv. 2008. Research on automatic classification for deep web query interfaces. In Information Processing (ISIP), 2008 International Symposiums on, pages 313–317. IEEE.

Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Citeseer.

Daniel Moore and Thomas Rid. 2016. Cryptopolitik and the darknet. Survival, 58(1):7–38.

Colin Morris and Graeme Hirst. 2012. Identifying sexual predators by svm classification with lexical and behavioral features. In CLEF (Online Working Notes/Labs/Workshop), volume 12, page 29.

Satoshi Nakamoto. 2008. Bitcoin: A peer-to-peer electronic cash system.

Umara Noor, Zahid Rashid, and Azhar Rauf. 2011. A survey of automatic deep web classification techniques. International Journal of Computer Applications, 19(6):43–50.

Claudia Peersman, Christian Schulze, Awais Rashid, Margaret Brennan, and Carl Fischer. 2014. icop: Automatically identifying new child abuse media in p2p networks. In Security and Privacy Workshops (SPW), 2014 IEEE, pages 124–131. IEEE.

Dorit Ron and Adi Shamir. 2014. How did dread pirate roberts acquire and protect his bitcoin wealth? In International Conference on Financial Cryptography and Data Security, pages 3–15. Springer.

Dakota S. Rudesill, James Caverlee, and Daniel Sui. 2015. The deep web and the darknet: A look inside the internet's massive black box. Woodrow Wilson International Center for Scholars, STIP, 3.

Gerard Salton. 1971. The SMART retrieval system: experiments in automatic document processing.

Weifeng Su, Jiying Wang, and Frederick Lochovsky. 2006. Automatic hierarchical classification of structured deep web databases. In International Conference on Web Information Systems Engineering, pages 210–221. Springer.

Aixin Sun, Ee-Peng Lim, and Wee-Keong Ng. 2002. Web classification using support vector machine. In Proceedings of the 4th International Workshop on Web Information and Data Management, pages 96–99. ACM.

Johan A.K. Suykens and Joos Vandewalle. 1999. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300.

Xuefeng Xian, Pengpeng Zhao, Wei Fang, Jie Xin, and Zhiming Cui. 2009. Automatic classification of deep web databases with simple query interface. In Industrial Mechatronics and Automation, 2009. ICIMA 2009. International Conference on, pages 85–88. IEEE.

He-Xiang Xu, Xiu-Lan Hao, Shu-Yun Wang, and Yun-Fa Hu. 2007. A method of deep web classification. In 2007 International Conference on Machine Learning and Cybernetics, volume 7, pages 4009–4014. IEEE.

Pengpeng Zhao, Li Huang, Wei Fang, and Zhiming Cui. 2008. Organizing structured deep web by clustering query interfaces link graph. In International Conference on Advanced Data Mining and Applications, pages 683–690. Springer.
