The Language of Legal and Illegal Activity on the Darknet
Total Page:16
File Type:pdf, Size:1020Kb
The Language of Legal and Illegal Activity on the Darknet Leshem Choshen∗ Dan Eldad∗ Daniel Hershcovich∗ Elior Sulem∗ Omri Abend School of Computer Science and Engineering The Hebrew University of Jerusalem {leshem.choshen,dan.eldad1,daniel.hershcovich, elior.sulem,omri.abend}@mail.huji.ac.il Abstract fact, the only work we are aware of that classi- fied Darknet texts into legal and illegal activity is The non-indexed parts of the Internet (the Avarikioti et al.(2018), but they too did not inves- Darknet) have become a haven for both le- tigate in what ways these two classes differ. gal and illegal anonymous activity. Given the magnitude of these networks, scalably moni- This paper addresses this gap, and studies the toring their activity necessarily relies on auto- distinguishing features between legal and illegal mated tools, and notably on NLP tools. How- texts in Onion sites, taking sites that advertise ever, little is known about what characteristics drugs as a test case. We compare our results texts communicated through the Darknet have, to a control condition of texts from eBay2 pages and how well off-the-shelf NLP tools do on that advertise products corresponding to drug key- this domain. This paper tackles this gap and words. performs an in-depth investigation of the char- acteristics of legal and illegal text in the Dark- We find a number of distinguishing features. net, comparing it to a clear net website with First, we confirm the results of Avarikioti et al. similar content as a control condition. Tak- (2018), that text from legal and illegal pages ing drug-related websites as a test case, we (henceforth, legal and illegal texts) can be distin- find that texts for selling legal and illegal drugs guished based on the identity of the content words have several linguistic characteristics that dis- (bag-of-words) in about 90% accuracy over a bal- tinguish them from one another, as well as anced sample. Second, we find that the distri- from the control condition, among them the distribution of POS tags, and the coverage of bution of POS tags in the documents is a strong their named entities in Wikipedia.1 cue for distinguishing between the classes (about 71% accuracy). This indicates that the two classes 1 Introduction are different in terms of their syntactic structure. Third, we find that legal and illegal texts are The term “Darknet” refers to the subset of Inter- roughly as distinguishable from one another as le- net sites and pages that are not indexed by search gal texts and eBay pages are (both in terms of their engines. The Darknet is often associated with the words and their POS tags). The latter point sug- “.onion” top-level domain, whose websites are re- gests that legal and illegal texts can be consid- ferred to as “Onion sites”, and are reachable via ered distinct domains, which explains why they arXiv:1905.05543v2 [cs.CL] 4 Jun 2019 the Tor network anonymously. can be automatically classified, but also implies Under the cloak of anonymity, the Darknet har- that applying NLP tools to Darknet texts is likely bors much illegal activity (Moore and Rid, 2016). to face the obstacles of domain adaptation. Indeed, Applying NLP tools to text from the Darknet is we show that named entities in illegal pages are thus important for effective law enforcement and covered less well by Wikipedia, i.e., Wikification intelligence. However, little is known about the works less well on them. This suggests that for characteristics of the language used in the Dark- high-performance text understanding, specialized net, and specifically on what distinguishes text on knowledge bases and tools may be needed for pro- websites that conduct legal and illegal activity. In cessing texts from the Darknet. By experiment- ∗ Equal contribution ing on a different domain in Tor (user-generated 1Our code can be found in https://github.com/ huji-nlp/cyber. Our data is available upon request. 2https://www.ebay.com content), we show that the legal/illegal distinction hidden services correspond to “suspicious” activ- generalizes across domains. ities. The analysis was conducted using the text After discussing previous works in Section2, classifier presented in Al Nabki et al.(2017) and we detail the datasets used in Section3. Differ- manual verification. ences in the vocabulary and named entities be- Recently, Avarikioti et al.(2018) presented an- tween the classes are analyzed in Section4, before other topic classification of text from Tor together the presentation of the classification experiments with a first classification into legal and illegal (Section5). Section6 presents additional experi- activities. The experiments were performed on ments, which explore cross-domain classification. a newly crawled corpus obtained by recursive We further analyze and discuss the findings in Sec- search. The legal/illegal classification was done tion7. using an SVM classifier in an active learning set- ting with bag-of-words features. Legality was as- 2 Related Work sessed in a conservative way where illegality is assigned if the purpose of the content is an obvi- The detection of illegal activities on the Web is ously illegal action, even if the content might be sometimes derived from a more general topic clas- technically legal. They found that a linear kernel sification. For example, Biryukov et al.(2014) worked best and reported an F1 score of 85% and classified the content of Tor hidden services into an accuracy of 89%. Using the dataset of Al Nabki 18 topical categories, only some of which corre- et al.(2019), and focusing on specific topical cate- late with illegal activity. Graczyk and Kinning- gories, we here confirm the importance of content ham(2015) combined unsupervised feature selec- words in the classification, and explore the linguis- tion and an SVM classifier for the classification of tic dimensions supporting classification into legal drug sales in an anonymous marketplace. While and illegal texts. these works classified Tor texts into classes, they did not directly address the legal/illegal distinc- 3 Datasets Used tion. Some works directly addressed a specific type Onion corpus. We experiment with data from of illegality and a particular communication con- Darknet websites containing legal and illegal ac- text. Morris and Hirst(2012) used an SVM classi- tivity, all from the DUTA-10K corpus (Al Nabki fier to identify sexual predators in chatting mes- et al., 2019). We selected the “drugs” sub-domain sage systems. The model includes both lexical as a test case, as it is a large domain in the corpus, features, including emoticons, and behavioral fea- that has a “legal” and “illegal” sub-categories, and tures that correspond to conversational patterns. where the distinction between them can be reliably Another example is the detection of pedophile made. These websites advertise and sell drugs, of- activity in peer-to-peer networks (Latapy et al., ten to international customers. While legal web- 2013), where a predefined list of keywords was sites often sell pharmaceuticals, illegal ones are used to detect child-pornography queries. Besides often related to substance abuse. These pages are lexical features, we here consider other general directed by sellers to their customers. linguistic properties, such as syntactic structure. eBay corpus. As an additional dataset of sim- Al Nabki et al.(2017) presented DUTA (Dark- ilar size and characteristics, but from a clear net net Usage Text Addresses), the first publicly source, and of legal nature, we compiled a corpus available Darknet dataset, together with a man- of eBay pages. eBay is one of the largest hosting ual classification into topical categories and sub- sites for retail sellers of various goods. Our cor- categories. For some of the categories, legal and pus contains 118 item descriptions, each consist- illegal activities are distinguished. However, the ing of more than one sentence. Item descriptions automatic classification presented in their work fo- vary in price, item sold and seller. The descrip- cuses on the distinction between different classes tions were selected by searching eBay for drug re- of illegal activity, without addressing the distinc- lated terms,3 and selecting search patterns to avoid tion between legal and illegal ones, which is the over-repetition. For example, where many sell subject of the present paper. Al Nabki et al.(2019) the same product, only one example was added to extended the dataset to form DUTA-10K, which we use here. Their results show that 20% of the 3Namely, marijuana, weed, grass and drug. All Onion eBay Illegal Onion Legal Onion all half 1 half 2 all half 1 half 2 all half 1 half 2 all half 1 half 2 all 0.23 0.25 0.60 0.61 0.61 0.33 0.39 0.41 0.35 0.41 0.42 All Onion half 1 0.23 0.43 0.60 0.62 0.62 0.37 0.33 0.50 0.40 0.36 0.52 half 2 0.25 0.43 0.61 0.62 0.62 0.39 0.50 0.35 0.39 0.51 0.35 all 0.60 0.60 0.61 0.23 0.25 0.59 0.60 0.60 0.66 0.67 0.67 eBay half 1 0.61 0.62 0.62 0.23 0.43 0.60 0.61 0.61 0.67 0.67 0.68 half 2 0.61 0.62 0.62 0.25 0.43 0.60 0.61 0.61 0.67 0.68 0.68 all 0.33 0.37 0.39 0.59 0.60 0.60 0.23 0.27 0.61 0.62 0.62 Illegal Onion half 1 0.39 0.33 0.50 0.60 0.61 0.61 0.23 0.45 0.62 0.63 0.62 half 2 0.41 0.50 0.35 0.60 0.61 0.61 0.27 0.45 0.62 0.63 0.63 all 0.35 0.40 0.39 0.66 0.67 0.67 0.61 0.62 0.62 0.26 0.26 Legal onion half 1 0.41 0.36 0.51 0.67 0.67 0.68 0.62 0.63 0.63 0.26 0.47 half 2 0.42 0.52 0.35 0.67 0.68 0.68 0.62 0.62 0.63 0.26 0.47 Table 1: Jensen-Shannon divergence between word distribution in all Onion drug sites, Legal and Illegal Onion drug sites, and eBay sites.