U. C. Berkeley Dissertation
Total Page:16
File Type:pdf, Size:1020Kb
The Dark Net: De-Anonymization, Classification and Analysis by Rebecca Sorla Portnoff A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor David Wagner, Chair Professor Vern Paxson Professor David Bamman Spring 2018 The Dark Net: De-Anonymization, Classification and Analysis Copyright c 2018 by Rebecca Sorla Portnoff 1 Abstract The Dark Net: De-Anonymization, Classification and Analysis by Rebecca Sorla Portnoff Doctor of Philosophy in Computer Science University of California, Berkeley Professor David Wagner, Chair The Internet facilitates interactions among human beings all over the world, with greater scope and ease than we could have ever imagined. However, it does this for both well-intentioned and malicious actors alike. This dissertation focuses on these malicious persons and the spaces online that they inhabit and use for profit and plea- sure. Specifically, we focus on three main domains of criminal activity on the clear web and the Dark Net: classified ads advertising trafficked humans for sexual services, cy- ber black-market forums, and Tor onion sites hosting forums dedicated to child sexual abuse material (CSAM). In the first domain, we develop tools and techniques that can be used separately and in conjunction to group Backpage sex ads by their true author (and not the claimed author in the ad). Sites for online classified ads selling sex are widely used by human traffickers to support their pernicious business. The sheer quantity of ads makes manual exploration and analysis unscalable. In addition, discerning whether an ad is advertising a trafficked victim or an independent sex worker is a very difficult task. Very little concrete ground truth (i.e., ads definitively known to be posted by a trafficker) exists in this space. In the first chapter of this dissertation, we develop a machine learning classifier that uses stylometry to distinguish between ads posted by the same vs. different authors with 90% TPR and 1% FPR. We also design a linking technique that takes advantage of leakages from the Bitcoin mempool, blockchain and sex ad site, to link a subset of sex ads to Bitcoin public wallets and transactions. Finally, we demonstrate via a 4-week proof of concept using Backpage as the sex ad site, how an analyst can use these automated approaches to potentially find human traffickers. In the second domain, we develop machine learning tools to classify and extract information from cyber black-market forums. Underground forums are widely used by criminals to buy and sell a host of stolen items, datasets, resources, and crimi- nal services. These forums contain important resources for understanding cybercrime. However, the number of forums, their size, and the domain expertise required to under- stand the markets makes manual exploration of these forums unscalable. In the second 2 chapter of this dissertation, we propose an automated, top-down approach for analyz- ing underground forums. Our approach uses natural language processing and machine learning to automatically generate high-level information about underground forums, first identifying posts related to transactions, and then extracting products and prices. We also demonstrate, via a pair of case studies, how an analyst can use these automated approaches to investigate other categories of products and transactions. We use eight distinct forums to assess our tools: Antichat, Blackhat World, Carders, Darkode, Hack Forums, Hell, L33tCrew and Nulled. Our automated approach is fast and accurate, achieving over 80% accuracy in detecting post category, product, and prices. In the third domain, we develop a set of features for a principal component analysis (PCA) based anomaly detection system to extract producers (those actively abusing children) from the full set of users on Tor CSAM forums. These forums are visited by tens of thousands of pedophiles daily. The sheer quantity of users and posts make manual exploration and analysis unscalable. In the final chapter of this dissertation, we demonstrate how to extract producers from unlabeled, public forum data. We use four distinct forums to assess our tools; these forums remain unnamed to protect law enforcement investigative efforts. We have released our code written for the first two domains, as well as the proof of concept data from the first domain, and a sub-set of the labeled data from the second domain, allowing replication of our results.1 1As of the filing of this dissertation, our code and data are available online at https://github.com/rsportnoff/anti-trafficking-tools and https://evidencebasedsecurity.org/forums/ i To my daughter, Esther Eunmee, and all the other kids. Contents Contents ii List of Figures iv List of Tables v 1 Introduction 1 1.1 Classified Ads Selling Sex . .1 1.2 Cyber Black-Market Forums . .3 1.3 Child Sexual Abuse Material Forums . .4 2 Classified Ads Selling Sex 7 2.1 Introduction . .7 2.2 Related Work . .8 2.2.1 Sex Trafficking Online . .8 2.2.2 Bitcoin . 10 2.3 Datasets . 11 2.3.1 Backpage . 11 2.3.2 Bitcoin . 12 2.4 Author Classifier . 15 2.4.1 Labeling Ground Truth . 15 2.4.2 Models . 15 2.4.3 Validation Results . 16 2.5 Linking Ads to Bitcoin Transactions . 18 2.6 Grouping Ads by Owner . 20 2.6.1 Grouping by Shared Author . 21 2.6.2 Grouping by Persistent Bitcoin Identities . 22 2.7 Case Study . 23 2.7.1 Price Reconstruction . 23 2.7.2 Linking Backpage Ads to Bitcoin Transactions . 24 iii 2.7.3 Results . 26 2.8 Discussion . 33 2.9 Conclusion . 35 3 Cyber Black-Market Forums 36 3.1 Introduction . 36 3.2 Related Work . 37 3.2.1 Ecosystem Analysis . 37 3.2.2 Classification for Black-Markets . 38 3.2.3 NLP Tools . 39 3.3 Forum Datasets . 40 3.4 Automated Processing . 41 3.4.1 Type-of-Post Classification . 42 3.4.2 Product Extraction . 45 3.4.3 Price Extraction . 50 3.4.4 Currency Exchange Extraction . 52 3.5 Analysis . 54 3.5.1 End-to-end error analysis . 54 3.5.2 Broadly Characterizing a Forum . 54 3.5.3 Performance . 54 3.6 Case Studies . 55 3.6.1 Identifying Account Activity . 56 3.6.2 Currency Exchange Patterns . 57 3.7 Conclusion . 57 4 CSAM Forums 60 4.1 Introduction . 60 4.2 Related Work . 61 4.2.1 Anomaly Detection in Social Network Users . 61 4.2.2 CSAM . 62 4.3 Forum Datasets . 66 4.4 Extracting Producers . 67 4.4.1 Labeling Ground Truth . 68 4.4.2 Features . 68 4.4.3 Public Data . 68 4.4.4 Validation Results . 69 4.5 Conclusion . 75 5 Conclusion 76 A 87 iv List of Figures 2.1 Example of how a new transaction is added to Bitcoin's peer-to-peer network. 14 2.2 Example of a false positive case with possibly flawed ground truth. 18 2.3 Example of a false negative case where the ads appear in different sections. 19 2.4 Linking Ads to Bitcoin Transactions . 19 2.5 Shared Authors . 20 2.6 Persistent Bitcoin Identities . 22 3.1 Example post and annotations from Darkode, with one sentence per line. We underline annotated product tokens. The second exhibits our annotations of both the core product (mod DCIBot) and the method for obtaining that product (sombody). .................... 46 3.2 Number of transactions of each type observed in each forum. Numbers indicate how many posts had each type of transaction. If multiple cur- rencies appear for either side of the transaction, then we apportion a fraction to each option. Colors indicate... 58 4.1 Number of Producers in top % of Ranked Users, with varying Principal Components: Forum 1 . 70 4.2 Number of Producers in top % of Ranked Users, with varying Principal Components: Forum 2 . 71 4.3 Number of Producers in top % of Ranked Users, with varying Principal Components: Forum 3 . 72 4.4 Number of Producers in top % of Ranked Users, with varying Principal Components: Forum 4 . 73 4.5 Number of Producers in top % of Ranked Users, with varying Principal Components: All Forums . 74 v List of Tables 2.1 Backpage . 12 2.2 Classification accuracy for same vs. different author. 17 2.3 Distribution of Ads by Category during 4-week Case Study . 22 2.4 Price calculation correctness. 24 2.5 Transaction to ad linking correctness (Exact Match - EM, Multiple Match - MM). 26 2.6 Multiple match transaction to No. ads matched. 27 2.7 Number of Pairs of Authors Linked within SA Wallet using location and Jaccard . 27 2.8 SA Wallets with Zero FP by Location and Zero FP by Jaccard threshold 0.9...................................... 28 2.9 SA Wallets with between 0%-25% of pairs of authors linked by Jaccard threshold 0.9 . 29 2.10 Number of Transactions Linked to Exact Match within PBI single exact match Wallet using author, area code and location . 31 2.11 Number of Pairs of Authors Linked within MEM Wallet using author, area code and location for Exact Matches . 32 2.12 Final SA Cluster Statistics by Jaccard > 0.9 . 33 2.13 Final PBI Statistics . 33 3.1 General properties of the forums considered. 40 3.2 Number of labeled posts by class for type-of-post classification. 44 3.3 Classification accuracy on the buy vs. sell task. MFL refers to a baseline that simply returns the most frequent label in the training set. Note that the test sets for within-forum evaluation... 44 3.4 Classification accuracy on the buy vs.