Data Mining: A Canadian Cyber-Crime Perspective

Edward Crowder, Institute of Technology and Advanced Learning, Sheridan College, Oakville, Canada, [email protected]
Jay Lansiquot, Institute of Technology and Advanced Learning, Sheridan College, Oakville, Canada, [email protected]

Abstract— Exploring the darknet can be a daunting task; this paper explores the application of data mining the darknet from a Canadian cybercrime perspective. Measuring activity through marketplace analysis and vendor attribution has proven difficult in the past. We observe different aspects of the darknet and implement methods of monitoring and collecting data, in the hope of connecting contributions to darknet marketplaces to and from Canada. The significant findings include a small Canadian presence, measurements of the product categories, and the attribution of one cross-marketplace vendor through data visualization. The results were made possible through a multi-stage processing pipeline, including data crawling, scraping, and parsing. The primary future work includes enhancing the pipeline to cover other media, such as web forums, chatrooms, and emails. Applying machine learning models such as natural language processing or sentiment analysis could prove beneficial during investigations.

Keywords— Darknet, Canada, Marketplace, Data Mining, Privacy, Threat Intelligence, Cybersecurity, Cybercrime

I. INTRODUCTION

To fully understand the threat landscape, one must first correctly identify and fully understand the threat model of an enterprise or country. This research project explores the darknet and its applications by means of data mining. The results include an analysis of current and past darknet marketplaces, a data model capable of supporting further machine learning for indicators of compromise (IOC) analysis, and a value analysis for identifying threats in the darknet. We present a sample application that includes a web interface of organized threat information, visualized for a qualified analyst to make strategic cyber decisions.

To maintain a common terminology, the darknet is a resource that cannot be accessed without The Onion Router ("Tor") [1, 3]. The Tor Project, a 501(c)(3) US nonprofit, advocates human rights and the defense of a user's privacy online through free software and open networks [16]. There are many benefits to the darknet, such as online anonymity and enhanced privacy [20]. However, in a recent survey of 25,229 general internet users by the Centre for International Governance Innovation (CIGI), conducted across North America, Latin America, Europe, the Middle East, Africa, and the Asia-Pacific region, respondents perceived these same tools as exacerbating the growth of the darknet [19]. The project discusses the risks associated with using the darknet as both a user and a cybercrime analyst, followed by an outline of the proposed system design.

Considering the benefits and threats the darknet may pose is therefore worthwhile. There was an increase in Tor network usage, specifically in Canada [16], between 2019 and 2020. At a minimum, this project may uncover why more Canadians turn to the Tor network, as shown in Fig. 1, and what the intended usage is.

Fig. 1. Tor relay usage for Canadian users between 2019-01-01 and 2020-06-27 [16]

II. RELATED WORK

Web crawlers have been around since the early 1990s [10], the most notable of all web crawler projects being S. Brin and L. Page's Google [11]. As the internet grew, it segregated into multiple layers, known as the Clearnet, Deepnet, and Darknet [12]. Exploring the Darknet provides many benefits if done correctly and in a time-sensitive manner. Researchers have found great value in extracting indicators for use in private companies, government, and personal protection [2, 4, 5, 6].

As a result, many large data sets, such as the Darknet Marketplace Archive (DNM) covering 2011-2015 [7], are publicly available. However, it was decided not to use the provided data sets, for several reasons. First, with only 6 of the 89 DNMs remaining accessible [3], data-mining active markets would be a better representation of the current technologies used in new darknet marketplaces than referring to dead sites.

Understanding the potential ethical, moral, and physical risks surrounding the darknet is also essential to keep in mind. Martin et al. [1] explore the significant uncertainties regarding the ethical dimensions of cryptomarket research. Furthermore, the fact that there are so many different environments (e.g., web pages, chat rooms, e-mail) and that new ones are continually emerging means that explicit [ethical] rules are not possible [1]. Ethical problems are demonstrated by example, through the use of known ethical principles and collaborations with others involved in the study of cryptomarkets. Martin et al. further discuss risk and threat assessment, geographical concerns, copyright issues, the effects on the public, and academic research considerations, such as determining national jurisdiction, self-critical awareness of the potential for bias, and many more. Their research concludes with open-ended questions for researchers to consider, encouraging metacognition regarding the decisions within their projects.

Dittus et al. [4] performed a large-scale systematic data collection of the darknet in mid-2017, which claimed to cover 80% of the darknet. Their findings show that 70% of global trades are attributable to the "top five" countries: USA, U.K., Australia, Germany, and Holland. Their research shows Canada falls in sixth place within their findings. The research also suggests that the darknet is not revolutionizing this crime; it changes only the "last mile", and only in high-consumer countries, leaving old trafficking routes intact [4, 13].

Nunes et al. [2] present an operational system for cyber threat intelligence gathering from various sites on the Darknet. Nunes et al. focused on malicious indicators such as threat actor names, private sales of data, and executables, which they utilized to fulfill their primary intelligence requirements for emerging threat detection. The creation of a focused web crawler, as opposed to a generic web crawler, was required to collect a vast amount of data. Static processing was done after mass collection to extract indicators of interest. Specializing in cross-site connections, Nunes et al. created a connected graph depicting their indicator attributions to underground threat actor profiles. Lawrence et al. [3] continue work in this direction with their product, D-Miner. D-Miner is a darknet-focused web scraper that collects and parses out specific darknet marketplace features. By utilizing JSON, Lawrence et al. gain the benefits of indexing the data in Elasticsearch. Elasticsearch is a search engine based on the Apache Lucene library. The power of Elasticsearch allows Lawrence et al. to utilize features such as full-text search and REST APIs [9]. Data visualization is made possible through Kibana, an open-source data visualization dashboard for Elasticsearch [14].

A primary issue surrounding the Nunes et al. project was the use of anti-scraping technologies deployed on the darknet. The solution proposed was Death By Captcha (DBC), a paid service that solves CAPTCHA codes to automate the solution [3].

Hayes et al. [5] take a similar approach in their analysis by identifying the vendors. They explore the use of AppleScript and the Maltego investigation platform to generate a cross-site threat actor connected graph [15]. Notably, the authors outsourced the requirement of solving CAPTCHAs to the analyst, and the extraction and enrichment process to Maltego's built-in investigation transformers. Combined, this made a robust framework for the manual analyst. However, it was not scalable to the extent desired, and it was therefore decided to continue with Python and custom interfacing options.

III. RISK ASSESSMENT

This project uses passive fieldwork; that is, it only observes publicly available material that does not require direct communication with, or response to, possible nefarious actors, except for account verification. The alternative to this would be active fieldwork, which would include participation within the darknet communities [21].

The publishing of this research topic is a prime example of an ethical dilemma Martin et al. discussed. This paper's results could negatively influence the public funding surrounding the darknet drug trade. The project aims to create a system where information is provided for a qualified analyst to weigh in from their experience and not overstate risks.

Fig. 2. Active fieldwork exception for an account verification post on a darknet marketplace using the username "Olaf"

Identity and geolocation protection is made possible by utilizing the Amazon Web Services ("AWS") public cloud as a technical security mechanism. If the crawler, scraper, or other communication with the darknet were somehow compromised, due to misconfiguration or otherwise, the researcher's primary machines would remain safe. AWS provided an additional hop beyond the Tor circuits, among many other development benefits.

Secondly, one nuance of operating this project within Canada is the law surrounding illicit images found on the darknet. The current minimum penalty for possession of, or "accessing," child sexual exploitation material is six months of imprisonment [1], and the amount of child sexual exploitation material detected online by law enforcement and the private sector continues to increase [22]. The web crawlers and scrapers are, by design, as focused as possible to prevent crawling outside the intended scope. A security mechanism to avoid downloading illicit images is made possible by never crawling to URLs contained within image tags. This method of risk mitigation avoids the collection of image content while losing very little analytical value, and it altogether prevents the possession of any illicit images that could create legal exposure. All HTML files collected are then parsed, extracting only the relevant information, and deleted promptly afterward. This aids in allocating more space for new data but, more importantly, protects the user doing the scraping and crawling.

Furthermore, Dittus et al. identified some marketplaces that outline Terms of Service (ToS) restricting web crawling on their websites, similar to legally operating businesses. To make data collection possible, we acknowledge and accept the risk of violating these terms of service. Arguably, the public marketplaces accessed are available to the public.
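As a rough illustration of the focused-crawl rule described above, the following minimal Python sketch collects only same-site anchor links from a saved marketplace page and never queues anything referenced by an image tag. The function and variable names are our own illustrations, not the project's actual code.

# Minimal sketch of the "skip image URLs" risk mitigation, assuming pages are
# parsed with BeautifulSoup; names below are illustrative, not the project's code.
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def extract_crawlable_links(page_html, page_url, allowed_host):
    """Return in-scope anchor URLs only; image sources are never collected."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"])
        # Stay inside the target marketplace to avoid crawling out of scope.
        if urlparse(absolute).netloc == allowed_host:
            links.add(absolute)
    # Note: soup.find_all("img") is deliberately never consulted, so image URLs
    # are never downloaded or stored.
    return links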

IV. DATA COLLECTION

Scraping the darknet is almost the same process as scraping the clearnet. The HyperText Markup Language ("HTML") of the page has the contents of interest; it is just a matter of extracting the data in a dynamic and practical format that enables flexible indexing.

The technical options for scraping the contents of pages are broad, and it is not a novel function, by the nature of the internet [11]. There are many frameworks and heavily featured libraries that aid in making the scraping process streamlined and efficient. Options for the use case of scraping the Tor network [16] included AppleScript, the Linux utility cURL, and Python coupled with the Beautiful Soup library.

AppleScript is a proprietary scripting language invented by Apple Inc. that aids in the automation of Mac applications [17]. Similar projects tailored to scraping have used AppleScript to automate the process of scraping HTML and partially circumvent security roadblocks such as CAPTCHAs. Martin Dittus et al. used AppleScript to automate the bulk of scraping, and when prompted by a CAPTCHA or other automation blockers, it would send the challenge to a human being to solve manually, and the script would continue to function [5]. AppleScript's utility is compelling; however, due to its proprietary nature (i.e., being locked into using macOS) and lack of scalability, it did not fit the use case for this project.

cURL is a command-line utility that supports a comprehensive set of protocols and is the backbone of numerous applications. cURL, coupled with Python and Bash, presented itself as another avenue to solve the problem of scraping the darknet. With its robust feature set, speed, and extensive documentation, the utility showed promise. However, after attempting early builds of the scraper, the technology was not modular enough for the needs of scraping multiple platforms without a substantial restructuring of the supporting codebase, which in turn became too time-consuming to continue developing with.

Python is an interpreted, high-level programming language [18]; combined with specific libraries such as BeautifulSoup and Requests, it allows for handling HTTP requests in a format that is easy to read and use. Lawrence et al. used Python extensively while building a framework to crawl and scrape the darknet. From a practical standpoint, Python was the ideal technology that could be integrated into the stack and be the backbone of the crawler and scraper.
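As a minimal sketch of this Python approach, a single listing page could be fetched through Tor and parsed as follows. This assumes a local Tor SOCKS proxy on port 9050 and the requests[socks] extra installed; the onion address is a placeholder, not a real target.

# Hedged sketch: fetch one marketplace page through Tor and parse it with
# BeautifulSoup. The onion URL is a placeholder.
import requests
from bs4 import BeautifulSoup

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h resolves .onion names via Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_listing(url):
    response = requests.get(url, proxies=TOR_PROXIES, timeout=60)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

if __name__ == "__main__":
    soup = fetch_listing("http://exampleonionaddress.onion/listing/1")
    print(soup.title.get_text(strip=True) if soup.title else "no title")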
The supporting stack included utilities such as cron, a time-based job scheduler for Unix operating systems. Cron is essential to the project, as it helps automate the crawling and scraping of the marketplaces. Rsync, a utility for transferring and synchronizing files between directories and remote computers, was essential for relocating files, compressing the files at their destination, and backing up to a remote computer.

Every marketplace poses its own unique challenges when tackling the issues of crawling and scraping. Security mechanisms such as CAPTCHAs, valid sessions, random URL IDs, and rate limiting were all challenges met while designing and testing the application. The approach to circumvent CAPTCHAs and to acquire a valid session required human interaction at the beginning, as sketched below. If the marketplace utilized encoded or random URL IDs to make iteration harder, the crawler would instead navigate its way to find every link on the marketplace, which would add a significant amount of time to the process. Section IX discusses the implementation of a CAPTCHA solver.
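One way the manually acquired session could be handed to the automated crawler is sketched below, under the assumption that the marketplace issues an ordinary session cookie once the analyst has solved the login CAPTCHA. The cookie name and value are illustrative; this is not the project's actual implementation.

# Sketch: reuse a session cookie obtained manually after solving the CAPTCHA,
# so subsequent automated requests ride on that valid session.
import requests

session = requests.Session()
session.proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
# Value copied by the analyst from the browser after logging in; the cookie
# name is an assumption and differs per marketplace.
session.cookies.set("session_id", "PASTE_VALUE_OBTAINED_MANUALLY")

def get_page(url):
    response = session.get(url, timeout=60)
    response.raise_for_status()
    return response.text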

Fig. 5. Pseudocode of data collection and parsing

Fig. 3. Darknet Scraper and Crawler Overview

Rate limiting was the next hurdle. Every site on the internet handles this differently, and the same is true for darknet marketplaces. Some sites would appropriately return a 429 (Too Many Requests) response when requests were made too quickly while crawling or scraping. Some marketplaces failed silently, and a small portion did not have any rate limiting whatsoever. Rate limiting is a more trivial problem to deal with; figuring out the rate at which the crawler and scraper should run is simply a matter of trial and error. Eventually, the frequency at which requests were sent to each marketplace was tuned to match that of a regular user, circumventing the issue altogether.
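A simple way to pace requests like a regular user and back off on a 429 response is sketched below. The delay values are arbitrary examples, not the figures arrived at in the project.

# Sketch: polite request loop with a fixed delay and exponential backoff on
# HTTP 429 (Too Many Requests). The delays shown are illustrative only.
import time
import requests

def polite_get(session, url, base_delay=5, max_retries=5):
    delay = base_delay
    for _ in range(max_retries):
        response = session.get(url, timeout=60)
        if response.status_code != 429:
            time.sleep(base_delay)        # pace requests like a regular user
            return response
        time.sleep(delay)                 # back off when rate limited
        delay *= 2
    raise RuntimeError(f"Rate limited too many times for {url}")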

The goal of scraping was to collect an inventory of all items listed for sale. These pages generally have to convey some relationship with the buyer, so they list information such as base price, quantity, type of item, category, shipping from and shipping to, as well as some more verbose information such as the vendor name, number of items sold, last logged-in time, and more. Although this is not consistent across every marketplace, it enriches the data collected, and its usefulness in telling a story is further discussed in Section VI.

Fig. 4. Flow of a failed attempt at scraping

V. DATA PROCESSING

After the collection process, the collector has provided HTML files containing the original structure of the advertisements on the target marketplaces. Most of the resulting data is qualitative, such as usernames, countries, product categories, and many more; however, it is also possible to extract quantitative data, such as sales, review count, view count per post, and post frequency by vendors. The following paragraphs describe an overview of the data processing pipeline and considerations regarding data storage, data consistency/redundancy, data availability, and data scalability. They discuss the choices made and the processes applied at each step to arrive at the final solution.

As described in previous sections, the crawler and scraper push their files to an inbox, and the data parsers pick them up and push the results to an outbox and to the Elasticsearch database. The delegation of particular tasks to each part of the process was intentional, to avoid bloating one piece of software with many features. The decoupling of responsibility provides the ability to change and modify code with ease, since the parsers are merely new class files based on parent objects, with slight modifications to fit the varied requirements of many marketplaces.
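The inbox/outbox hand-off described above might look roughly like the following sketch. The directory names, index name, and the single extracted field are simplified assumptions; the real per-marketplace parsers extract the full DNDO field set.

# Sketch of the inbox -> parser -> outbox/Elasticsearch hand-off. Paths, index
# name, and the extracted fields are simplified placeholders.
import json
from pathlib import Path
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumes a local test cluster

def process_inbox(inbox="inbox", outbox="outbox", index="market_example"):
    Path(outbox).mkdir(exist_ok=True)
    for html_file in Path(inbox).glob("*.html"):
        soup = BeautifulSoup(html_file.read_text(errors="ignore"), "html.parser")
        dndo = {
            "title": soup.title.get_text(strip=True) if soup.title else None,
            # ...remaining DNDO fields are filled in by the marketplace-specific parser
        }
        out_path = Path(outbox) / (html_file.stem + ".json")
        out_path.write_text(json.dumps(dndo))
        es.index(index=index, body=dndo)      # push the DNDO into Elasticsearch
        html_file.unlink()                    # delete the raw HTML promptly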

Fig. 6. Darknet Stack System Design Overview

The first requirement examined was data storage. Data storage was a significant consideration because of the project's nature, generating large amounts of files from the project's inception. As more and more files accumulate in a directory, it can become quite costly due to the read/write overhead compared to a single larger file. Maintaining the original HTML files' integrity was made possible because of these considerations. However, the entire file does not benefit the end goal of the analysis. The data processor creates JSON objects from the original HTML files by extracting only the features required for analysis. This process saw a reduction of 167% on average (Fig. 7), yielding only the features expected to be worked with; the new JSON file is from here on referred to as the Darknet Data Object (DNDO). With this new DNDO providing the full flexibility of JSON, Elasticsearch can utilize some powerful indexing features that predefine data types for storage optimization. The decision to store the file locally during testing and index it into Elasticsearch 7.8 is possible but consumes more storage, which is discussed in further detail later.

MARKETPLACE     HTML AVG    DNDO AVG    REDUCTION
Elite Market    5.06        1.23        121.79%
Icarus          14.37       0.61        183.52%
Aesan           76.05       0.68        196.42%
TOTAL                                   167.24%

Fig. 7. HTML to JSON size reduction findings in KB

Second, the issue of data consistency and data redundancy. With every darknet marketplace containing different structures and data layouts, it was essential to consider data consistency for the cross-marketplace comparison. Without consistent data features, the cross-marketplace quantitative analysis would not be possible. The significant benefits of switching from HTML to JSON were extracting and removing all HTML structure, CSS and JavaScript imports, images, and repeating headers/footers. Reducing the redundancy and overhead allows the information to remain consistent across all marketplaces and allows missing values to be controlled and filled with expected values. It is also possible to extend the DNDO at any time and maintain previous analytics. Overall, the flexibility offered creates lower redundancy and higher consistency.

The third part of the processing pipeline requirements is the availability of the data. The entire lifespan, from collection to full-text search, and even data expiration, ensured the data was accessible in all forms. Managing data in lifecycles like this provided the freedom to experiment without concern about the loss of data. The worst-case scenario was re-processing and indexing the original HTML files, which at scale did not take more than a few minutes. However, paired with availability comes scalability. The decision to choose JSON was to ensure that the chosen format can adapt to any need during this project's future work. The primary concern in this vector was to avoid redundancy and index only the required information, to ensure the data was always fast to query. By utilizing Elasticsearch's ability to predefine the data types of incoming JSON files, it is possible to tune and maintain this speed compared to ingesting raw text.

CONSIDERATIONS          INDEX HTML    INDEX DNDO
Storage Requirements    High          Low
Data Consistency        Low           High
Data Redundancy         High          Low
Data Availability       Low           High
Data Scalability        Low           High

Fig. 8. Table of data considerations

Finally, after careful consideration of the benefits and downsides, extracting information from the raw HTML and converting it into specific, focused, and extendable JSON objects yields smaller objects that more efficiently handle the issues of data redundancy and consistency. JSON allows complete freedom to extend the object at any time and maintain previous analysis. The plethora of libraries and tools available to work with JSON makes it an obvious choice. By design, high process interoperability allows for a flexible shift and switch of the backend system should future marketplace analysis require that. Breaking up the system components into specific small services per marketplace allows the process to scale horizontally and vertically in the cloud with ease. Due to the nature of the process, cross-cloud interconnectivity is also possible so long as every cloud, private, public, or hybrid, has access to the same Elasticsearch cluster. Furthermore, it is possible to shift focus entirely and modify the DNDO to meet a specific industry need, altogether avoiding marketplaces if required.
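To illustrate the type predefinition mentioned above, a minimal sketch of an index mapping for the DNDO is shown below. The field names follow the object in Appendix F; the index name and the exact type choices are our assumptions, not the project's published configuration.

# Sketch: create a DNDO index with predefined data types instead of ingesting
# raw text. Index name and exact types are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

dndo_mapping = {
    "mappings": {
        "properties": {
            "title":             {"type": "text"},
            "seller":            {"type": "keyword"},
            "category":          {"type": "keyword"},
            "productClass":      {"type": "keyword"},
            "originCountry":     {"type": "keyword"},
            "price":             {"type": "float"},
            "views":             {"type": "integer"},
            "purchases":         {"type": "integer"},
            "creationDate":      {"type": "date"},
            "analyst_hasViewed": {"type": "boolean"},
            "analyst_flagged":   {"type": "boolean"},
            "analyst_notes":     {"type": "text"},
        }
    }
}

es.indices.create(index="market_asean", body=dndo_mapping)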

VI. DATA ANALYSIS

Throughout all the marketplaces we analyzed, the old saying of "Honor Among Thieves" holds true for darknet markets. Of all the marketplaces scraped, each had strict rules against the sale of guns, child sexual exploitation material, and murder for hire. Each marketplace observed specialized in one specific category, which strengthens its reputation among users.

The primary goal of data analysis was to validate the output of the data collection process. The secondary goal of data collection and analysis was to set the stage for larger projects such as machine learning and predictive analysis. This section outlines the data analysis, specifically text analysis, statistical analysis (what happened), and diagnostic analysis (why did it happen). Afterward, the data interpretation process covers the decision-making behind how we chose to express our results and the groundwork for visualization. Finally, Kibana and the custom data interface are the tools of choice to provide the data visualization methods used on the current data sets.

To make the most of the data, there was a high focus on the fields which contained non-null values across every darknet marketplace. The first field identified that met these criteria was productClass. The productClass field contained two values, physical or digital. Classifying the marketplace postings based on the end product form factor made sense to start with, since the two categories have very different risk profiles. Approximately 66% of all products collected were digital compared to 33% physical items. Further, the data collected identified that most marketplaces are highly targeted and tailored to specific categories of products.

Fig. 9. Product Class type: Digital (left) vs. Physical (right) products
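The digital/physical split reported above can be reproduced with a simple Elasticsearch terms aggregation over productClass. This is a sketch, assuming productClass is indexed as a keyword and the indices follow a lowercase market_* naming similar to the Market_asean index mentioned in Appendix D.

# Sketch: count digital vs. physical postings across all market indices.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query = {
    "size": 0,
    "aggs": {
        "product_class": {"terms": {"field": "productClass"}}
    }
}

result = es.search(index="market_*", body=query)
for bucket in result["aggregations"]["product_class"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])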

The ASEAN market data shows mostly digital goods instead of physical; however, EliteMarket shows the opposite. Interpreting the data collected for the ASEAN Market shows that the marketplace primarily focuses on digital goods. The top 5 users control 38% of the market posts, and 82% of the shipments can be worldwide. By creating a heatmap of the ASEAN data, it was clear that the market specialized in digital products. The Y-axis shows the top 5 categories, and the X-axis shows the origin country of shipping. As depicted in Fig. 10, most of the products are listed as worldwide, France, the United States, and Germany.

Fig. 10. ASEAN Category heatmap

EliteMarket shows similar statistics, with shipping being 95% worldwide for both origin and destination country; however, their focus is primarily on physical goods. The first concerning piece of information extracted is that the primary origin country is worldwide. The use of worldwide as an origin country is concerning for many reasons: firstly, the vendor could be concealing their origin country by listing their product's origin as worldwide instead of their actual origin. Secondly, the vendor may be truthful, which means their illicit drug supply chain is truly global, which makes apprehension difficult for any authority.

Fig. 11. EliteMarket Category heatmap

Fig. 11 shows Canada's primary exports as "[…] and Chemicals" and "Counterfeit items." The RCMP Supplementary Estimates for the 2019-2020 budget predict only 15 million Canadian dollars set aside for fighting cybercrime of all types, including the prevention of illicit drugs [25].

It is possible that a cross-market seller, DrunkDragon, exists. DrunkDragon appears to sell 48% of all digital products collected, as shown in Fig. 12. However, there is no way to validate this, as it could merely be an imposter trying to build immediate trust by using a well-known name.

Fig. 14. 100% of sales/trades via Escrow

Fig. 12. Top Sellers globally

There is very little price disparity for digital goods on darknet marketplaces due to the unlimited supply of the digital items sold. Some of the most common digital items seen include PDFs and video guides. The topics of these documents and videos range from social-engineering free products from companies to illicit drug production at home. The sheer abundance of guides on the same topics dilutes the market and drives prices to below 10 US Dollars (USD), as displayed in Fig. 15. The supply versus demand makes the information extremely available to even those with limited funds.

The darknet operates on a near-zero-trust model. Every transaction observed required escrow services. Escrow is a trusted third party that holds the funds and/or the item until the transaction has been completed by both parties, as shown in Fig. 13; this is standard procedure whenever an item is traded or sold on any darknet marketplace recorded. This service is usually pay-per-transaction, exclusively using cryptocurrency; it is either added to the item's total cost or handled entirely separately from the initial transaction.

Fig. 15. Price differences of digital goods

Fig. 13. Escrow Service flow

A cryptocurrency tumbler, also known as a mixing service, is typically used in the escrow process. It allows the user to pass the funds from one address through a chain of random Bitcoin addresses and then finally to a new address in control of the user. The randomization of Bitcoin transfers supposedly improves the anonymity of cryptocurrencies that are not as privacy-focused, such as Bitcoin. Cryptocurrencies such as Monero (XMR), among others, are popular alternatives to the popular cryptocurrency Bitcoin. They provide the anonymity of transactions needed for these marketplaces to flourish. Monero achieves this by using an obfuscated public ledger, which allows anyone to receive or send transactions, but no single user or entity can observe the source, destination, or amount. Hence, most marketplaces have been moving to privacy-focused forms of cryptocurrency to protect the buyer, seller, and escrow.

Quantities of items are a field almost every marketplace vendor has to advertise publicly; the most typical amount seen in the data collected is "Unlimited" or the maximum amount a vendor can enter in the field, usually "999" or "99". Maximizing the quantity works in the vendor's favor by helping disguise the amount of product the vendor owns. It is also more convenient for the vendor to enter the maximum value than the actual amount, as it appears they have a more extensive operation, and it is easier to set and ignore compared to updating stock. Therefore, this key-value pair is rendered useless from a data visualization standpoint, as it gives almost no insight into what a vendor has in stock, as shown in Fig. 16.

Fig. 16. Top 5 Quantities across marketplaces

VII. EXTERNAL INTERFACE PLATFORM POC

Although Kibana provides an intuitive and transparent way to display, search, and visualize the data collected, it cannot modify the stored data. Modifying the fields prepended with "analyst_" in the DNDO allows the user to track their research and thoughts on each record. The lack of data modification is why it was required to go beyond the Elastic Stack. Like Kibana, it is possible to leverage Elasticsearch's extensive API. It is possible to use a platform like Flask and the open-source Elasticsearch Python client package to create a custom third-party interface. Flask is a lightweight Web Server Gateway Interface (WSGI) web application framework designed to make getting started quick and easy, with the ability to scale up to complex applications [23]. The Elasticsearch Python library provides a wrapper to connect, create, update, and delete records quickly.

Introducing FlaskDash, a Python-based web application for analysts and researchers to manage, flag, save notes on, and create new insight into darknet marketplace analysis. In its current state, the platform is incomplete; however, it provides a base for future work to be extended upon in many ways. The groundwork for this platform and the minimum viable product was to host an instance of the Flask WSGI application connecting to Elasticsearch while supporting multiple-index and record navigation.

The ability to search for records was critical to the success of this application. The Elasticsearch connector made it possible to query the database for any of the provided fields. The "title" field provided the ability to search for common words across many records. Elasticsearch's ability to analyze the English language made it possible to search for a specific term such as "Account" and return variations, such as the pluralization of that word. It is also possible to configure filters to search on any field available in the DNDO, or to search across the entire message if unsure of the field.

Fig. 18. FlaskDash search functionality

The three buttons in the top right corner of every record are visual cues for quick status updates. These buttons include analyst_viewed, analyst_comments, and analyst_flagged. The analyst_viewed field is a Boolean data type designed to identify whether an analyst has viewed the record. If the analyst_viewed value is false, the button will remain grey, indicating the record is new or unread. The analyst_comment field data type is raw text. There is no upper limit on this field; however, by default, there is a built-in limit in the HTTP (chunk handling) layer that limits requests to 100 MB [24]. By leaving notes and analysis, other analysts can review or search on these notes to correlate a specific case if a case ID is left there. The final button is analyst_flagged and is used to identify records of importance quickly. A record should be flagged if it requires further analysis or is under current investigation. However, if these fields do not meet a team's needs, the JSON fields can be extended without corrupting the previous data, and new buttons created.

Fig. 19. FlaskDash analyst buttons
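A rough sketch of how the FlaskDash search and analyst-flag buttons could be wired to Elasticsearch with Flask and the Python client is shown below. The route names and index are assumptions; as noted in Section IX, the actual routes that post analyst data back to Elasticsearch are still incomplete.

# Sketch of two FlaskDash-style endpoints: full-text search on "title" and a
# toggle for analyst_flagged. Route and index names are illustrative.
from flask import Flask, jsonify, request
from elasticsearch import Elasticsearch

app = Flask(__name__)
es = Elasticsearch("http://localhost:9200")
INDEX = "market_asean"

@app.route("/search")
def search():
    term = request.args.get("q", "")
    body = {"query": {"match": {"title": term}}}
    hits = es.search(index=INDEX, body=body)["hits"]["hits"]
    return jsonify([{"id": h["_id"], **h["_source"]} for h in hits])

@app.route("/flag/<record_id>", methods=["POST"])
def flag(record_id):
    es.update(index=INDEX, id=record_id,
              body={"doc": {"analyst_flagged": True}})
    return jsonify({"id": record_id, "analyst_flagged": True})

if __name__ == "__main__":
    app.run(debug=True)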

Fig. 17. FlaskDash initial landing page overview

Although Kibana is a fully functioning dashboard and analyst platform itself, it cannot modify the data directly. FlaskDash, at this point, meets the minimal working product requirements, and there is much still to be done, as discussed later in future works. The goal of this platform was to test the viability of viewing DNDOs in an external interface and creating an additional hop between the analyst and the darknet itself. Retaining the same abilities to search and view records was successful, and having this air gap protects the end analyst and their machine from the direct risks of browsing the marketplace themselves.

VIII. CONCLUSIONS

In conclusion, the crawler, scraper, parser, and web application enabled the analysis of current and past marketplaces. The creation of a JSON-based Darknet Data Object encompassed features identified from cross-market analysis. The features extracted allowed the data visualization to display the real value of identifying emerging threats within the Darknet.

During the initial research, previous marketplace data provided by the GWERN Darknet Archives assisted in the organization of future data as well as templating valuable information within darknet marketplaces. However, the previous data proved to be of low analysis value due to changes in technology. Primarily, the intelligence value had expired and no longer provided useful context. Current darknet marketplaces offered a surprisingly consistent set of information. This information became the basis for the darknet data object. Having a scalable, consistent set of data allowed for easy cross-market analysis and targeted analysis of single marketplaces. A Canadian presence throughout every market was constant. To this extent, the first goal of the project was successful.

Physical products appeared to be more prevalent within Canada, although digital products made up approximately 66% of the 10,000 records collected. A significant issue with product attribution toward a specific country was the use of "worldwide" shipping. Most marketplaces contained about 2-3% Canadian origin and 90% worldwide, leaving a range of 2-90% Canadian origin possible. The dramatic difference left low confidence in accurate attribution. When targeting vendors, the vendor DrunkDragon claimed 46.59% of all sales (mostly digital). Looking at Canada only, AtomikBomB sells 70% of all physical products in Canada. However, AtomikBomB is not in the global top 10, attesting to the lack of a Canadian vendor presence within the Darknet. The gap between supply and demand opens up a potentially increased risk for Canadians participating in the exchange of illegal goods.

Overall, the Darknet data object proved valuable in many ways. Building upon data consistency, availability, and scalability, and removing redundancy, the cost per query and storage access decreased. The findings within the analysis provide useful context for Darknet operations. This context, paired with a qualified analyst, can be used to make strategic decisions. The open-source nature and low cost of entry into data mining the Darknet prove valuable to any organization wishing to venture into this landscape. Extending the technology stack with an external web application provides the opportunity to meet the needs of many. The Darknet, at first glance, appeared to be an endless pit of data. However, with the right tooling, the correct data, extracted and visualized, can tell an exciting story about what exists below the surface.

IX. FUTURE WORK

Many features had to be removed from the list during development due to a lack of time. It was essential to focus on primary functionality instead of quality-of-life features. Since the project is result-driven, the sacrifice of some automated functionality in favor of results built a substantial backlog of potential system extensions.

A more intelligent parser that allows the user to remove duplicates and modify values in the key pairs of the JSON object via regular expressions (RegEx) would significantly improve the accuracy and quality of the data parsed before it gets digested by the data visualizer. The creation of a more dynamic parser that loads configuration files would be significantly beneficial. The ability to write RegEx expressions to a configuration file to extract fields for the DNDO would save a considerable amount of time writing custom parsers for each medium. A more extensive parser would be required when expanding to new media types.

Automating the end-to-end flow of the pipeline would tremendously help when trying to visualize the data. As it stands, an analyst must manually provide the workflow with a registered account on the marketplace and a valid session token. Although an analyst will most likely still be required to register the account, the system should obtain a valid session token and bypass the login pages itself in the future. The rest of the pipeline, from crawling and scraping to visualization, can and should be automated to assure data accuracy.

CAPTCHA completion without human intervention, via proprietary software such as DeathByCaptcha, would significantly improve the rate-limiting issues met when working against various marketplaces. Having DeathByCaptcha also allows for faster scraping and crawling, since most darknet marketplaces prompt users with CAPTCHAs after accessing a certain number of posts, or when the rate of requests exceeds the limit set by the marketplace. Ultimately, matching the requests of a standard user slows the crawling process. However, to avoid delaying the crawling process, it is possible to multi-thread the application with multiple Tor circuits. By utilizing various Tor circuits, it would appear as if numerous unique users are visiting the marketplace as usual.

A goal of this project, had time allowed, was to have a completed analyst view, which would allow an analyst to triage, flag, and comment on items and events of interest based on data and keywords powered by Dark Net Data Objects (DNDO) fed into the ELK stack. However, it lacks the feature set and functionality that was initially envisioned for it. As it stands, the current investigation platform is a bare-bones example of what is possible. The first significant feature to get working is the analyst buttons. The current user interfaces support this functionality, but the appropriate routes to post data to Elasticsearch have yet to be completed. Enabling pagination to enhance the interface and searchability of the data is also a critical missing piece. Allowing user registration with role-based access control (RBAC) and single sign-on (SSO) would make the platform one step closer to enterprise-ready. User account control would also enable continued work on flagging, and a separate case management page to identify records in progress would improve workflow and usability.

Finally, enabling an enterprise license on Elasticsearch and configuring machine learning and Natural Language Processing would allow users to scrape marketplaces of all languages, translate them, and identify trends and anomalies. NLP would also open up opportunities to scrape new information outlets such as chatrooms, forums, and websites. Visualizing the crawler and parser trails could also prove useful; relationship visualization with technology such as a graph database like Neo4j may provide unique insight into the structure of the darknet. Graph databases could further enhance cross-marketplace analysis by linking adversary pseudonyms to other markets.

REFERENCES

[1] J. Martin and N. Christin, "Ethics in cryptomarket research," International Journal of Drug Policy, 2016. http://dx.doi.org/10.1016/j.drugpo.2016.05.006
[2] E. Nunes, A. Diab, A. Gunn, E. Marin, V. Mishra, V. Paliath, J. Robertson, J. Shakarian, A. Thart, and P. Shakarian, "Darknet and deepnet mining for proactive cybersecurity threat intelligence," Jul. 2016.
[3] H. Lawrence, A. Hughes, R. Tonic, and C. Zou, "D-Miner: A framework for mining, searching, visualizing, and alerting on darknet events," 2017.
[4] D. Hayes, F. Cappa, and J. Cardon, "A framework for more effective dark web marketplace investigations," May 2018.
[5] M. Dittus, J. Wright, and M. Graham, "Platform criminalism: The 'last-mile' geography of the darknet market supply chain," in WWW 2018: The 2018 Web Conference, April 23-27, 2018, Lyon, France. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3178876.3186094
[6] S. Mittal, A. Joshi, and T. Finin, "Cyber-All-Intel: An AI for security related threat intelligence," May 2019.
[7] G. Branwen, "Dark Net Market archives, 2011-2015," http://www.gwern.net/Black-market%20archives, July 2015. Accessed: 2020-06-01.
[8] M. Graczyk and K. Kinningham, "Automatic product categorization for anonymous marketplaces," date unknown.
[9] "Elasticsearch," 2020. [Online]. Available: https://github.com/elasticsearch/elasticsearch
[10] S. Mirtaheri, M. Dincturk, S. Hooshmand, G. Bochmann, and G. Jourdan, "A brief history of web crawlers," 2014.
[11] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Proceedings of the Seventh International Conference on World Wide Web (WWW7), Amsterdam, The Netherlands: Elsevier Science Publishers B. V., 1998, pp. 107-117. [Online]. Available: http://dl.acm.org/citation.cfm?id=297805.297827
[12] D. Karr, "What Is The Clear Web, Deep Web, And Dark Web?" Martech Zone, 2020. [Online]. [Accessed 27 June 2020].
[13] J. Broséus, D. Rhumorbarbe, M. Morelato, L. Staehli, and Q. Rossy, "A geographical analysis of trafficking on a popular darknet market," Forensic Science International. http://dx.doi.org/10.1016/j.forsciint.2017.05.021
[14] "Kibana," 2020. [Online]. Available: https://github.com/elastic/kibana
[15] "Maltego," 2020. [Online]. Available: https://www.maltego.com/product-features/
[16] "Tor Project," 2020. [Online]. Available: https://metrics.torproject.org/
[17] "Introduction to AppleScript Language Guide," 2020. [Online]. Available: https://developer.apple.com/library/archive/documentation/AppleScript/Conceptual/AppleScriptLangGuide/introduction/ASLR_intro.html
[18] "What is Python," 2020. [Online]. Available: https://docs.python.org/3/faq/general.html#what-is-python
[19] Centre for International Governance Innovation, internet survey 2019. [Online]. Available: https://www.cigionline.org/internet-survey-2019. Accessed July 8.
[20] "Dark Web and Its Impact in Online Anonymity and Privacy: A Critical Analysis and Review." [Online]. Available: https://file.scirp.org/pdf/JCC_2019031914453643.pdf
[21] C. Flick and R. Sandvik, "Tor and the Darknet: exploring the world of hidden services," 2013. 10.13140/2.1.2363.6809.
[22] European Union Agency for Law Enforcement Cooperation, "Cybercrime becoming bolder, data centre of crime scene," 2019. [Online]. Available: https://www.europol.europa.eu/newsroom/news/cybercrime-becoming-bolder-data-centre-of-crime-scene. Accessed July 13, 2020.
[23] "Flask," 2020. [Online]. Available: https://palletsprojects.com/p/flask/
[24] "Elasticsearch Community," 2020. [Online]. Available: https://discuss.elastic.co/t/size-limitations/4065/3
[25] "Supplementary Estimates A 2019-2020 – Appearance of the Minister of National Defence Before the Committee of the Whole," 2020. [Online]. Available: https://www.canada.ca/en/department-national-defence/corporate/reports-publications/proactive-disclosure/vac-estimates-budget/rcmp-estimates-budget.html

Appendix A

CALCULATION OF FILE SIZE REDUCTION

In order to calculate the file size reduction consistently, we utilized some online tools that were modeled after the following equation:
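For reference, the reduction values in Fig. 7 match the standard percent-difference formula:

reduction (%) = |v1 - v2| / ((v1 + v2) / 2) x 100

where v1 is the average HTML file size and v2 is the average DNDO file size. For example, for Elite Market: |5.06 - 1.23| / ((5.06 + 1.23) / 2) x 100 ≈ 121.8%, matching the 121.79% reported.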

The calculator is available online here. Values v1 and v2 were obtained with the following bash command, run inside the html folder and the json folder respectively, in order to collect the average file size of each file. The command works by continuously summing the file-size column to produce an average of the sums.

find ./ -ls | awk '{sum += $7; n++;} END {print sum/n;}'

SORTING FILES

Accurately pulling URLs that were associated with items in the marketplace required the Linux utility grep to output all URLs that were between a certain number of characters in length.

grep -E '^.{114,120}$' infile

Appendix B

Data Analysis: ASEAN Market

Table representing the top 5 sellers controlling 37% of market posts collected:

seller.keyword: Descending    Count
DrunkDragon                   1,114
GoldApple                       362
OnePiece                        268
TheShop                         258
PMS                             123

Figure showing the worldwide shipping destination.

Appendix C

Escrow, Pricing, and Quantity Issues

100% payment escrow – very good escrow; compare to banks' escrow, etc.
Most digital items are so cheap because they are easy to reproduce.

Appendix D

FlaskDash investigation platform

FlaskDash initial screen displaying some newly ingested records.

Example of the FlaskDash metadata inside the DNDO. FlaskDash supports multiple Elasticsearch indices via a dropdown menu in its interface.

The search functionality of FlaskDash using the search term "account" in the index "Market_asean".

Analyst Viewed, Comments, and Flagged fields represented as icons for quicker identification of already viewed alerts, displaying the color identification of a completed field.

Comments interface modal popup after clicking the comment item for the respective record. This interface provides a text area and a submit button to perform an HTTP POST to the Elasticsearch database.

Appendix E

Code Resources

All code resources required to duplicate the project can be found here: https://github.com/n0tj/Darknet-Stack

Appendix F

Miscellaneous items

Example JSON darknet data object populated with values extracted from the ASEAN market.

Python code example of the Darknet Data Object:

# Filename: DNDO.py
# Description: Dark Net Data Object (DNDO) - like features across markets for ES indexing
# Version: 1
# Date: June 29 2020
# Author: Edward Crowder + Jay Lansiquot

class post:
    def __init__(self):
        # Scraped marketplace fields
        self.title = None
        self.seller = None
        self.category = None
        self.creationDate = None
        self.url = None
        self.views = None
        self.purchases = None
        self.expire = None
        self.productClass = None
        self.originCountry = None
        self.shippingDestinations = None
        self.quantity = None
        self.payment = None
        self.price = None
        # Non-scraped (analyst) data
        self.analyst_hasViewed = None
        self.analyst_viewDate = None
        self.analyst_flagged = None
        self.analyst_notes = None
        self.analyst_closedDate = None
        self.analyst_dateCollected = None

    def Post(self, title, seller, category, creationDate, url, views,
             purchases, expire, productClass, originCountry,
             shippingDestinations, quantity, price, payment,
             analyst_hasViewed, analyst_viewDate, analyst_flagged,
             analyst_notes, analyst_closedDate, analyst_dateCollected):
        # Populate the scraped marketplace fields
        self.title = title
        self.seller = seller
        self.category = category
        self.creationDate = creationDate
        self.url = url
        self.views = views
        self.purchases = purchases
        self.expire = expire
        self.productClass = productClass
        self.originCountry = originCountry
        self.shippingDestinations = shippingDestinations
        self.quantity = quantity
        self.price = price
        self.payment = payment
        # Populate the non-scraped analyst metadata
        self.analyst_hasViewed = analyst_hasViewed
        self.analyst_flagged = analyst_flagged
        self.analyst_notes = analyst_notes
        self.analyst_viewDate = analyst_viewDate
        self.analyst_closedDate = analyst_closedDate
        self.analyst_dateCollected = analyst_dateCollected

    def toDict(self):
        # Print the object's fields as a dictionary
        print(self.__dict__)
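For completeness, a short sketch of how a populated DNDO could be pushed into Elasticsearch follows. All field values and the index name are made-up examples, not records from the data set.

# Sketch: populate a DNDO and index it; all values below are illustrative.
from elasticsearch import Elasticsearch
# from DNDO import post   # assuming the class above is saved as DNDO.py

p = post()
p.Post("Example digital guide", "ExampleSeller", "Guides", "2020-06-29",
       "http://exampleonionaddress.onion/item/1", 120, 15, None, "Digital",
       "Worldwide", ["Worldwide"], "999", 9.99, "Escrow",
       False, None, False, None, None, "2020-06-29")

es = Elasticsearch("http://localhost:9200")
es.index(index="market_asean", body=p.__dict__)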