
Masaryk University Faculty of Informatics

Automated Collection of Open Source Intelligence

Master’s Thesis

Bc. Ondřej Zoder

Brno, Fall 2020

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Ondřej Zoder

Advisor: RNDr. Lukáš Němec

Acknowledgements

I would like to thank my advisor RNDr. Lukáš Němec for his guidance throughout the entirety of this thesis. My thanks also go to RNDr. Martin Stehlík, Ph.D. who provided many valuable suggestions and helped shape the tool that is the outcome of this thesis. Huge appreciation goes to my family for the support they have given me during all the years of my studies. I also want to thank Cedric from CIRCL.LU for providing me free access to their Passive DNS and Passive SSL databases, and Gregory from Spyse for giving me a free trial for their port discovery service, which allowed me to extend Pantomath and further test the reliability estimation model.

Abstract

With the ever-growing amount of data available on the Internet and the widespread adoption of social media networks, publicly accessible websites have grown into a goldmine of valuable information about individuals and companies. Open Source Intelligence, shortly OSINT, is any information obtainable legally and ethically from publicly available sources addressing specific intelligence requirements. The relatively easy and cheap integration makes OSINT a practical solution for national security, cyber threat intelligence, and many other fields. This thesis presents a framework called Pantomath for the automated collection of OSINT that utilizes many existing tools and services. The framework is highly modular, provides all the functionality needed throughout the whole process of OSINT, offers three modes of operation for different anonymity requirements, and presents the data in a structured output. The reliability of some of the collected data is estimated to allow the user to analyze the data more efficiently and precisely. The framework is compared to existing OSINT automation tools, and the most notable advantages and disadvantages are discussed.

Keywords

OSINT, open-source intelligence, OSINT automation, military intelligence, social media intelligence, threat intelligence, Pantomath

Contents

1 Introduction

2 Open Source Intelligence
  2.1 Challenges
    2.1.1 Legal and Ethical Aspects
  2.2 Value and Use Cases
    2.2.1 Military Intelligence
    2.2.2 Cybersecurity
    2.2.3 Social and Business Intelligence
  2.3 State-of-the-Art
    2.3.1 Natural Language Processing
    2.3.2 Machine Learning

3 OSINT Sources and Tools
  3.1 Overview
  3.2 OSINT Automation
    3.2.1 Recon-ng
    3.2.2 Maltego
    3.2.3 SpiderFoot

4 Pantomath: Tool for Automated OSINT Collection
  4.1 Problem Statement
  4.2 Architecture and Functionality
    4.2.1 Base Framework
    4.2.2 Modes of Operation
    4.2.3 Modules
  4.3 Reliability Estimation
    4.3.1 Cyber Threat Intelligence
    4.3.2 Geolocation
    4.3.3 Port Discovery

5 Evaluation and Discussion
  5.1 Evaluation of Reliability Estimation
    5.1.1 Cyber Threat Intelligence
    5.1.2 Geolocation
    5.1.3 Port Discovery
  5.2 Comparison with Existing Tools
  5.3 Future Work

6 Conclusions

Bibliography

A Appendices

1 Introduction

With the exponential growth of the Internet in the last few decades, the amount of data stored around the world has become immeasurable. It is estimated that four of the biggest online companies, Amazon, Microsoft, Google, and Facebook, store at least 1.2 million terabytes of data. At first, data was thought of as a mere by-product of computing, but it has eventually grown into a product itself [1]. Companies sell their users' data to others that benefit from it, so collecting data of any value is essential for many. A large portion of Internet data is accessible to anyone with an Internet connection and often contains a lot of knowledge about individuals, companies, or governments. All this data is commonly called Open Source Intelligence, or shortly OSINT.

The value of OSINT is increasingly recognized in many different fields. According to [2], over 80% of the knowledge used for policymaking on a national level is derived from OSINT. Cyber threat intelligence heavily utilizes OSINT and combines it with data collected by security devices to evaluate possible threats to companies' infrastructures. All in all, publicly available sources constitute an irreplaceable source of knowledge. However, due to the immense amount of data on the Internet and its unstructured and heterogeneous nature, the collection and processing of OSINT is a challenging task requiring non-trivial methods. Arguably one of the biggest drawbacks of OSINT is the lack of mechanisms for verification of the collected information [3].

To make the whole process of OSINT easier and more accessible, various tools and services that provide useful information exist. These range from simple websites that provide basic information about IP addresses to more complex tools implementing state-of-the-art algorithms, such as Shodan [4]. A framework called Pantomath for the automated collection of OSINT is presented in this thesis. The framework utilizes existing tools and services that provide valuable information about Internet identifiers, such as IP addresses or domain names. As the number of these services is enormous, Pantomath was designed to make the integration of new sources more straightforward by moving the data collection to separate modules, which can be added by merely implementing a well-defined interface.

To address the user's anonymity requirements, Pantomath offers three modes of operation with varying guarantees and drawbacks. The overt mode represents regular operation, where all sources are used and an Internet connection is required. In the stealth mode, all requests sent to the Internet are proxied through the Tor network. The offline mode provides the highest guarantees for the user's anonymity, as only a database of preprocessed data is queried, and no Internet connection is needed. Pantomath also attempts to tackle possibly the biggest challenge of OSINT – the validation of the gathered data. A mathematical model for reliability estimation of the results is defined and used in several modules.

The thesis is organized as follows. Chapter 2 introduces OSINT, discusses some of the challenges, the value it provides, the fields where OSINT is often utilized, and a few state-of-the-art techniques that improve the efficiency of OSINT collection. Chapter 3 outlines the sources that can be used to gather the data and some tools that aim to automate this process. Pantomath, a tool for the automated collection of OSINT, is presented in Chapter 4. Chapter 5 evaluates Pantomath, compares it to tools with similar goals, and drafts possible extensions and improvements.

2 Open Source Intelligence

Intelligence is a process of information gathering for the purpose of providing a clear understanding of issues, allowing responsible people to make independent and impartial decisions [3]. Thomas Fingar [5] states that the primary purpose of intelligence is to reduce uncertainty about the intentions, capabilities, and actions of adversaries and allies. To be of any value, intelligence must be up-to-date, accurate, relevant, and verifiable. The goal of intelligence is not only to collect data but also to identify parts of the data that are valuable for the issue at hand, link them together, and evaluate them.

Open Source Intelligence (OSINT) is intelligence based on information that can be obtained legally and ethically from publicly available sources [6]. OSINT is considered to be the oldest form of intelligence gathering, with its earliest usage going as far back as the Second World War, where radio and print sources were used [7]. However, its utility increased significantly with the emergence of information technologies and the Internet in particular [8]. It is estimated that over 80% of the knowledge used for policymaking on a national level is derived from OSINT [2, 9].

OSINT is a broad term, and the exact definitions can vary depending on the field of study. The Office of the Director of National Intelligence of the U.S. [10] defines it as intelligence produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement. They state that the sources of OSINT include mass media, public data, gray literature, and observation and reporting. For the purpose of this thesis, the definition will be narrowed down to intelligence based on information openly accessible over the Internet. The Internet itself is not considered a source of OSINT, but rather a platform through which the sources are accessed.

There are borderline cases of sources that might not be regarded as part of OSINT by some definitions, e.g., any private information that was made public even though it was not the intention of the owner of that information. That might occur due to some error, e.g., a misconfiguration of the system containing this information, or because a third party published it. Examples of such sources are WikiLeaks [11] or any data leaks that are available on the Internet. This thesis considers this type of information as OSINT. For example, discovering a vulnerability in a system is recognized as OSINT, while exploiting this vulnerability to bypass the security of the system and gain some information from the inside is not.


Publicly available data can be divided into four different categories [12], as illustrated in Figure 2.1. Open Source Data (OSD) are any publicly available data that are not refined in any way, e.g., an image or raw social media data. Open Source Information (OSINF) are OSD that have undergone filtering, extraction of valuable information, and editing, for example, articles and results from search engine queries. Open Source Intelligence (OSINT) is a collection of OSINF that addresses a specific intelligence requirement, with Validated Open Source Intelligence (OSINT-V) going a step further by validating the OSINT using supporting information. Data from all these four categories can be used for OSINT. However, Open Source Data and Open Source Information require further assessment. Specific sources that can be used for OSINT are discussed in Chapter 3.

Figure 2.1: Description of Open Source Data, Open Source Information, and (Validated) Open Source Intelligence and how these categories differ in terms of the level of data transformation [12].


The cycle of intelligence starts with the identification of the intelligence requirements, the transformation of these requirements into specific queries, and the selection of sources and tools to be used for the data aggregation [12]. Once the data are collected, they are processed and converted into intelligence addressing the requirements. Any relevant information discovered during the previous steps can be used for further investigation by formulating new requirements and repeating the cycle.

Correct visualization of the collected data is just as crucial as the collection itself because it can help with the evaluation of the data. For example, when the goal of the investigation is to find as much information about an IP address as possible, the OSINT data may contain many other IP addresses that are related to the target, domain names acquired by a reverse DNS resolution, or e-mail addresses that sent unsolicited messages from the target IP address. Visualization of these results can provide a good overview of the structure and improve the understanding of the relationships between different elements, e.g., by discovering related social media accounts that are not directly connected [13].

2.1 Challenges

Due to the immense amount of data publicly available on the Internet, OSINT collection and analysis is a complex procedure. The data are also predominantly unstructured and very heterogeneous, and distinguishing between something of value and unrelated information requires a thorough evaluation. Various complex algorithms and methods are utilized to deal with the diversity and volume of OSINT data during all phases of the process. Machine learning can help with the classification and clustering of the results and with the extraction of additional information using natural language processing. Using de-anonymization, one can connect different identifiers, thus broadening the amount of information about a particular target. A technique called dimension reduction can decrease the amount of data that needs to be processed by extracting features from the data. Even though a large part of the Internet data is in English, valuable information can be found in other languages as well. Machine translation that can convert the meaning of more complex language structures, such as abbreviations or phrases, can increase the amount of information found during the investigation [14]. Some of the more innovative approaches to OSINT using these algorithms are described in Section 2.3.


Information retrieval is often compared to finding answers to questions [15], which means the formulation of the OSINT search queries in a simple and unambiguous way is of high importance. However, this is not always entirely achievable, and a particular query might result in different, often contradicting results. That is amplified by the fact that the amount of data available on the Internet is enormous, and the information is mostly unstructured and often false or misleading [16]. Indeed, one of the frequently cited disadvantages of OSINT is the lack of mechanisms for verification and evaluation of the collected information, which is especially true for information gathered via the Internet [3]. This proves to be an issue when information about possible cyber threats is collected from publicly available sources [17].

The quality of the results can be evaluated by establishing the credibility and independence of the sources of the information and by using multiple sources and comparing the results [18]. However, as stated before, having the correct answers to the question does not automatically render the information useful, as the question might not have been defined precisely. To find information that is relevant to the user's inquiry, context and query-specific knowledge are essential [19].

2.1.1 Legal and Ethical Aspects

Besides some of the technical challenges, OSINT also brings additional ethical and legal issues. Although the collection of OSINT should be by definition legal, since only publicly accessible data are considered, the line between ethical and ill-willed usage of the data is not completely clear [20]. Extremely sensitive personal information such as sexual orientation, religion, or political beliefs can be inferred even when these are not explicitly stated [21]. The combination of multiple sources and the use of state-of-the-art techniques that derive valuable information is what elevates mere data into a powerful tool.

According to some, the collection of OSINT is not much different from someone reading a newspaper, since the information is public in both cases. However, it is the institutionalization of OSINT that raises concerns even within the intelligence community [22]. As is apparent from leaks of classified documents such as those released by Edward Snowden, the mass surveillance performed by governments is not only focused on certain suspicious individuals but rather omnipresent. The intelligence agencies of the world's biggest economies, such as the US, have enough resources to find virtually anything the Internet has to offer about individual people [1].


The biggest issue that arises with such powerful knowledge is the potential harm to the targeted individuals [23]. The EU's General Data Protection Regulation (GDPR) and other emerging privacy regulations are an incentive for companies and individuals to handle data carefully to prevent any direct or indirect leakage of personal information [24]. According to GDPR, different pieces of information that together identify a specific person also constitute personal data. This creates a non-trivial task to handle for companies that sell pseudonymized data, as irreversible anonymization remains an unresolved issue. Having regulations such as GDPR only solves a part of the problem, since users can voluntarily publish their personal information. Nonetheless, the push for better privacy of Internet users could potentially decrease the value of OSINT.

Both GDPR and Privacy by Design [25], an essential concept from GDPR addressing users' privacy, are also partially applicable to OSINT (or any data collection process). By adhering to these principles, OSINT investigators can perform the tasks at hand as ethically and safely for the targeted individuals as possible in the given scenario. The principles can be summarized as follows [12]:

• minimize the amount of collected personal data

• use the data only for the specified purpose

• restrict access to the data to a pre-defined set of people

• delete the data once it serves the purpose

These principles cannot be fully embedded in OSINT platforms in a legally binding manner [26]. However, legal and ethical safeguards can be built in to allow the users to determine to what extent they comply with GDPR and the Privacy by Design approach. These safeguards could be strengthened by using a markup language that would allow the users to specify access control policies, data removal enforcement, and other privacy requirements in an automated manner. Casanovas [27] defines a regulatory model that applies Privacy by Design principles to OSINT investigations. This model mainly focuses on the analyses performed by governmental agencies and aims to set appropriate legal boundaries. Rajamäki and Simola [28] explore the necessary extensions and changes that would need to be carried out in an existing maritime surveillance project to implement a Privacy by Design architecture.

2.2 Value and Use Cases

Although there are some challenges when utilizing OSINT, the sheer amount of information available on the Internet makes OSINT a viable solution for national security, cyber threat intelligence, and many other fields. The incremental improvements in the performance of computers throughout the past decades have made OSINT more and more relevant. However, it was the emergence of big data [29] and machine learning that made huge amounts of data available at a much more rapid pace and with more valuable information extracted from it. These are described later in Section 2.3.

With the widespread adoption of social media networks, the availability of private personal information has increased considerably. Combined with the fact that many users are unaware that large portions of the information they share could be accessible to anyone, social media in particular are a goldmine of OSINT [30]. Data from these websites can be used for research and other business-related use cases, but also for malicious activities, such as phishing [31]. Companies also need to be aware of the information that can be found about them publicly, as possession of a complete collection of such information could give a potential attacker a lot of valuable knowledge as to how the company could be exploited.

The dark web offers strong anonymity and privacy guarantees for users that wish to participate in illegal activities [32]. This fact alone makes the dark web a very fruitful source of information, especially for government agencies fighting against crime. The strong anonymity comes hand in hand with the necessity to use specialized software to collect data from the dark web [33], such as search engines and web crawlers that are able to index hidden services. A framework called BlackWidow proposed by Schafer et al. [34] brings together various tools for collection and analysis of the content to gather information related to cybersecurity and fraud monitoring. There is a body of research focusing on the detection of activities of international terrorist groups on the dark web and the education of agencies fighting against these groups [35].

As shown in Figure 2.2, most of the use cases of OSINT can be divided into three main categories – detection of organized crime, cybersecurity, and social media and sentiment analysis. Many niche use cases do not necessarily fall within any of these categories, such as business intelligence used by companies to research potential markets for their products or cybersecurity from the attacker's perspective. Therefore, this thesis generalizes these categories to military intelligence, cybersecurity from the viewpoint of both attackers and defenders, and social and business intelligence.


Figure 2.2: The three main types of use cases for OSINT [36].

2.2.1 Military Intelligence

With the widespread adoption of social media networks, discussion forums, and other forms of communication over the Internet by the general population, criminal and terrorist organizations have migrated a significant part of their communication there as well. Even though many instant messaging applications provide end-to-end encryption, open forms of communication remain an attractive medium. Additionally, the dark web is often used as a marketplace for illegal goods such as drugs, guns, or even hitman services. Besides the monitoring of criminal groups' activities, governments can utilize OSINT to detect a growing discontent of the population, the emergence of new political movements, and other indicators of possible threats.

For military purposes, the collection of OSINT is generally more straightforward, less expensive, and safer than the gathering of intelligence from covert sources [37]. Governments increasingly recognize these advantages, and OSINT is used in combination with established forms of intelligence. The benefits of the fusion of OSINT into existing sources include the initial establishment of an intelligence objective, validation of information obtained from covert sources, expansion of the existing knowledge, and replacement of the same knowledge found in covert sources to protect these sources when presenting evidence [38].

The particular uses of OSINT for military intelligence and fighting against organized crime are varied. Scrivens et al. [39] combine data gathered by a web crawler specializing in extremist content and a novel sentiment analysis tool. They conclude that sentiment analysis might significantly improve the detection of extremism on public websites. Susnea [40] suggests that unexpected events such as natural disasters are usually followed by many social media posts, including images, video recordings, and detailed information, providing governments with better insights into what is happening. Ball [41] proposes the utilization of automatic analysis of social networks for the prevention of potential terrorist attacks. Dawson et al. [42] analyze how OSINT tools can link various Twitter posts to the African terrorist group Boko Haram.

2.2.2 Cybersecurity

The world of OSINT provides a lot of valuable knowledge for companies concerned about the security of their infrastructures, such as information about the ever-evolving landscape of cybersecurity threats, details about new vulnerabilities that are constantly getting discovered, or reports about recent security incidents. All these pieces of information constitute a goldmine of intelligence that anyone can easily and freely utilize. Just as companies can employ OSINT to get a better understanding of potential threats, the attackers might use the publicly available information to explore the Internet presence of the companies, find any possible loopholes for exploitation, or shape a strategy for a phishing campaign [43].

Chapter 3 discusses some existing tools suitable for these tasks, for example, tools that determine what software a website is using, which ports are open on a particular IP address including the services running there, or people and e-mail addresses associated with a company. Edwards et al. [44] demonstrated that a large-scale collection of information required for a social engineering campaign could be carried out completely automatically with no active communication with the targets. They did so by gathering contact information of all employees publicly affiliated with a company, tracking down other employees through social media networks, obtaining their personal information, and so forth.

Hayes and Cappa [45] performed a thorough evaluation of the critical infrastructure of a company operating in the U.S. electrical grid using only publicly available data. They created a complete overview of the infrastructure, including specifics about the hardware and software used on various machines, outlined potential vulnerabilities within the infrastructure, and discovered the company's employees, including their e-mail addresses. The authors conclude that a continuous collection of OSINT targeted at a particular company could provide attackers with powerful knowledge, and companies should pay more attention to what information about them is publicly accessible. Cartagena et al. [46] performed a similar analysis and were able to achieve comparable results.

Tanaka and Kashima [47] propose a URL blacklist based solely on OSINT, and they show that 75% of the blacklist's values are unknown to Google Safe Browsing. Additionally, 23% of the malware used in these URLs is also unknown. Quick and Choo [48] incorporate OSINT in digital forensic analysis to add value to the data and aid with timely extraction of required evidence. Vacas et al. [49] propose an automated approach for the collection of new knowledge from OSINT for detection rules in intrusion detection systems. The method was tested on real-world network traffic, proving it can detect malicious activities within a network. Lee et al. [50] combine OSINT with events detected by security devices to improve the knowledge of potential threats.

2.2.3 Social and Business Intelligence

How the general population feels about specific topics has always been an important aspect when designing campaigns, deciding how a product should be built to suit its users' needs, or even formulating a manifesto before an election. With the recent advent of sentiment analysis algorithms that can evaluate users' opinions just from their posts, OSINT has become an attractive source of information for social and business intelligence. Data from social media networks, discussion forums, and other websites where users often share opinions on different topics can be collected and evaluated to get a grasp of the overall public opinion [51].

Neri et al. [52] performed a sentiment analysis of 1000 news articles related to a public scandal of the former Italian Prime Minister Silvio Berlusconi. The primary goal was to detect whether there was a coordinated press campaign by evaluating the time and geographical distribution of the articles and the proportion of positive and negative opinions. Fleisher [53] lays down how OSINT affects competitive and marketing intelligence, details the biggest challenges when utilizing it, and outlines the best practices for successful utilization of public sources.

2.3 State-of-the-Art

Performing OSINT is a challenging task and requires a systematic approach and advanced technology. Just like in other fields that need to manage large volumes of data with no pre-defined structure, various techniques can be utilized to make the process more efficient. This section describes some of the innovative approaches utilizing sophisticated algorithms for data aggregation, extraction of valuable information, classification of the results, and other aspects of OSINT.

Magalhães and Magalhães [54] present an OSINT tool named TExtractor that extracts text from audio and video and searches for valuable information. Specifically, the tool tries to detect keywords associated with cyber attacks. The text is transcribed from the source by using speech recognition tools, translated into English, and inspected to detect the specified keywords. The accuracy of TExtractor when detecting audio and video referencing cyber attacks is between 60% and 70%.

Maciolek and Dobrowolski [55] present a system for aggregation and analysis of OSINT based on modern data processing approaches. The system enables ad hoc queries for the collection of OSINT for a particular target by utilizing Big Data techniques, specifically a MapReduce model. Multiple data collectors, such as a web crawler and a relational databases collector, are implemented, and new extensions can be included using the universal REST interface. Once the data are gathered, the content is extracted, tagged, classified, and possibly translated to English from other languages.

Gong et al. [56] propose a model for reliability estimation of CTI feeds. The data from each feed are normalized to have the same format. The main features, such as the risk level of a network resource or the number of IP addresses associated with a specific attack, are extracted, and categorical values are transformed into numerical values. Using these numerical features, all feeds are compared to each other, and the average of the differences to other feeds is used as the independence factor of each feed. The average of all data collected from the feeds for various network resources (IP addresses, domains, and file hashes) is used as the expected value, and the error of each feed is computed as the deviation from the expected value. Finally, the reliability of the CTI feed is determined by its independence and error and is continuously updated with new values to reflect the current situation.
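A minimal sketch of how such a scheme could be expressed in Python is shown below. The weighting of independence against error, the feature normalization, and the example data are assumptions made purely for illustration; the sketch does not reproduce the exact formulas from [56].

```python
import numpy as np

def feed_reliability(feeds: dict) -> dict:
    """Toy reliability score per CTI feed.

    `feeds` maps a feed name to a vector of numerical feature values,
    one entry per shared network resource (IP address, domain, file hash).
    """
    names = list(feeds)
    matrix = np.vstack([feeds[n] for n in names])   # feeds x resources
    consensus = matrix.mean(axis=0)                 # expected value per resource

    scores = {}
    for i, name in enumerate(names):
        others = np.delete(matrix, i, axis=0)
        # independence: average distance to the other feeds
        independence = np.mean(np.abs(matrix[i] - others))
        # error: deviation from the consensus (expected) value
        error = np.mean(np.abs(matrix[i] - consensus))
        # illustrative combination: reliable feeds are independent yet close to consensus
        scores[name] = independence / (1.0 + error)
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    example_feeds = {f"feed{i}": rng.random(100) for i in range(4)}
    print(feed_reliability(example_feeds))
```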

2.3.1 Natural Language Processing

Natural Language Processing (NLP) is an interdisciplinary research area including computer science, linguistics, machine learning, and statistics. The primary goal of NLP is to process natural language data in order to analyze and understand the characteristics and meaning of both text and speech [57]. Algorithms using NLP have been adopted by a range of software dealing with large amounts of natural language data, including search engines such as Google [58] or voice assistants such as Apple Siri [59], and NLP is the basic building block of many advanced solutions for OSINT.

Noubours et al. [60] argue that NLP is a necessity for effective OSINT data aggregation and analysis. Their framework aims to help the human analyst with data acquisition, management, and analysis by using various NLP techniques. The data are collected by both manual web search and NLP-aided web crawling starting from a pre-defined set of URLs. All the data are then processed into a structured output for more effective data manipulation and search queries. The analysis itself is done manually by an OSINT investigator who can utilize advanced search queries, content-based information filtering, text classification, machine translation, data visualization, and generation of reports.

Social media account identifiers, such as usernames and e-mail addresses, can be easily manipulated so that two different accounts cannot be linked together. Different supplementary mechanisms can be used together with direct linkage using identifiers to increase the likelihood of successful account correlation. Layton et al. [61] examine how an authorship analysis can help to link social media accounts without direct evidence. They design a method that takes pieces of text created by accounts with different identifiers, profiles them, and determines whether they are related based on an empirical threshold. The authorship analysis has an accuracy of 84%, which goes up to 90% when using the computed threshold.
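The following toy sketch illustrates the general idea of threshold-based authorship linkage using character n-gram profiles and cosine similarity. It is not the profiling method of Layton et al.; the vectorizer settings, the threshold value, and the example texts are arbitrary choices made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def likely_same_author(texts_a, texts_b, threshold=0.6):
    """Toy authorship linkage: compare character n-gram profiles of two accounts."""
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    profiles = vectorizer.fit_transform([" ".join(texts_a), " ".join(texts_b)])
    similarity = cosine_similarity(profiles[0], profiles[1])[0, 0]
    return similarity, similarity >= threshold

score, linked = likely_same_author(
    ["heading to the gig later, who else is in?"],
    ["who else is heading to the gig later?"],
)
print(f"similarity={score:.2f}, linked={linked}")
```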

Li et al. [62] propose a framework for the analysis of unstructured text to gather cyber-threat-related information. The framework utilizes NLP to extract cybersecurity event data from the infrastructure of a company, articles about cybersecurity, and Common Vulnerabilities and Exposures (CVE) entries. These are combined to create profiles of potential threat actors, the methods they might use, and their possible targets. The collected data are used to train a machine learning model, which achieved a precision of 85%.

Ganino et al. [63] explore the role ontologies might play in the interpretation of data collected from unstructured sources on the Internet. Ontologies [64] describe an application domain by defining its terms, identifying possible relationships, and establishing constraints. The design of ontologies is a complex procedure and cannot be completely automated. Ganino et al. assume the availability of ontologies that are already constructed. They focus on the population of ontologies with data collected from the Internet and the utilization of this technology for more efficient analysis of unstructured OSINT data.

Serrano et al. [65] propose a knowledge discovery system that combines techniques for information extraction and aggregation. The system merges data retrieved by pattern mining and ontologies and connects different pieces of information to create more comprehensive knowledge. They test their system on the Global Terrorism Database [66] and show that systems using both pattern mining and ontologies can achieve better results than systems using only one of these techniques.

2.3.2 Machine Learning

Machine learning (ML) is a subfield of artificial intelligence studying algorithms that build a model based on a so-called training data set and identify patterns or make predictions with little to no human intervention [67]. There are three paradigms in ML that differ in the type of information they provide and the way the model is trained. Supervised learning uses a data set with known attributes to train the model for the classification of data, unsupervised learning can find patterns or relationships in the data, and reinforcement learning decides on future actions of software agents in an environment based on the state of the environment. ML is utilized across various different applications, such as recommendation systems, autonomous vehicle control, speech recognition, or computer vision [68].

Pellet et al. [69] propose a method for the localization of social network users by combining OSINT and ML. Firstly, data from Facebook, Twitter, and Instagram are collected using the APIs of these websites. Certain features are extracted from the data, namely geolocations (if available), IP addresses, usernames, and relationships between different users, which are then transformed into social graphs. The relationships are determined based on users tagging each other in posts and other direct public interactions. The geolocation resolution starts with seeds found in the data. These include geolocations pinned directly in the posts, places mentioned in the posts themselves, and geolocations found by CLAVIN (Cartographic Location And Vicinity INdexer) [70], a tool that uses NLP and ML. The combination of user relationships and geolocation seeds is used to determine past, present, and possibly future physical locations of the users. The achieved accuracy is over 77%.
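The sketch below is a deliberately simplified illustration of combining location seeds with a social graph, not the algorithm from [69]: users without a known location simply inherit the most common location among their neighbors. The graph, the seeds, and the networkx-based propagation are invented for the example.

```python
from collections import Counter
import networkx as nx

# Toy social graph: edges represent public interactions between accounts
graph = nx.Graph()
graph.add_edges_from([("alice", "bob"), ("bob", "carol"), ("carol", "dave")])

# Location seeds extracted from geotags or place names in posts
seeds = {"alice": "Brno", "carol": "Brno", "dave": "Vienna"}

def infer_locations(graph: nx.Graph, seeds: dict) -> dict:
    """Assign each unseeded user the most common location among its neighbors."""
    inferred = dict(seeds)
    for user in graph.nodes:
        if user in inferred:
            continue
        neighbor_locations = [seeds[n] for n in graph.neighbors(user) if n in seeds]
        if neighbor_locations:
            inferred[user] = Counter(neighbor_locations).most_common(1)[0][0]
    return inferred

print(infer_locations(graph, seeds))  # bob is inferred to be in Brno
```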

Ranade et al. [71] utilize deep learning for the translation of CTI data from different languages to English. The translation is optimized specifically for cybersecurity-related data by creating translation mappings for all keywords that are associated with cybersecurity. Data collected from public sources such as Twitter that are in other languages are preprocessed using NLP and a translation framework that adopts deep learning, and the translation mappings translate all relevant parts of the data. Once the data are translated to English, they can be fed to CTI systems, thus broadening the amount of gained CTI. The system focuses on Russian but can be extended for other languages as well by simply creating mappings for the respective language.

Alves et al. [72] present a framework for the collection and classification of Twitter posts. The main goal is to gather security-related information from these posts and provide them to Security Information and Event Management (SIEM) systems, which employ event data to handle the security management of an organization. The premise is that many security experts use Twitter to post short messages about security-related news in near real-time. These messages are normalized, the features are extracted, and the messages are classified by a supervised ML model and clustered. Each cluster is analyzed using named entity recognizers [73], and the crucial components (attack, vector, target) are retrieved.

Deliu et al. [74] investigate how ML and Neural Networks may help with the collection of CTI from hacker forums and other social platforms. Specifically, they employ supervised ML and Convolutional Neural Networks (CNN) [75] to classify the posts from these platforms. The results show that the ML classifier performs at least as well as the CNN one, with the accuracy of both being approximately 98%. As CNN classifiers tend to be rather complex and expensive when used in practical scenarios, having ML classifiers with the same accuracy might increase the chances of companies utilizing classifiers for CTI.

Mittal et al. [76] present a system for extraction and analysis of cybersecurity information collected from multiple sources, including national vulnerability databases, social networks, and dark web vulnerability markets. The system uses NLP to remove unnecessary parts of the textual data and extract parts of the data related to cybersecurity. The preprocessed data are stored in complex structures representing the information about different entities and their relationships. The knowledge is proactively improved by utilizing supervised ML and deep learning. The system is able to alert the user based on some pre-defined rules and enables complex search queries.

3 OSINT Sources and Tools

This chapter introduces various different sources that can be used to obtain data for OSINT. These are described in Section 3.1, ranging from simple tools giving some specific information about the queried keyword, such as the geolocation of an IP address or a list of accounts associated with a username, to more sophisticated software that employs non-trivial algorithms, such as Shodan [4] or Darksearch [77]. Tools for OSINT automation also exist, utilizing many available sources and using the results to search for additional information. A few of these are described later in Section 3.2.

There are books and articles discussing various OSINT tools and practices in more detail. Chauhan and Panda [78] explain all the theory behind OSINT and describe some of the tools from this chapter in more detail, including instructions on how to use them. Revell et al. [79] establish a framework for the assessment of OSINT tools and best practices for their usage. The OSINT Handbook [80] provides an exhaustive list of all available OSINT tools and resources.

Many websites aim to make the process of OSINT more organized and methodical. The OSINT Framework [81] (illustrated in Figure 3.1) provides a broad overview of OSINT-related tools that are either completely free or offer a limited usage for free. Your OSINT Graphical Analyzer (YOGA) [82] is a simple flowchart showing what a piece of information can be transformed into or used for, for example, how an IP address can be used to find other relevant data, such as a domain name. OSINT Open Source Intelligence Framework [83] is similar to the OSINT Framework but goes a bit further by adding educational resources, listing notable companies and people who contribute to the OSINT realm, and much more.

3.1 Overview

Real Names Gathering information about people using their real names depends on their country of origin, the uniqueness of their name, and knowledge of other related information, such as an address or date of birth. Many countries keep records of various public information, including property ownership, criminal activity, weddings, births, and deaths. Each country has different rules and laws for what is considered public information and what is kept secret. There are tools aiming to automatically collect data from many public sources to find as much information about an individual as possible. Pipl [84] is an identity resolution engine that tracks online identity information, uncovers associations between different people, and more. Ancestry [85] is a popular service that allows users to find information about their ancestors by specifying names, addresses, and locations of themselves and their relatives. As there are no mechanisms to check whether the person looking for the information is related to the people of interest, it can be misused for intelligence purposes.

Figure 3.1: The OSINT Framework website [81] showing a structured view of free OSINT-related tools. These are divided into categories based on the type of information they provide.

E-mail Addresses and Usernames E-mail addresses can be verified using MailTester [86], which performs a series of checks. HaveIBeenPwned [87] collects data breaches and allows users to check whether an e-mail address and the password used with this address were part of a leak. CheckUserNames [88] is a service that checks over 170 different social networks to find out whether an account with a specified username exists. WhatsMyName [89] provides a list of all the data required to perform this enumeration on these social networks directly. However, search results about usernames highly depend on the uniqueness of the username, as different people often use the same username. This means that accounts with the same username across different social media services might not be used by the same person and therefore are not related. PhoneInfoga [90] collects international phone number information, such as country, area, carrier, and line type.
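A minimal sketch of this kind of username enumeration is shown below. It assumes that a site returns HTTP 200 for existing profiles and 404 otherwise, which holds for many but not all services, and the two profile URL patterns are merely examples rather than data taken from WhatsMyName.

```python
import requests

# Example profile URL patterns; real enumeration tools ship much larger,
# curated lists (e.g., the WhatsMyName data set).
SITES = {
    "github": "https://github.com/{username}",
    "reddit": "https://www.reddit.com/user/{username}",
}

def enumerate_username(username: str) -> dict:
    """Check which sites appear to have an account with the given username."""
    found = {}
    for site, pattern in SITES.items():
        response = requests.get(
            pattern.format(username=username),
            headers={"User-Agent": "osint-example"},
            timeout=10,
        )
        # Assumption: 200 means the profile exists, 404 means it does not.
        found[site] = response.status_code == 200
    return found

print(enumerate_username("example-user"))
```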

Social Media People often share a lot of personal information on social media, so utilizing these services for OSINT data collection can provide a lot of valuable information. Facebook and Instagram enable users to make their information and posts accessible only to a selected group of people, but many users choose to make them public. Twitter is often used by professionals, political activists, and other thought leaders, and analysis of Twitter posts can reveal the public opinion about various issues. LinkedIn is a social network for business relationships and contains information about the education, employment history, and skills of its users. Additionally, users often share their phone numbers, e-mail addresses, and other contact information. All of these services provide an API for access to the content [91, 92, 93], as sketched below.

There are also third-party services that can help with social media information collection and analysis. Tinfoleak [94] analyzes Twitter accounts and provides information about the users, such as devices and operating systems used by the user, geolocation of the posts, and other users mentioned in the posts. Social Searcher [95] monitors all public social media posts for mentions of a specific keyword, such as a name of a person, company, or product, and finds social media accounts of the specified person.
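As an example, the sketch below queries Twitter's v2 recent search endpoint with the requests library. The bearer token is a placeholder, and the available endpoints, access tiers, and rate limits have changed over time, so the exact URL and parameters should be checked against the current API documentation.

```python
import requests

BEARER_TOKEN = "..."  # obtained from the Twitter developer portal

def search_recent_tweets(query: str, max_results: int = 10) -> list:
    """Fetch recent tweets matching a search query via the v2 API."""
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"query": query, "max_results": max_results},
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("data", [])

for tweet in search_recent_tweets("from:exampleaccount"):
    print(tweet["id"], tweet["text"])
```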

IP Addresses and Domains When looking for information about a specific IP address or domain, various services provide plenty of useful data. Websites such as DNSlytics [96] and IPinfo [97] give a comprehensive list of information about the queried IP address, including geolocation, reverse DNS data, ASN info, related domains, and much more. IKnowWhatYouDownload [98] discloses all torrent files associated with an IP address. Domain names generally yield a lot of information, such as associated IP addresses, subdomains, other related domains, details of the registrar, contact persons, and so forth. Crt.sh [99] is a database of all publicly issued certificates; it can find all past certificates of a domain. Urlscan.io [100] browses a submitted URL, records the activities happening during this process (for instance, visited domains and IP addresses), saves the resources of these domains, takes a screenshot, and so forth.
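The sketch below queries two of these services that expose simple JSON interfaces: ipinfo.io for basic information about an IP address and crt.sh for certificates issued for a domain. The response fields accessed in the example are assumptions based on the services' typical output and may differ or change.

```python
import requests

def ip_info(ip: str) -> dict:
    """Basic IP information (geolocation, ASN, hostname) from ipinfo.io."""
    return requests.get(f"https://ipinfo.io/{ip}/json", timeout=10).json()

def certificates(domain: str) -> list:
    """Certificates logged for a domain (and its subdomains) according to crt.sh."""
    response = requests.get(
        "https://crt.sh/", params={"q": f"%.{domain}", "output": "json"}, timeout=30
    )
    return response.json()

print(ip_info("8.8.8.8").get("org"))
for cert in certificates("example.com")[:5]:
    print(cert.get("name_value"), cert.get("not_after"))
```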

Blacklists The reputation of an IP address, domain, or e-mail address is often valuable during intelligence collection. A blacklist is a widespread mechanism that keeps track of elements known to perform malicious activities, such as spamming e-mail addresses, URL addresses containing spam or set up for phishing, and IP addresses associated with botnet activities. Many different companies operate their own blacklists with different criteria for adding and removing elements. SpamCop [101], SURBL [102], and SORBS [103] are established DNS-based blacklists of IP addresses and websites transmitting unsolicited messages. The primary purpose of this type of blacklist is to be used by mail servers and accessed through DNS. There are also many blacklists accessible in a simple plain-text format. A few notable examples include the Abuse.ch SSL Blacklist [104] containing SSL certificates used by botnet C&C servers, the Malware Domain Blocklist [105] listing domains that are known to be used to propagate malware and spyware, and FireHOL IP Lists [106], which analyzes many different IP blacklists to provide a compound blacklist for various systems.

Cyber Threat Intelligence Section 2.2.2 outlines how OSINT can be used to aid the collection of data about potential cybersecurity threats and threat actors. Various companies make this data publicly accessible in the form of cyber threat intelligence (CTI) feeds. CTI feeds can be used by companies to build defense mechanisms for their infrastructures and mitigate the risks. CTI feeds are essentially an advanced type of blacklist; therefore, they can be used as a source of intelligence as well. Open Threat Exchange (OTX) [107] by AlienVault is a platform for the security community to share, validate, and research the latest threat data. OTX can be integrated into many existing tools via an API or queried directly through the ThreatCrowd [108] website. MetaDefender [109] and FortiGuard [110] are some of the other threat intelligence services.
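As an illustration, OTX can be queried over its REST API; the endpoint path and the response fields used below follow OTX's public documentation as understood at the time of writing and should be verified against the current version, and the API key is a placeholder.

```python
import requests

OTX_API_KEY = "..."  # personal key from the OTX web interface

def otx_ip_report(ip: str) -> dict:
    """General OTX information about an IPv4 indicator (e.g., related pulses)."""
    response = requests.get(
        f"https://otx.alienvault.com/api/v1/indicators/IPv4/{ip}/general",
        headers={"X-OTX-API-KEY": OTX_API_KEY},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

report = otx_ip_report("8.8.8.8")
print(report.get("pulse_info", {}).get("count"))
```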

Reconnaissance A significant category of software that can be used for OSINT is reconnaissance tools. BuiltWith [111] analyzes the technology stack of a website. Spyse [112] offers a pack of tools for network scanning, DNS query validation, IP whois lookup, information about a domain, and many more. Censys [113] aims to provide a complete network monitoring solution for companies that want to prevent exposure of their assets. Their database is available to the public through a full-text search engine with a limited number of queries.

Shodan [4] is a popular full-text search engine that discovers all devices directly accessible through the Internet, checks which ports are open, and identifies what kind of service runs there. All the data are kept in a database and continuously updated. Knowing the name and version of the software running on some device connected to the Internet might be valuable for both a potential attacker and a user with honest intentions. Nmap [114] is a tool used to discover hosts and services of a network by analyzing their responses. MASSCAN [115] is similar to Nmap but scans the whole Internet. Unlike Shodan, Nmap and MASSCAN are not services providing a database of different Internet devices that were already analyzed but perform the investigation in real time.
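A minimal sketch of querying Shodan's host lookup through its official Python library (installed with pip install shodan) is shown below; the API key is a placeholder, and the fields read from the response are only a small subset of what Shodan returns.

```python
import shodan

SHODAN_API_KEY = "..."  # personal key from the Shodan account page

def open_ports(ip: str) -> list:
    """Return (port, product) pairs that Shodan has indexed for a host."""
    api = shodan.Shodan(SHODAN_API_KEY)
    host = api.host(ip)
    return [(service["port"], service.get("product", "unknown"))
            for service in host.get("data", [])]

for port, product in open_ports("8.8.8.8"):
    print(f"{port}/tcp {product}")
```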

Search Engines Although created for a different purpose, search engines like Google [116], Bing [117], and DuckDuckGo [118] have an API for custom search queries that can be used for OSINT. Google Dorking [119] is a technique using advanced features of the Google search engine to find information that is otherwise not presented by the website and does not appear in a basic search query. For example, by using the operator intitle, it is possible to find all websites using a specific version of the software that is known to have a vulnerability.

There are also many specialized search engines. Ahmia [120], Torch [121], and Onion Search Engine [122] are popular services for indexing of the Deep Web (.onion sites). Darksearch [77] aims to go even further by searching through the dark web and accessing black markets, restricted sites, and illegal content in general. Some search engines, for instance Kilos [123], are designed to be used for illegal activities in the first place. The primary purpose of Kilos is to search the black markets to buy drugs, guns, counterfeit documents, etc. Another noteworthy category is source code search engines, such as PublicWWW [124] that explores the source code of websites, or SearchCode [125] that searches public source code in general. Fagan Finder [126] provides an interface allowing users to query one of many different search engines with redirection to the results.
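For instance, a dork-style query can be issued programmatically through Google's Custom Search JSON API, as sketched below. The API key, the search engine identifier, and the example query are placeholders, and results returned by the API can differ from those of the interactive search page.

```python
import requests

API_KEY = "..."           # Google API key
SEARCH_ENGINE_ID = "..."  # Programmable Search Engine ID (the cx parameter)

def google_dork(query: str) -> list:
    """Run a search query through the Custom Search JSON API and return result URLs."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query},
        timeout=10,
    )
    response.raise_for_status()
    return [item["link"] for item in response.json().get("items", [])]

# Example dork: pages whose title advertises a specific (hypothetical) software version
for url in google_dork('intitle:"Example Server 1.2.3"'):
    print(url)
```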

Metadata Metadata are very helpful for the management of files on a computer. They carry a considerable amount of information about the file and can often reveal a lot. Images can yield information about the camera that took the picture, the date it was taken, and sometimes even the location. Different documents, such as PDF files, can contain information about the author or the system used to create them. FOCA [127] and Metagoofil [128] can find all files of a specific type on a given domain, obtain metadata from a document or a website, and find any similarities in the metadata of multiple files. ExifTool [129] is an offline tool for image metadata extraction, while Google Images [130] can perform a reverse image search to find related images and websites containing this image.
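A small sketch of pulling metadata out of a file by calling ExifTool from Python is shown below; it assumes the exiftool binary is installed and available on the PATH and uses its JSON output mode. Which tags are present depends entirely on the file.

```python
import json
import subprocess

def file_metadata(path: str) -> dict:
    """Extract metadata from a file using the locally installed exiftool binary."""
    output = subprocess.run(
        ["exiftool", "-json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(output)[0]  # exiftool returns a JSON array, one object per file

meta = file_metadata("photo.jpg")
for key in ("Model", "CreateDate", "GPSPosition"):
    print(key, meta.get(key))
```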

Geolocation One of the useful pieces of information about a particular target, or at least about something that is known to be related to the target (e.g., a piece of text or an image), is its geolocation. Twitter, for example, allows searching posts by their GPS coordinates using the geocode keyword (e.g., geocode:<latitude>,<longitude>,0.1km). As already discussed in the previous paragraph, files often contain information about geolocation. Flickr, an image and video sharing service, provides a map [131] showing the geolocation of all images containing a geotag. CLAVIN (Cartographic Location And Vicinity INdexer) [70] enables the extraction of geolocation information from textual data, such as social media posts, by using natural language processing and machine learning. The geolocation is not only extracted from the text (e.g., by finding names of cities or countries) but also compared with the rest of the document to make it more precise. CLAVIN is currently not under active development.

Data Archives In general, OSINT is usually performed using tools described in Section 3.1, i.e., a target keyword, an IP address for example, is queried in multiple tools, and the results are manually or automatically collected and evaluated. However, sometimes this approach does not provide all the answers, and gathering data in bulk and the subsequent analysis might reveal more information. Finding an IP address in a block of data with a known origin can provide context or some other identifying data, such as an e-mail address.

Due to the sheer amount of data on the Internet, collecting all the data would require a tremendous amount of storage. For this reason, specific parts of the Internet content need to be targeted, e.g., services providing data with a certain origin and purpose. The Global Terrorism Database [66] collects information about international terrorist attacks that have occurred since 1970. An interesting source of data is also grey-area content, for instance, Wikileaks [11] or data leaks in general. However, this type of data might not fit some definitions of OSINT, as the content is not originally meant to be public.


Pastebin is a popular type of service used to store and share plain text, for example, source code or any other text that is formatted or is too long to share directly through a messaging application. Pastebins are usually public, and anyone can access the text shared by other users. Since some users are not aware of this or do not consider it a problem, they might use this service to share private information. Additionally, pastebins are often used to share private information on purpose, such as database leaks, meaning that they can be a good source of information for an OSINT investigation.

There are multiple ways data from pastebins can be gathered. PasteLert [132] is a service that sends an e-mail whenever a search term appears in a new paste. Sniff-Paste [133] scrapes pastebins, stores them in a database, and searches for noteworthy information. There are also the so-called pastebin dumps that provide all pastes in one place.

Web Scraping Parts of the Internet with potentially valuable information can also be collected locally. Periodical aggregation of some websites might be beneficial since the content of the Internet changes constantly, and valuable information could disappear before the time of the investigation. To achieve this, one can use web scraping, a technique used for extraction of data from websites, or digital libraries maintaining an archive of Internet content, such as the Wayback Machine [134].

There are many tools and libraries for web scraping. Web Scraper [135] provides an interface for interactive manual web scraping. Scraper API [136] handles all the small details required for scraping, such as proxies. ScrapeSimple [137] is a service that builds custom scrapers based on the requirements of the customers. Scrapy [138] is an open-source Python framework for building and running scrapers; a minimal spider is sketched below. A special case for web scraping is the deep web, i.e., the part of the Internet that is not indexed. Just like special search engines are needed to explore the deep web, web scrapers with different algorithms have to be used. The Ahmia browser provides its code for scraping of the deep web as open source [139]. It is based on the Scrapy library, and it saves the results into an Elasticsearch database. To use it, both their code for indexing [140] and the Tor browser need to be installed and running.
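A minimal Scrapy spider is shown below; the target site is hypothetical, and a real scraper would add politeness settings, error handling, and proper item pipelines.

```python
import scrapy

class LinkSpider(scrapy.Spider):
    """Follow pages within one site and record every page title found."""
    name = "link_spider"
    allowed_domains = ["example.com"]          # hypothetical target
    start_urls = ["https://example.com/"]
    custom_settings = {"DOWNLOAD_DELAY": 1.0}  # be polite to the server

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

# Run with:  scrapy runspider link_spider.py -o results.json
```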

3.2 OSINT Automation

Even with the broad selection of OSINT tools, searching for valuable information about the target can be overwhelming. To overcome this obstacle and make things easier for the OSINT investigator, proprietary products such as Intelligence X [141] and ShadowDragon [142] exist.

Intelligence X is an OSINT software designed to allow the user to perform any kind of open source intelligence. The search engine uses selectors such as an IP address, URL, or Bitcoin address and explores different sources of OSINT, including the dark web, public data leaks, and document sharing platforms. Additionally, it keeps a data archive of historical results.

ShadowDragon is a framework for OSINT collection, monitoring, and analytical investigation with five tools dealing with different aspects of OSINT. MalNet collects threat information about domains, IP addresses, and malware samples, and searches for any correlations. SocialNet tracks individuals across different social media networks and tries to uncover networks of accounts. OIMonitor brings together intelligence from different sources and provides a graphic interface for real-time monitoring of any new information. AliasDB keeps a database of data aggregated in the past to preserve information that might not be accessible anymore. Spotter creates an environment for active interaction with the target by redirecting the target to a website that keeps track of the whole communication.

TheHarvester [143] is a simple open-source tool for automated aggregation of OSINT about e-mail addresses, domains and subdomains, URLs, and IP addresses, designed to be used in the early stages of penetration testing. It is included in the default installation of Kali Linux. Most of the data sources used by theHarvester are various search engines, starting from general-purpose services like Google, Bing, and DuckDuckGo, to specialized engines, including Shodan and ThreatCrowd. All the data found by the search engines are parsed and examined to find any valuable information. For example, it uses Google to search for Trello boards, Twitter accounts related to a specific domain, and LinkedIn users. TheHarvester also utilizes Intelligence X, another OSINT automation tool.

The goal of this thesis was to create a tool for OSINT automation. Some of the noteworthy existing tools with similar objectives are described in the remainder of this chapter and compared to the implementation in Section 5.2.

3.2.1 Recon-ng

Recon-ng [144] is an open-source framework for OSINT automation. It is highly modular and customizable, with the base framework providing all the essential functionality and accessory functions needed to perform the investigation, while the data are collected using separate modules. The advanced command-line interface offers command completion and interactive help with extensive documentation for all the commands and subcommands.


Additionally, Recon-ng also has a simple web interface. All the collected data are stored in an SQL database and managed through the db command. The database treats all the information stored there as potential new input for subsequent data gathering. Snapshots of the database can be created for simple data recovery in case of a failure. To present the results in a human-readable format, CSV and HTML files can be generated. Workspaces create the possibility to have multiple environments with independent configurations and database instances, allowing the user to switch between them as needed. Recon-ng can be run automatically through a so-called resources file containing all the commands for the framework to run. The modules are defined by an abstract class with a well-defined interface and some accessory functions from which new modules need to inherit. All the modules are available in a place called the Recon-ng Marketplace [145], which is an independent GitHub repository. As of now, the Marketplace contains around 100 different modules. Third-party modules that are not part of the Marketplace can also be used. These are loaded directly from a local Recon-ng directory. The abstraction allows the users to utilize any information source by simply creating a wrapper class around that source since the framework only requires the class to implement the interface. The marketplace and modules commands are the entry point for the administration of the modules. Each module can define the type of input it takes, and the database is scanned to search for any data of this type that could be used as a new input for the module.

3.2.2 Maltego

Maltego [146] is a very well-known software platform for OSINT collection and analysis. It is a commercial product available in multiple versions with different features and limitations. Its approach to managing data sources is similar to the one used by Recon-ng. The modules, called transforms, are small pieces of code that transform data from any tool or product to a well-defined format. All transforms are available through the Transform Hub [147]. Just like Recon-ng's Marketplace, there is a large number of different sources for information gathering, ranging from the tools discussed in Section 3.1 to other OSINT automation tools, e.g., ShadowDragon. A substantial advantage of Maltego over Recon-ng and other open-source tools is the visualization and analysis capabilities that allow the investigator to study connections between different e-mail addresses, domains, and other elements. The aggregated information can be represented by a custom entity, e.g., simple keywords such as IP address and company, or more advanced

data types, like documents and social networks. Similar to Recon-ng, entities can be used as an input for further data collection. Entities are visualized as nodes, and in the case a connection was found, they are clustered into networks of nodes. Figure 3.2 shows an example of a graph generated by Maltego.

Figure 3.2: Illustration of Maltego’s graphical interface showing a search graph when nist.gov website is queried [148].

The whole process of OSINT investigation using Maltego starts with the selection of the transforms to use. This depends on the type of information the user uses as the initial entity and what they want to find. Another consideration is that many transforms require an API key that is often paid. Once the transforms are loaded, the entity is queried, and all new entities found by the initial data collection are displayed in the graph. This can be repeated for all newly discovered entities, and eventually, after all desired data discovery is finished, a directed graph shows all entities found during the whole investigation. Node connections show the relationship between different entities as well as the progression of how the entities were found. Graph nodes can also be marked with a score representing the confidence the user has in the result. For example, if multiple DNS names resolve to a single IP address, this address might have a high confidence score. To make graphs with many nodes and connections easier to navigate, Maltego also enables users to add notes, attachments, and bookmarks.

3.2.3 SpiderFoot

SpiderFoot [149] is another open-source framework for OSINT automation. Like Recon-ng and Maltego, it is designed to be highly modular and provide all the necessary functions for data manipulation and storage. The significant advantage of SpiderFoot is the number of modules it implements, which is

more than 170. It has both a command-line interface and a web interface. The starting points SpiderFoot can scan are domain names, IP addresses, hostnames/subdomains, subnets, ASNs, e-mail addresses, phone numbers, and human names. The features of the paid version SpiderFoot HX that are not available in the open-source version include Tor browser integration for deep web scanning, multi-target scanning, continuous monitoring with alerts and e-mail notifications, and a correlation engine, which looks for anomalies and other notable results. SpiderFoot's web interface provides an easy way to configure the app and the modules, add API keys, choose what modules to use for a scan, debug, and visualize the results in the form of a table and a graph. The graph representation is similar to the one in Maltego since it shows results as nodes and displays relationships by clustering them. Selected results can be marked as false positives, which also marks child elements and deletes them from the graph. SpiderFoot HX can run the data collection step-by-step to inspect how each result is discovered. Figure 3.3 shows the web interface of SpiderFoot.

Figure 3.3: Illustration of SpiderFoot's web interface showing a list of results for a particular scan [150].

4 Pantomath: Tool for Automated OSINT Collection

The main goal of this thesis was to implement a tool for an automated collection of open-source intelligence. Pantomath1 is a highly modular framework that provides a complete environment for collecting and evaluating OSINT about IP addresses, e-mail addresses, and domain names. The framework implements all the functionality required throughout the process of OSINT, but separate modules perform the collection itself. New modules can be integrated by merely implementing a well-defined interface. Some of the gathered data are evaluated in terms of their reliability. Finally, all the results are presented in a structured output. The rest of the thesis is organized as follows. Section 4.1 defines the problem at hand, describes some of the main challenges, and explains how Pantomath strives to solve them. Section 4.2 outlines the high-level architecture of the tool and the functionality it provides. Section 4.2.3 goes into more detail about the implemented modules, and finally, Section 4.3 establishes a model for reliability estimation of some of the modules. Section 3.2 from the previous chapter describes existing tools with similar objectives, and Section 5.2 from the following chapter compares them to Pantomath, states the major differences, and discusses the advantages and disadvantages of each.

4.1 Problem Statement

There are many ways a tool that automatically collects OSINT could be implemented since OSINT is an extensive topic, and the amount of available data is enormous. The collection of OSINT also poses many challenges, many of which were discussed in Section 2.1. If the user aims to find specific information, such as social media network contacts and posts of a particular person or the footprint a company has on the Internet, various advanced methods specializing in these tasks can be utilized. Section 2.3 describes some of the more innovative approaches aiming to tackle specific OSINT challenges and tasks. Using social media networks as an example, one could take advantage of web scraping, sentiment analysis, machine translation, and deanonymization to create a complete profile of various users of these social media networks. However, each problem the user might need to solve requires a different approach, and building a state-of-the-art tool for each of these would

1. Pantomath is an English word for a person who wants to know and knows everything.

be a laborious task. Instead, existing tools and services that already implement non-trivial gathering of OSINT can be utilized. Chapter 3 provides an overview of such sources. Therefore, the objective of Pantomath is not to actively collect data and transform them into valuable information, but rather to automate the collection of OSINT by employing the existing tools. Since the number of possible sources is immense and new ones can appear in the future, one very desirable feature for the tool is an easy integration of additional sources. This approach brings new disadvantages that have to be taken into consideration. By using existing services, Pantomath relies on their correct and continuous operation and needs to address any changes that break the current implementation. These services are also often built into commercial products and, as such, provide only limited usage for free or no free version at all. For some types of information, there are no services that are not paid, meaning that an API key needs to be bought or the information is not available. One example of such a tool is BuiltWith [111], which analyzes the technology stack of a website, thus providing very valuable information about a domain. However, the price for access to BuiltWith's API starts at $295 per month. Ultimately, the selection of which paid services to choose heavily depends on the use case. One of the most significant challenges of OSINT that has not been addressed by any of the existing OSINT automation tools2 is the evaluation of the reliability of the collected information, or in other words, getting as close to Validated Open Source Intelligence described in Chapter 2 as possible. Validated OSINT is defined as OSINT with a high degree of certainty, which is very hard to achieve without a certain level of human involvement. Any OSINT investigation will eventually require a person with some knowledge of the context of the investigation and OSINT itself to evaluate the gathered data. The goal of Pantomath is not to provide any guarantees of the information reliability but rather to allow the investigator to make more informed decisions by producing a reliability estimate. Collecting OSINT is usually not just about finding information about one particular target but rather a repetitive cycle where follow-up searches are refined based on information discovered in the previous iterations. By performing these sequential searches, relationships between different targets can be revealed. Similarly to the information itself, the reliability of the discovered targets and their relationships with other targets can be estimated.

2. To the best of our knowledge, no existing tools provide an automatic reliability estimation.


Pantomath aims to automatically perform follow-up searches to reveal related targets and their relationships and to provide a reliability estimate of these relationships. Another factor that has not been considered by existing OSINT automation tools is the footprint the OSINT investigation leaves behind when performing the data collection. Generally, this is not a concern if the search queries are not confidential. However, if the target itself is private information or the investigator wishes to be disconnected from the query, additional techniques have to be used to address these requirements. The interaction with various OSINT services instead of direct communication with the targets provides an intermediary that prevents the target or third parties from linking the queries to the investigator. Nevertheless, the service providers still know the user's specific queries, which is amplified if the same provider manages several of the queried services. Therefore, this level of indirection is not sufficient for highly confidential inquiries. Pantomath can be used in three modes of operation to accommodate varying anonymity requirements, and these are presented in Section 4.2.2.

4.2 Architecture and Functionality

The architecture of Pantomath is separated into two main parts. The base framework provides a complete environment for automated collection of OSINT, with independent modules performing the data gathering itself. The framework defines an interface that each module needs to implement, and the interface is described in more detail in Section 4.2.3. Search queries are narrowed down to simple keywords representing entities on the Internet. Currently, the target can be an IP address, a domain name, or an e-mail address, but the selection can be easily extended with user names, phone numbers, Bitcoin addresses, and other identifiers. All results that might be used as new targets, i.e., any of the IP addresses, domains, and e-mail addresses related to the target, are added to a pool and used for follow-up searches. The decision of what should be considered as a new target is left for each module. Each new target keeps track of how it was discovered, including the target used in the previous query, the relationship between these targets, and the module that disclosed it. Therefore, a kind of a search tree forms, where the initial target serves as the root of this tree and has a depth of 0. Figure 4.1 shows an example of a search tree with the domain fi.muni.cz used as the seed. The nodes represent newly found targets, and the edges represent the discovery by various modules.


Figure 4.1: An illustration of the search tree when querying fi.muni.cz up to a depth of 2.

Each target discovered when querying the initial value is a child node of the root, has a depth of 1, and is queried the same way as the initial value. Again, new targets are extracted from the results. The whole process is repeated until new nodes at the specified depth are found. The depth can be specified before the query is started and increased if necessary. Generally, the deeper the target is, the weaker its relationship is with the initial target, and for some better-known targets, there might be hundreds of new targets even at the first level.
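The sketch below illustrates how this breadth-first expansion up to a configurable depth could be driven; the module objects, their query method, and the result format are illustrative placeholders rather than Pantomath's actual API.

```python
from collections import deque

def expand_targets(seed, modules, max_depth=2):
    """Breadth-first follow-up search: every target a module discovers
    becomes a new query until the configured depth is reached."""
    queue = deque([(seed, 0, None, None)])  # (target, depth, parent, module)
    seen = {seed}
    edges = []                              # discovery tree as a flat list

    while queue:
        target, depth, parent, via = queue.popleft()
        edges.append({"target": target, "depth": depth,
                      "parent": parent, "module": via})
        if depth == max_depth:
            continue                        # leaves are recorded, not expanded
        for module in modules:
            results = module.query(target)  # hypothetical module interface
            for new_target in results.get("targets", []):
                if new_target not in seen:
                    seen.add(new_target)
                    queue.append((new_target, depth + 1, target, module.name))
    return edges
```

Keeping the parent and the discovering module for every node is what allows the relationships shown in Figure 4.1 to be reconstructed from the flat result list.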

4.2.1 Base Framework

The Pantomath framework contains various functions for smoother operation of the tool and more convenient data processing. All events happening throughout the operation of the tool are logged to a file for easier debugging. The fetch_url function is the sole point of communication with the outside world. This design enables the use of the stealth mode described later in Section 4.2.2, which utilizes the Tor network to interact with the Internet. Besides the possibility to use the stealth mode, integration of Tor also allows for potential future implementation of modules that require it to communicate with onion services3. The fetch_url function also keeps track of timestamps of the last time each service was queried and waits for a few seconds before sending another request. This prevents potential situations where the tool sends too many

3. Onion (or hidden) services are anonymous services only reachable through the Tor network.

requests in a short period of time to the same service and gets banned. The delay is specified in the configuration file, and it can be different for each service. Moreover, the framework contains functions for extraction of IP, e-mail, and Bitcoin addresses from blocks of data and validation of these values, including a function that attempts to fix invalid e-mail addresses (e.g., by removing trailing characters that are added to disguise the addresses and make it harder to extract them).

The database software used in the offline mode described later in Section 4.2.2 is PostgreSQL [151], an object-relational database system allowing a great deal of flexibility for the schema by having a wide variety of objects it can store. The framework provides all the necessary functions for the database management, such as for table creation and data storage and retrieval, and leaves the management itself to each module. This gives each module enough flexibility in how it stores its data and how the data are retrieved. Specifically, each table within the database stores three attributes:

• the lookup key,
• the timestamp of when the item was added,
• a JSON object storing all the additional information.

In most cases, the lookup key is just a string representing the retrieved target (e.g., a specific IP or e-mail address), but it is also possible to use a CIDR block as the key. In that case, the target IP address can be searched by checking whether it belongs to some of the CIDR blocks. Since the modules manage the data storage and retrieval, they can use the lookup key as a regular unique ID and implement some more complex data indexing. The timestamp is used to discard items that are expired, and it is managed by the framework. This feature can be disabled, and the expiry time can be specified in the configuration file.

Pantomath can be used either through an API or a command-line interface. The framework is also built to be easily extendable with other interfaces, such as a web interface, a custom API, and so forth. The command-line interface provides the commands listed in Table 4.1. The update command updates the offline data for all modules. The query command queries the specified target in all modules, stores the results, and adds any new targets found by the modules to a pool used for follow-up searches. The overt, stealth, and offline commands indicate which mode of operation should be used. The query results can be exported to a JSON file or printed out to the standard output in a structured form using the export and print commands.
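As a rough sketch of how a module's table with these three attributes could be created and used (the connection parameters, table name, and helper functions are illustrative assumptions, not Pantomath's actual schema):

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(dbname="pantomath")  # hypothetical database name

def init_table(table):
    """Create a per-module table: lookup key, insertion timestamp, JSON payload."""
    with conn, conn.cursor() as cur:
        cur.execute(
            f"""CREATE TABLE IF NOT EXISTS {table} (
                    lookup_key TEXT,
                    added      TIMESTAMP DEFAULT now(),
                    data       JSONB
                )"""
        )

def store(table, key, data):
    """Store one parsed record under its lookup key."""
    with conn, conn.cursor() as cur:
        cur.execute(f"INSERT INTO {table} (lookup_key, data) VALUES (%s, %s)",
                    (key, Json(data)))

def lookup(table, key):
    """Return the JSON payloads stored for a key (expiry handling omitted)."""
    with conn, conn.cursor() as cur:
        cur.execute(f"SELECT data FROM {table} WHERE lookup_key = %s", (key,))
        return [row[0] for row in cur.fetchall()]
```

A module that uses CIDR blocks as keys would instead store the block in a suitable column type and match the queried IP address against it during retrieval, which is why the framework leaves the storage and retrieval logic to each module.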


Command   Description
exit      Exit the CLI
export    Export the results into a JSON file
help      Display a menu with descriptions of all available commands
modules   Display a list of loaded modules with a description
offline   Switch to offline mode: only offline sources are queried
overt     Switch to overt mode: all sources are queried
print     Print the results in a structured form
query     Query the target in all available modules
stealth   Switch to stealth mode: Tor is used to fetch the data
update    Update the offline data in all available modules

Table 4.1: Commands provided by the command-line interface
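Using the commands from Table 4.1, a typical session could proceed roughly as follows; the prompt and the exact argument syntax are illustrative assumptions rather than the tool's exact interface:

    pantomath> update
    pantomath> stealth
    pantomath> query fi.muni.cz
    pantomath> print
    pantomath> export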

Figure 4.2 provides a high-level overview of the architecture and what happens when a specific target is queried. The representation is simplified only to show the important aspects. Before running the query, the database needs to be updated to make all offline data available, which means the blacklists module shown in the schema retrieves data that was downloaded and parsed in advance. Each module fetches and parses data from the service it uses and returns the results to the framework. The framework is also notified if a new target was found and either queries this target or asks the user if it should continue. Once the targets are queried, all results are returned as a JSON object.

4.2.2 Modes of Operation

To address the concerns formulated in Section 4.1 regarding traces left behind when performing the collection of OSINT, Pantomath offers three modes of operation for different anonymity requirements: an overt mode, a stealth mode, and an offline mode. The overt mode represents the regular operation of the tool, i.e., all modules are queried, and an Internet connection is required. This mode can be used whenever the user is not concerned about the anonymity of the queries, e.g., when a company or privacy-conscious people investigate their presence on the Internet. As already discussed in Section 4.2.1, all the communication between Pantomath and the Internet goes through the fetch_url function. This design


Figure 4.2: A high-level overview of Pantomath's architecture.

enables the stealth mode by integrating the Tor network. In this mode, all requests sent to any of the used services go through the Tor network, making it much more difficult to trace the request to the machine where Pantomath is running. The use of Tor is just an illustration of how the stealth mode can be implemented. Besides Tor, the fetch_url function could be extended to utilize custom proxies or other forms of anonymization, or the tool could be deployed on a cloud infrastructure. Although the stealth mode supports the use of all modules, one important consideration is the use of API keys. For the services that require an API key, all requests can be easily correlated to the user even though the requests are sent through intermediary nodes since API keys are generally tied to a registered account. This obstacle could be bypassed by registering accounts using only anonymous e-mail accounts and fake information. However, many services employ non-trivial protection against this technique. The offline mode takes the anonymity a step further by performing queries with no access to the Internet. Instead, all data available as a whole are downloaded, parsed, and stored in the database in advance using either the overt or stealth mode. Once the database contains fresh values, connection to the Internet is no longer necessary, and the offline mode can be activated. Each query checks whether anything related to the target is stored in the database. By downloading data as a whole and not requesting information for various targets separately, the data provider or anybody who observes what was downloaded only knows about the possession of this data and not about what exactly the data are used for. Preprocessing the data in advance and storing them in the database also brings additional performance

advantages whenever multiple queries are executed. The data are parsed only once, and all subsequent queries just check the database, which is a much faster operation. From the list of modules providing some offline data in Section 4.2.3, it is apparent the selection is relatively small. In general, services rarely offer complete access to their data for free, and usually not even as a paid service. To have access to more data in the offline mode, the functionality provided by the existing services queried by Pantomath would need to be implemented within the tool. As discussed in Section 4.1, this would be a very laborious task entirely out of the scope of this thesis. However, for some of the modules, open-source tools providing similar functionality exist, meaning that Pantomath would only need to implement the continuous data collection. One example of such a tool is MASSCAN [115], which provides information about open ports of IP addresses similarly to Shodan [4].
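To make the stealth mode more concrete, the sketch below shows one way a per-service rate limit and optional routing through the local Tor SOCKS proxy could be combined in a single fetch_url function; it uses the requests library (with the PySocks extra installed) and the function signature, delay handling, and proxy settings are illustrative assumptions, not Pantomath's actual implementation.

```python
import time
import requests

_last_request = {}                 # per-service timestamp of the last request
_tor_proxies = {"http": "socks5h://127.0.0.1:9050",
                "https": "socks5h://127.0.0.1:9050"}

def fetch_url(url, service, stealth=False, delay=2.0):
    """Single point of outbound communication: wait out the per-service delay,
    then fetch the URL, optionally through the Tor SOCKS proxy."""
    elapsed = time.time() - _last_request.get(service, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    response = requests.get(url, proxies=_tor_proxies if stealth else None,
                            timeout=30)
    _last_request[service] = time.time()
    return response
```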

4.2.3 Modules

Pantomath is designed to be highly modular, where new modules can be added by implementing a well-defined interface. The interface specifies three attributes – the type of targets the module takes as an input (IP address, domain name, e-mail address, or multiple of these), whether it provides data for the offline mode, and a description of the module. The only function that is mandatory for all modules is query, which takes a target, looks for all the information it can find, and returns the results in the form of a dictionary. There are no constraints on the structure of the results; they only need to be JSON-serializable. If the module provides offline data, i.e., data that can be downloaded completely, it needs to create a table in the database using the init_database function. After the tables are initialized, the module has to download, parse, and store the data in the database using the update_database function. As already mentioned, the table's schema is flexible, and each module can implement non-trivial indexing of its data. Finally, the offline_query function is an offline equivalent of the query function, which means it only retrieves data stored in the database. As discussed in Chapter 3, the list of all OSINT sources is enormous, and implementing all of them would be redundant and entirely out of the scope of this thesis. Instead, the focus regarding the modules was twofold – to create a framework that would allow for straightforward integration of new modules and to implement some of the more interesting modules that provide diverse types of information about all the possible targets, i.e., IP addresses, domain

names, and e-mail addresses. Table 4.2 shows all the implemented modules, which target types they take as an input, and whether they require an API key. The remainder of this section describes the modules in more detail.
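A minimal skeleton of a module implementing the interface described above could look as follows; the class layout, attribute names, and the db helper are illustrative assumptions and do not reproduce Pantomath's actual code.

```python
class ExampleModule:
    """Skeleton of a module: three descriptive attributes plus the
    query/init_database/update_database/offline_query functions."""

    target_types = ["ipv4", "domain"]   # which targets the module accepts
    offline = True                      # whether it provides offline data
    description = "Illustrative module returning placeholder data."

    def query(self, target):
        # Fetch and parse data about the target; any JSON-serializable
        # dictionary is an acceptable result.
        return {"target": target, "findings": []}

    def init_database(self, db):
        # Create the module's table (lookup key, timestamp, JSON payload).
        db.create_table("example_module")

    def update_database(self, db):
        # Download the complete dataset, parse it, and store it for offline use.
        for key, record in self._download_dataset():
            db.store("example_module", key, record)

    def offline_query(self, db, target):
        # Offline counterpart of query(): only read what update_database stored.
        return {"target": target, "findings": db.lookup("example_module", target)}

    def _download_dataset(self):
        # Placeholder for the module-specific bulk download and parsing.
        return []
```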

Module           Target Types          Offline   API key    Follow-up
blacklists       IPv4, Domain          X
crtsh            Domain                                     X
darkweb          IPv4, Domain          X                    X
datacenter       IPv4                  X
dns_servers      IPv4                  X
dns              IPv4, Domain                               X
geolocation      IPv4                  X         multiple
haveibeenpwned   Domain, Email                   X
ip2asn           IPv4                  X
openproxy        IPv4                  X
passive_dns      IPv4, Domain                    X
passive_ssl      IPv4                            X
pgp              Domain, Email                              X
port_discovery   IPv4                            multiple
psbdmp           IPv4, Domain, Email                        X
spyonweb         IPv4, Domain                    X          X
threat_intel     IPv4, Domain                    multiple
torexits         IPv4                  X
urlscan          Domain                                     X
whois            IPv4, Domain                    X
whois_reverse    Email                           X          X

Table 4.2: List of implemented modules with information about which types of targets they take as an input, whether they provide offline data, whether an API key is required, and whether the module can find new targets for follow-up searches.

geolocation This module queries various websites providing geolocation information about an IP address. The GPS coordinates are extracted from each result, or, if the website returns only an address without the coordinates, they are resolved from the address using the OpenStreetMap API [152]. The coordinates are then clustered using a threshold specified in the configuration file, and a single pair of coordinates is computed for each cluster. The reliability of these cluster coordinates is estimated based on the reliabilities of the websites that returned the coordinates within the cluster. These are specified in the configuration file and are continuously updated with values from new queries. The calculations use the model described in Section 4.3.2, and the initial values are based on measurements conducted in Section 5.1.2. Since most of the websites return the results as a JSON response, the parsing is performed using a configuration file, meaning that new sources of geolocation that yield JSON responses can be added by merely specifying the URL and the format of the response in the configuration file (a sketch of this approach follows the list below). Maxmind.com provides a free version of their geolocation database, which is downloaded during the database update and can be queried in the offline mode. The module could be extended to use a paid version of their database and provide more recent and accurate results. Seven of the implemented services require an API key with varying limitations of the free version. The currently implemented websites are the following:

• IPinfo [97]
• FreeGeoIP [153]
• IPify [154]
• IP-API [155]
• IPgeolocation [156]
• IPdata [157]
• Extreme-IP-Lookup [158]
• Geoplugin [159]
• IPwhois [160]
• IPregistry [161]
• WhoisXMLAPI [162]
• IPlocate [163]
• Utrace [164]
• Maxmind [165]
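The configuration-driven parsing mentioned above can be sketched as follows; the URL template, field paths, and service name are hypothetical and only illustrate how a new JSON geolocation source could be described without writing new parsing code.

```python
import requests

# Hypothetical configuration entry: a URL template plus the JSON paths of the
# latitude and longitude fields in the service's response.
SOURCE = {
    "name": "example-geo",
    "url": "https://geo.example.com/json/{ip}",
    "lat_path": ["location", "lat"],
    "lon_path": ["location", "lon"],
}

def resolve(ip, source=SOURCE):
    """Fetch the JSON response and walk the configured paths to the coordinates."""
    data = requests.get(source["url"].format(ip=ip), timeout=30).json()

    def walk(path):
        value = data
        for key in path:
            value = value[key]
        return float(value)

    return walk(source["lat_path"]), walk(source["lon_path"])
```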

threat_intel This module queries multiple cyber threat intelligence feeds to check whether the IP address or the domain name is considered malicious. Each CTI feed provides an evaluation of the risk associated with the target. These are used to estimate the reliability of each feed using the model described in Section 4.3.1 and the total risk value for a particular target. The values are set in the configuration file and continuously updated, with the measurements performed in Section 5.1.1 used as the base. Apart from ThreatCrowd, all of the implemented services require API keys. However, all of them provide a limited number of requests for free. The following is the list of currently implemented feeds:

• ThreatCrowd [108]
• VirusTotal [166]
• AlienVault OTX [107]
• MetaDefender [109]

port_discovery This module queries multiple port discovery services to see which ports are open for the given IP address and what services are running there. The ports returned by different services are compared to estimate the reliability using the model defined in Section 4.3.3. Each estimate is based on the measurements performed in Section 5.1.3 and is continuously updated with values collected in new queries. The following are the currently implemented services (all require an API key):

• Shodan [4]
• Spyse [112]
• Censys [113]

blacklists This module downloads blacklists of IP addresses and domain names with various criteria for addition of new entries. The blacklists are parsed, and all the entries are stored in the database. The blacklists contain addresses associated with botnets, spamming and phishing activities, and so forth. New blacklists available in a structured file can be added to the configuration file and parsed automatically. For completely unstructured blacklists (e.g., available as an HTML file), a separate parsing function can be implemented. The data only needs to be stored in the database and retrieved accordingly in the query function. The following is the list of currently implemented blacklists:


• SSLBL Abuse.ch [104]
• Feodotracker Abuse.ch [167]
• Myip.ms [167]
• AlienVault [168]
• Cinsscore [169]
• Blocklist.de [170]
• Spamhaus [171]
• Openphish [172]
• Zerodot1 [173]
• Malwaredomains [105]

crtsh This module searches historical certificates of the specified domain at crt.sh [99]. If the user enables it in the configuration file, each certificate is fetched from the website to find additional information. All the e-mail addresses associated with these certificates are added to the pool of newly found targets.

darkweb This module parses a CSV file containing data scraped from the dark web and used for categorization of the websites in [174]. Each entry in the file contains a link to the website, its content, possible locations resolved from the content using CLAVIN [70], and the website's category. The module looks for any IP addresses and domain names in the website's content, and for each discovered IP or domain, the whole entry is saved into the database (i.e., if no IP or domain is found in the content, the entry is skipped). This module only illustrates how some of the library functions can be used when large volumes of data are processed because all of the steps to create the dataset need to be performed manually.

datacenter This module downloads and parses a list of IP ranges owned by large companies and used as datacenters. The list is maintained in the IPcat project [175]. Each entry contains the range, the company's name, and a link to its website.

dns This module resolves the domain or IP address using Google DNS [176]. The answer is added as a new target for a possible follow-up search.

dns_servers This module downloads and parses a list of IP addresses of DNS name servers maintained by Public-DNS [177], with additional information such as the name and the server's location.

haveibeenpwned This module uses haveibeenpwned.com [87] to check if a password of the specified e-mail address was leaked in the past or if the domain was breached. Results of both queries include additional information about the breaches. An API key is required to use this module and costs $3.5 per month.

ip2asn This module downloads and parses IP2ASN's [178] database of ASN information for different IP ranges.

openproxy This module downloads and parses a list of IP addresses that are open proxies according to multiproxy.org [179]. Each entry also contains a port used to run the proxy.

passive_dns and passive_ssl These modules search for the IP address or the domain name in CIRCL.LU's databases of historical DNS records [180] and X.509 certificates [181], respectively. Access to both of these databases needs to be requested and is granted only to researchers and security incident handlers.

pgp This module searches for the domain name or the e-mail address in PGP public key servers, namely The.Earth.li [182] and Key-Server.io [183], which is used if the first server is not responsive. If anything is found for the queried domain, all e-mail addresses associated with the domain are retrieved and added as new targets. This module is somewhat fragile, as both of these websites are sometimes not accessible.

psbdmp This module looks up the target IP address, domain name, or e-mail address in a Pastebin dump [184]. If any dump containing the target is found, the data is retrieved and searched for other potential targets.

spyonweb This module looks up the IP address or domain name on SpyOnWeb [185]. For an IP address, the service looks for all domains that are hosted on the IP and adds these domains as new targets for follow-up searches. For a domain name, it looks for all Google Adsense and Google Analytics IDs the domain uses and then searches for additional information about the IDs,

including all domains sharing these IDs. These domains are again used as possible new targets. The service requires an API key, with the free version providing 10000 queries per month. Three paid versions are offered with prices starting from $6 per month.

torexits This module downloads and parses a list of Tor exit nodes maintained by Torproject.org [186].

urlscan This module searches the domain at urlscan.io [100]. The service visits the specified URL and records all activities happening during this process, such as which domains and IP addresses were visited. These domains and IP addresses are considered for follow-up searches. The results also include all resources of these domains, a screenshot, and much more. Based on the configuration, either only URLs with the detailed results are attached to the results, or they are fetched and included.

whois This module looks for the domain's whois data on Whois XML API [162]. The service requires an API key, where the free version provides 500 credits (searches) per month. Their database can also be downloaded, with options to download 1 million entries for $240 or the whole database for an undisclosed price.

whois_reverse This module looks for reverse whois data (i.e., information about domains registered with the e-mail address) either on Whoxy [187] or on Whoisology [188] if Whoxy is not available or does not return any results. Both services require an API key. Search credits for Whoxy can be bought with prices ranging between $4 and $8 per thousand credits based on the number of credits bought. Whoisology costs $50 per month with a maximum of 2500 credits and $35 for every additional 2500 credits. The domains that are associated with the e-mail address are added to the pool of new targets.

4.3 Reliability Estimation

Besides the collection of OSINT itself, one of the requirements for Pantomath was to provide an estimation of the reliability of the collected data. As discussed in Section 2.1, the best way to determine the truthfulness of OSINT is to establish the reliability of the sources the information was retrieved from and to use multiple sources providing the same type of information

and compare the results [18]. Additionally, having context and query-specific information is essential to avoid collecting information not relevant to the inquiry [19]. Pantomath narrows the inquiries down to simple keywords and uses tools that provide a specific type of information about the given keyword, such as the geolocation of an IP address. Therefore, the collected information is implicitly relevant to what the user is looking for. It is important to note that the reliability estimation does not necessarily make sense for all the results. For example, the torexits module downloads a list of Tor exit nodes directly from the Tor project website. Although it is possible to obtain this information from other sources and compare the results, it could be argued that the fact that the Tor project itself provides it gives enough confidence that the result is correct. Another example of such a case is the dns module, where multiple servers could be queried and the answers compared. Nonetheless, an established DNS server, such as the one by Google used in the module, has enough credibility to be trusted. Another factor to consider is the need for multiple sources. As discussed in Section 4.1, many services are either paid or provide only a limited number of free requests, meaning that using multiple sources for the reliability estimation can significantly increase the cost if some or all of them are paid. One of the cases where this applies is the whois_reverse module because the vast majority of reverse whois APIs do not offer any free queries. Additionally, some services provide information that is too unique to be validated, as multiple services would need to be combined to produce the information, such as the urlscan module, or there might not be other services providing it at all. Pantomath provides a reliability estimate for each new target that is discovered during the search. The seed target passed to the CLI has its reliability set to 100%. The reliability of each new target is computed using the previous target's reliability and the reliability multiplier of the module that discovered it. The multipliers are specified in the configuration file, and they can be different for each module. Currently, all modules use the same default multiplier, which is equal to 0.8. Figure 4.3 shows an example of a search tree, including the reliability estimates of all targets. As the multipliers are the same for all modules, targets with the same depth have equal reliabilities, i.e., 80% for level 1, 64% for level 2, 51.2% for level 3, and so forth. By setting different multipliers, the user can control how reliabilities for new targets are estimated. For example, the dns module might have a higher multiplier, as the relationship between the queried target and the newly discovered target is well-defined and generally has a high probability of being


Figure 4.3: An illustration of the search tree that forms when querying fi.muni.cz, with the reliability of each depth in red.

correct. On the other hand, modules such as darkweb or psbdmp, where new targets are discovered by extracting e-mail and IP addresses from blocks of data, could have lower multipliers since the relationship between these targets is unclear, and the extraction might not be precise. With varying multipliers, targets at the same depth could have different reliabilities, and some targets might even have smaller reliabilities than those in lower levels depending on the discovery chain. Additionally, the reliabilities could be used as the indicator of which targets should be queried instead of the depth, as they better represent the strength of the connection to the initial target. The model proposed by Gong et al. [56] described in Section 2.3 provides a systematic approach for reliability estimation of results collected from cyber threat intelligence feeds. A simplified version of this model is used to calculate the reliability of data in the threat_intel, geolocation, and port_discovery modules. In each module, the reliabilities of all implemented sources are estimated, and the results for a specific target are evaluated using these estimates. The initial values are set to the ones obtained in Section 5.1 and are continuously updated after each query, meaning that each time a target is queried in the module, the reliability is recalculated. The values can be reset and calculated from scratch using data provided by the user. The more values are collected and added to the reliability estimation, the more these estimates reflect how different sources perform in the scenarios the user is interested in. For example, some websites used in the geolocation module could provide accurate results for IP ranges owned by large companies but be less precise when resolving the location of independent addresses. If the user mostly investigates individuals, the initial data where commercial IP ranges were also considered might distort the sources'

precision. The same goes for the port_discovery module, where the portion of IP addresses with no open ports significantly influences the reliability of different services.

4.3.1 Cyber Threat Intelligence

The threat_intel module uses four different CTI feeds to collect information about the target. Each of these feeds returns a so-called risk value that evaluates the risk associated with the target. The risk values are normalized to a value between 0 and 1 to be comparable with each other. When a specific IP address or domain name is queried, the reliability of the results is estimated as the ratio of returned risk values and the maximum possible risk value. In the model by Gong et al., the Cymon CTI feed [189] was used, but it is currently not operational, so it was replaced by MetaDefender [109]. The original model compares many different pieces of information returned by the feeds, but MetaDefender only provides the risk value, meaning that it would not be comparable with the remaining feeds. Table 4.3 describes the symbols used in the equations.

Symbol       Description
n            number of CTI feeds
F_n          n-th CTI feed
risk(F_i)    risk value returned by CTI feed F_i

Table 4.3: Description of symbols used in the equations.

Equation 4.1 calculates the distance between two feeds. It is equal to the absolute difference between the normalized risk values.

\[ \mathrm{dist}(F_i, F_j) = \lvert \mathrm{risk}(F_i) - \mathrm{risk}(F_j) \rvert \tag{4.1} \]

The expected risk risk(F_expected) is computed by Equation 4.2 and is equal to the average of all risk values. If no value is returned by a feed, it is set to 0.

\[ \mathrm{risk}(F_{\mathit{expected}}) = \frac{\sum_{k=1}^{n} \mathrm{risk}(F_k)}{n} \tag{4.2} \]

The error of CTI feed F_i, as shown in Equation 4.3, is calculated as the distance between the risk value returned by F_i and the expected value F_expected, which is defined by risk(F_expected).

\[ \mathrm{error}(F_i) = \mathrm{dist}(F_i, F_{\mathit{expected}}) \tag{4.3} \]

The independence of CTI feed F_i is equal to the average of distances between F_i and all other feeds, as shown in Equation 4.4.

\[ \mathrm{independence}(F_i) = \frac{\sum_{k=1}^{n} \mathrm{dist}(F_i, F_k)}{n - 1} \tag{4.4} \]

The weight given to a CTI feed decreases proportionally to its independence. This is reflected in the weight of feed F_i shown in Equation 4.5. It is computed as one minus its independence divided by the maximum distance between any two feeds, which is used as the boundary line of consideration. The result is a fraction between 0 and 1.

\[ \mathrm{weight}(F_i) = 1 - \frac{\mathrm{independence}(F_i)}{\max_{j,k=1}^{n} \mathrm{dist}(F_j, F_k)} \tag{4.5} \]

Finally, Equation 4.6 computes the reliability of feed F_i. It is inversely proportional to the error divided by the maximum distance between any two feeds and proportional to the weight. Again, the result is a fraction between 0 and 1.

\[ \mathrm{reliability}(F_i) = \left( 1 - \frac{\mathrm{error}(F_i)}{\max_{j,k=1}^{n} \mathrm{dist}(F_j, F_k)} \right) \mathrm{weight}(F_i) \tag{4.6} \]

The risk values collected for a particular target T and the reliabilities of the CTI feeds are used to estimate the reliability of the results, as shown in Equation 4.7. In this case, the final value is the risk associated with the queried target, and it is attached to the results returned by the module.

\[ \mathrm{reliability}(T) = \frac{\sum_{k=1}^{n} \mathrm{risk}(F_k)\,\mathrm{reliability}(F_k)}{\sum_{k=1}^{n} \mathrm{reliability}(F_k)} \tag{4.7} \]
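A compact sketch of Equations 4.1–4.7 in code, with made-up feed names and risk values used purely for illustration:

```python
from itertools import combinations

def cti_reliabilities(risks):
    """Equations 4.1-4.6 for a dict of normalized risk values (one per feed);
    feeds that returned nothing should already be set to 0."""
    feeds = list(risks)
    n = len(feeds)
    dist = lambda a, b: abs(risks[a] - risks[b])                          # Eq. 4.1
    expected = sum(risks.values()) / n                                    # Eq. 4.2
    # Maximum distance between any two feeds (guard against all-equal risks).
    max_dist = max(dist(a, b) for a, b in combinations(feeds, 2)) or 1
    reliability = {}
    for f in feeds:
        error = abs(risks[f] - expected)                                  # Eq. 4.3
        independence = sum(dist(f, g) for g in feeds if g != f) / (n - 1)  # Eq. 4.4
        weight = 1 - independence / max_dist                              # Eq. 4.5
        reliability[f] = (1 - error / max_dist) * weight                  # Eq. 4.6
    return reliability

def target_risk(risks, reliability):
    """Equation 4.7: reliability-weighted risk attached to the queried target."""
    return (sum(risks[f] * reliability[f] for f in risks)
            / sum(reliability.values()))

# Illustrative values only:
risks = {"feed_a": 0.2, "feed_b": 1.0, "feed_c": 0.1, "feed_d": 0.4}
print(target_risk(risks, cti_reliabilities(risks)))
```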

4.3.2 Geolocation

The geolocation module uses many IP geolocation services that either return GPS coordinates (i.e., two numbers – latitude and longitude) or an address that is resolved to coordinates using the OpenStreetMap API [152]. The coordinates provide a convenient way to compare results from different services and estimate the reliability. This section describes how the reliability of each geolocation service is computed and how the results for a particular IP address are evaluated. Table 4.4 describes the symbols used in the equations.

Symbol    Description
n         number of geolocation sites
S_n       n-th geolocation site
lat_n     latitude resolved by the n-th site
lon_n     longitude resolved by the n-th site
m         number of clusters
C_m       m-th cluster
p         set of sites in the m-th cluster

Table 4.4: Description of symbols used in the equations.

Firstly, Equation 4.8 computes the distance between two sets of coordinates resolved by sites S_i and S_j. As these coordinates correspond to a point in a two-dimensional Euclidean space, it is calculated the same way as the Euclidean distance.

\[ \mathrm{dist}(S_i, S_j) = \sqrt{(lat_i - lat_j)^2 + (lon_i - lon_j)^2} \tag{4.8} \]

Equations 4.9 and 4.10 compute the expected coordinates, which are equal to the average of the coordinates resolved by all sites.

\[ lat_{\mathit{expected}} = \frac{\sum_{k=1}^{n} lat_k}{n} \tag{4.9} \]

\[ lon_{\mathit{expected}} = \frac{\sum_{k=1}^{n} lon_k}{n} \tag{4.10} \]

The error of the coordinates resolved by site S_i is computed as the distance to the expected value S_expected, as shown in Equation 4.11. The expected value is defined by the expected coordinates lat_expected and lon_expected.

\[ \mathrm{error}(S_i) = \mathrm{dist}(S_i, S_{\mathit{expected}}) \tag{4.11} \]

The independence of geolocation site S_i is computed using Equation 4.12. It is equal to the average of distances between the coordinates resolved by S_i and the coordinates resolved by all the other sites.

\[ \mathrm{independence}(S_i) = \frac{\sum_{k=1}^{n} \mathrm{dist}(S_i, S_k)}{n - 1} \tag{4.12} \]

The weight of site S_i shown in Equation 4.13 is equal to one minus its independence divided by the maximum distance between any two sites.

\[ \mathrm{weight}(S_i) = 1 - \frac{\mathrm{independence}(S_i)}{\max_{j,k=1}^{n} \mathrm{dist}(S_j, S_k)} \tag{4.13} \]

Finally, Equation 4.14 computes the reliability of site S_i. Just like the reliability of CTI feeds, it is inversely proportional to the error and proportional to the weight, and the result is a fraction between 0 and 1.

\[ \mathrm{reliability}(S_i) = \left( 1 - \frac{\mathrm{error}(S_i)}{\max_{j,k=1}^{n} \mathrm{dist}(S_j, S_k)} \right) \mathrm{weight}(S_i) \tag{4.14} \]

When coordinates from all sites are collected, they are clustered using a clustering algorithm with a threshold defined in the configuration file (the default value is set to 0.2). The clusters are mutually exclusive, and the number of clusters m can be anywhere between 1 and n. For each cluster, the expected location is computed, and the reliability of the location of cluster C_i is equal to the sum of the reliabilities of the sites that constitute the cluster divided by the sum of all reliabilities, as shown in Equation 4.15. In the end, the module returns one or more pairs of coordinates and their reliabilities.

\[ \mathrm{reliability}(C_i) = \frac{\sum_{S_j \in C_i} \mathrm{reliability}(S_j)}{\sum_{k=1}^{n} \mathrm{reliability}(S_k)} \tag{4.15} \]
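A simplified sketch of the clustering step and Equation 4.15; the greedy single-linkage clustering shown here and the threshold handling are illustrative and may differ from the module's actual algorithm.

```python
import math

def cluster_coordinates(coords, threshold=0.2):
    """Greedy single-linkage clustering of (lat, lon) pairs keyed by site name:
    a point joins the first cluster that has a member within the threshold."""
    clusters = []
    for site, point in coords.items():
        for cluster in clusters:
            if any(math.dist(point, other) <= threshold for _, other in cluster):
                cluster.append((site, point))
                break
        else:
            clusters.append([(site, point)])
    return clusters

def cluster_reliability(cluster, reliability):
    """Equation 4.15: share of the total site reliability contributed by a cluster."""
    return (sum(reliability[site] for site, _ in cluster)
            / sum(reliability.values()))
```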

4.3.3 Port Discovery

The port_discovery module checks which ports are open on the given IP address. As the port numbers are categorical values that can be easily compared between different services, they can be used for reliability estimation. The results where none of the services returned any open ports are also added to the calculation since these can be considered as equal results by all services. This section explains how the reliability of each service is computed and how the results for a particular IP address are evaluated. Table 4.5 describes the symbols used in the equations. Equation 4.16 calculates the distance between the results from P_i and P_j. The distance is equal to the number of ports that were returned by only one of the services, i.e., the cardinality of the symmetric difference of R_i and R_j.

\[ \mathrm{dist}(P_i, P_j) = \lvert R_i \,\triangle\, R_j \rvert \tag{4.16} \]

The expected set R_expected is the set of ports that appear in more than half of the sets R_i, i.e., more than half of the port discovery services consider these ports open, as shown in Equation 4.17.

\[ R_{\mathit{expected}} = \left\{ p_i \;\middle|\; q_i > \frac{n}{2} \right\} \tag{4.17} \]

47 4. Pantomath: Tool for Automated OSINT Collection

Symbol    Description
n         number of port discovery services
P_n       n-th port discovery service
m         total number of resolved ports
p_m       m-th resolved port
R_n       set of ports resolved by the n-th service
Q_m       set of services that resolved the m-th port
q_m       number of services that resolved the m-th port

Table 4.5: Description of symbols used in the equations.

The error of service P_i is computed with Equation 4.18, and it is equal to the distance between the service and the expected value P_expected defined by the set R_expected.

\[ \mathrm{error}(P_i) = \mathrm{dist}(P_i, P_{\mathit{expected}}) \tag{4.18} \]

The independence computed by Equation 4.19 is equal to the average of distances between service P_i and all other services.

\[ \mathrm{independence}(P_i) = \frac{\sum_{k=1}^{n} \mathrm{dist}(P_i, P_k)}{n - 1} \tag{4.19} \]

Equation 4.20 computes the weight of service P_i. It is equal to one minus its independence divided by the maximum distance between any two services.

\[ \mathrm{weight}(P_i) = 1 - \frac{\mathrm{independence}(P_i)}{\max_{j,k=1}^{n} \mathrm{dist}(P_j, P_k)} \tag{4.20} \]

The reliability of service P_i is calculated from the error and the weight, resulting in a fraction between 0 and 1.

\[ \mathrm{reliability}(P_i) = \left( 1 - \frac{\mathrm{error}(P_i)}{\max_{j,k=1}^{n} \mathrm{dist}(P_j, P_k)} \right) \mathrm{weight}(P_i) \tag{4.21} \]

Once the ports from all services are collected, each port is evaluated in terms of its reliability. The reliability is computed as the sum of the reliabilities of the services that returned the port divided by the sum of all reliabilities, as shown in Equation 4.22. The results returned by the module contain the set of ports and the respective reliabilities.

\[ \mathrm{reliability}(p_i) = \frac{\sum_{P_j \in Q_i} \mathrm{reliability}(P_j)}{\sum_{k=1}^{n} \mathrm{reliability}(P_k)} \tag{4.22} \]
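The port-based variant of the model can be sketched the same way; the service names and port sets would come from the implemented services, and the helper below is only an illustration of Equations 4.16–4.22.

```python
from itertools import combinations

def port_reliabilities(ports):
    """Equations 4.16-4.21 for a dict mapping each service to its set of open ports."""
    services = list(ports)
    n = len(services)
    dist = lambda a, b: len(ports[a] ^ ports[b])                          # Eq. 4.16
    counts = {}
    for open_ports in ports.values():
        for p in open_ports:
            counts[p] = counts.get(p, 0) + 1
    expected = {p for p, q in counts.items() if q > n / 2}                # Eq. 4.17
    max_dist = max(dist(a, b) for a, b in combinations(services, 2)) or 1
    reliability = {}
    for s in services:
        error = len(ports[s] ^ expected)                                  # Eq. 4.18
        independence = sum(dist(s, t) for t in services if t != s) / (n - 1)  # Eq. 4.19
        weight = 1 - independence / max_dist                              # Eq. 4.20
        reliability[s] = (1 - error / max_dist) * weight                  # Eq. 4.21
    return reliability

def port_reliability(port, ports, reliability):
    """Equation 4.22: reliability of a single resolved port across all services."""
    supporters = [s for s in ports if port in ports[s]]
    return sum(reliability[s] for s in supporters) / sum(reliability.values())
```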

5 Evaluation and Discussion

This chapter evaluates how Pantomath tackles some of the challenges of the collection of OSINT. The reliability estimation model defined in Section 4.3 is evaluated, and the measurements are presented and discussed in Section 5.1. Section 5.2 compares Pantomath to three existing OSINT automation tools and states the main advantages and disadvantages of each. Section 5.3 outlines some of the extensions that could be added to Pantomath and the improvements that could be made to the existing functionality.

5.1 Evaluation of Reliability Estimation

The reliability estimation model defined in Section 4.3 compares all sources used in each module to compute their reliability, and the values are continuously updated when new targets are queried. The estimates reflect the precision of each source for the type of data that was used for the estimation, and more data generally yields better accuracy. The model was evaluated using large datasets to provide base reliabilities for the user and measure how different sources perform.

5.1.1 Cyber Threat Intelligence

The reliability of the CTI feeds used in the threat_intel module was evaluated using a dataset generated from various blacklists used in the blacklists module. As the module uses both IP addresses and domain names as targets, the dataset contained 500 values of each. The whole dataset was queried, and the values defined in Section 4.3.1 were computed. Table 5.1 shows the measured values, and Table 5.2 shows the differences between all pairs of feeds to evaluate how similar they are. The biggest challenge when comparing different CTI feeds is the diversity of the results they return. Each feed uses different metrics for the risk value, which makes the feeds hard to compare. ThreatCrowd evaluates the risk in only three categories, whereas VirusTotal gives highly varied results. The risk values and the amount of information each feed returns also change significantly between queries for IP addresses and for domain names. VirusTotal and AlienVault evaluated many targets with zero risk, meaning they have a small average distance between them, as shown in Table 5.2. On the other hand, ThreatCrowd often gave the highest possible risk value, which significantly increased its independence. That resulted in ThreatCrowd having the lowest reliability of all the feeds, and AlienVault and VirusTotal having high reliabilities.


Website        Independence   Error    Weight   Reliability
AlienVault     0.1827         0.1325   0.6128   0.4583
ThreatCrowd    0.3402         0.2517   0.3601   0.2649
VirusTotal     0.1768         0.1177   0.6206   0.4773
MetaDefender   0.2464         0.1695   0.4059   0.3136

Table 5.1: Measurements of the values defined in Section 4.3.1.

Website        AlienVault   ThreatCrowd   VirusTotal   MetaDefender
AlienVault     -            0.3140        0.0729       0.1612
ThreatCrowd    0.3140       -             0.2929       0.4135
VirusTotal     0.0729       0.2929        -            0.1645
MetaDefender   0.1612       0.4135        0.1645       -

Table 5.2: The average distance between all feeds. The lower the value, the closer the risk values of the two feeds.

The model by Gong et al. [56] compares many different features to estimate the reliability of the CTI feeds, such as hashes of malicious files associated with the target or IP addresses used in the same attack. As these values represent distinct entities that are much more comparable, the comparison of feeds using this model is more methodical and better illustrates the differences between them. By using multiple features, the reliability estimation in the threat_intel module would be more reliable than the current one. However, implementing such a model is non-trivial and requires a lot of parsing to bring various pieces of information together.

5.1.2 Geolocation

The reliability of websites used in the geolocation module was estimated using a dataset of 1000 randomly generated IP addresses. All multicast, reserved, private, or loop-back addresses were filtered out. The IP addresses were resolved by all geolocation services, and the values defined in Section 4.3.2 were calculated. Table 5.3 shows the measured values and the null ratio, i.e., the percentage of IP addresses where no geolocation was resolved. The null ratio of the Utrace website was 90% due to an inconsistent operation. To determine how dependent the websites are between each other, the pairwise distance of these websites was generated and is shown in Table 5.4. A group of 5 websites all have relatively small distances between each other – FreeGeoIP, IPdata, Geoplugin, IPlocate, and Maxmind. These are

50 5. Evaluation and Discussion

Website Null ratio Independence Error Weight Reliability Adjusted rel. Known FreeGeoIP 0.3% 7.2528 5.8982 0.6520 0.5056 0.4822 5.5557 IPdata 0.2% 6.8334 5.4831 0.6651 0.5208 - 1.3827 Extreme-IP 0.5% 10.2773 8.8586 0.5923 0.4419 0.4493 8.9512 Geoplugin 0.4% 7.1811 5.8395 0.6539 0.5068 0.4846 10.5209 IPregistry 0.1% 6.9512 5.5843 0.6732 0.5243 0.5245 1.7526 IPlocate 0.2% 6.8296 5.4639 0.6581 0.5125 0.4879 4.6368 IPinfo 0.2% 7.7042 6.2717 0.6090 0.4582 0.4642 0.7729 IPwhois 0.2% 10.8974 9.2205 0.5631 0.4158 0.4231 9.0029 IPify 0.1% 7.4466 6.0745 0.6340 0.4818 0.4926 4.8343 IP-API 0% 7.1182 5.7016 0.6669 0.5182 0.5200 0.7706 IPgeoloc 0% 9.4214 7.8840 0.5494 0.3997 0.4110 8.7674 WhoisXML 0% 7.4479 6.0814 0.6174 0.4693 0.4684 3.3321 Maxmind 0.2% 6.8275 5.4938 0.6641 0.5195 0.4878 1.1600 Utrace 90% 5.9583 5.1341 0.6182 0.4658 - 37.4484

Table 5.3: Measurements of the values defined in Section 4.3.2. Column Ad- justed rel. contains reliabilities when IPdata and Utrace are removed from the computation. Column Known represents the average distance between the results given by each service and the known locations for a particular IP address. The lower are the values in this column, the closer the results are to the known locations. shown in bold in Table 5.4. The similarity of the results is observable in specific geolocations, where the coordinates are exactly the same for many IP addresses. One of the reasons this is happening might be the fact that Maxmind is a popular service providing a weekly updated version of their database for free. If that is the case, the small differences could be caused by the databases of these websites not being synchronized for all values. However, the relationships between them would need to be analyzed in more detail to find any dependencies. One of the pairs – IPdata and Maxmind – has an average distance very close to zero, meaning that the results from these websites were virtually the same. To better reflect the reliabilities, IPdata and Utrace (due to a high null ratio) were removed from the computations, and the newly computed values are shown in column Adjusted rel. in Table 5.3. These two websites are given reliability of zero in the configuration file. Arguably, the remain- ing websites that have similar results to Maxmind could be omitted from the evaluation as well, but the similarities were not as evident as with IP- data.

Table 5.4: The average distance between all pairs of services. The lower the value, the closer the results from the two services. The values in bold show the pairs of websites that belong to the group of five websites with small distances between each other.
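The pairwise values in Table 5.4 are averages, over the queried IP addresses, of the distance between the coordinates returned by two services. A minimal sketch of such a comparison, assuming plain great-circle distances in kilometres and a hypothetical result format (the exact metric used by Pantomath is the one defined in Section 4.3.2, so the numbers will generally differ):

import math
from itertools import combinations

def haversine_km(p1, p2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def pairwise_distances(results):
    """Average distance between every pair of services.

    `results` maps a service name to a list of (lat, lon) tuples or None
    (one entry per queried IP address, None when the lookup failed).
    Addresses unresolved by either service of a pair are skipped.
    """
    matrix = {}
    for (s1, r1), (s2, r2) in combinations(results.items(), 2):
        dists = [haversine_km(a, b) for a, b in zip(r1, r2) if a and b]
        matrix[(s1, s2)] = sum(dists) / len(dists) if dists else None
    return matrix

# Hypothetical lookups of two IP addresses by three services:
results = {
    "service_a": [(49.19, 16.61), (50.08, 14.44)],
    "service_b": [(49.20, 16.60), None],
    "service_c": [(48.15, 17.11), (50.10, 14.42)],
}
print(pairwise_distances(results))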


A dataset of 1000 IP addresses with known locations was created to evaluate the precision of this model. The dataset contains addresses from IP ranges owned by Amazon, Google, NordVPN, and Masaryk University. The geographical locations of the ranges owned by Amazon, Google, and NordVPN were collected from the official websites [190] [191] [192]. The average distance between the results from each website and the known locations is shown in column Known in Table 5.3. Overall, these values correspond to the estimated reliabilities quite well. A few of the websites performed better in this evaluation than their estimated reliability suggested, and vice versa. Similarly, the changes in the adjusted reliability seem to conform to the accuracy when resolving the known locations. However, it is important to note that some websites might use similar datasets to resolve the geolocation, which would distort the measured distances.
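The Known column can be reproduced by matching each address to the range it belongs to and averaging the distance between a service's answer and the range's published location. A minimal sketch under assumed data structures – the ranges, coordinates, and the sample address below are placeholders, and haversine_km is the helper from the previous sketch:

import ipaddress

# Hypothetical ground truth: network range -> (lat, lon) of its published site.
KNOWN_RANGES = {
    ipaddress.ip_network("147.251.0.0/16"): (49.21, 16.60),   # placeholder campus coordinates
    ipaddress.ip_network("203.0.113.0/24"): (50.08, 14.44),   # documentation range, placeholder
}

def known_location(ip):
    """Return the ground-truth coordinates for an IP, or None if it is outside the dataset."""
    addr = ipaddress.ip_address(ip)
    for net, coords in KNOWN_RANGES.items():
        if addr in net:
            return coords
    return None

def average_error(lookups):
    """Average distance between one service's answers and the known locations.

    `lookups` is a list of (ip, (lat, lon)) pairs returned by the service.
    """
    errors = []
    for ip, coords in lookups:
        truth = known_location(ip)
        if truth is not None and coords is not None:
            errors.append(haversine_km(coords, truth))
    return sum(errors) / len(errors) if errors else None

# Hypothetical answers of one service:
print(average_error([("147.251.5.239", (49.20, 16.61))]))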

5.1.3 Port Discovery

The reliability of the services used in the port_discovery module was evaluated using a dataset of around 700 IP addresses. Approximately half of the addresses were collected manually from various sources to contain many open ports, and the other half of the dataset contained randomly generated IP addresses, which mostly have no open ports. All metrics defined in Section 4.3.3 were calculated, and the results are shown in Table 5.5. The average distance between all pairs of services is shown in Table 5.6.

Website  Null ratio  Independence  Error   Weight  Reliability
Shodan   54.2%       4.2371        0.7518  0.5101  0.4813
Censys   61.6%       4.7873        3.6928  0.6724  0.6377
Spyse    47.7%       3.4129        0.7341  0.6889  0.6590

Table 5.5: Measurements of the values defined in Section 4.3.3.

Website  Shodan  Censys  Spyse
Shodan   -       5.6115  2.8626
Censys   5.6115  -       3.9631
Spyse    2.8626  3.9631  -

Table 5.6: The average distance between all services. The lower the value, the closer the results from the two services.


As the dataset is not entirely random, it is not representative of the typical distribution of open ports. With the randomly generated addresses, the differences between the reliabilities decreased compared to the addresses where multiple ports were expected to be open. A dataset containing only random addresses would result in estimates that better reflect average queries by the user. The fact that Shodan has the lowest reliability of the three services does not necessarily mean it is the least precise in reality. With only three services used for the comparison and a relatively small dataset, IP addresses with many open ports and other outliers significantly affect the calculations. However, the fact that the average distance between the results from Spyse and Censys is not the lowest suggests these services do not exhibit any dependence, which could otherwise indicate, for example, that the data from both services are outdated.

To evaluate the precision of the reliability estimates similarly to the geolocation module, the open ports could be collected locally using Nmap [114] or MASSCAN [115]. Another option would be to create a dataset of IP addresses where some ports are verifiably open. The estimation could also be extended with a comparison of other data returned by the different services. All three services check whether some well-known software is running on a port. The results also contain information about the operating system, the transport protocol, and many other features.
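A minimal sketch of the local verification suggested above, assuming the nmap binary is installed and using its grepable output; the address and port set are placeholders, and such scans should only target hosts one is allowed to probe:

import subprocess

def nmap_open_ports(ip, ports):
    """Check which of the given ports nmap reports as open on one host.

    Runs the nmap binary with grepable output ("-oG -") and parses the
    "Ports:" line. Bulk scanning of third-party addresses needs permission
    and sensible rate limits.
    """
    port_list = ",".join(str(p) for p in ports)
    out = subprocess.run(
        ["nmap", "-p", port_list, "--open", "-oG", "-", ip],
        capture_output=True, text=True, check=True,
    ).stdout
    open_ports = set()
    for line in out.splitlines():
        if "Ports:" in line:
            for entry in line.split("Ports:")[1].split(","):
                number, state = entry.strip().split("/")[:2]
                if state == "open":
                    open_ports.add(int(number))
    return open_ports

# Hypothetical comparison with a service's answer for one address:
reported = {22, 80, 443}                               # ports claimed open by a discovery service
verified = nmap_open_ports("198.51.100.7", reported)   # placeholder address
print(f"confirmed: {verified}, unconfirmed: {reported - verified}")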

5.2 Comparison with Existing Tools

Section 3.2 discusses some existing OSINT automation tools and goes into more detail about the three most notable ones – Recon-ng, SpiderFoot, and Maltego. Table 5.7 summarizes the main differences between these tools and Pantomath. Like Pantomath, all of these tools are designed to be modular to allow a straightforward integration of new sources. The most well-known sources are implemented in all the tools, meaning that a significant portion of the results will be the same.

As opposed to SpiderFoot and Maltego, Recon-ng is entirely open-source, and new modules can be added to the Recon-ng marketplace by any developer. The free version of SpiderFoot is open-source as well, but it lacks a great deal of the functionality offered in the paid version (SpiderFoot HX). Maltego provides a free community version that lacks many functions and is limited in terms of the number of queries and the integration of additional modules. Additionally, it has to run on Maltego's cloud infrastructure, meaning that all the traffic has to go through their servers. The most significant advantage of Maltego is its state-of-the-art visualization capabilities.


Feature                 Recon-ng  SpiderFoot  Maltego  Pantomath
Modular                 X         X           X        X
CLI                     X         X                    X
GUI                     X         X           X
Visualization                     X           X
Number of modules       ≈ 100     ≈ 200       56       21
Reliability estimation                                 X
Proxy integration       X         X           X        X
Tor integration                   paid                 X
Offline mode                                           X

Table 5.7: Comparison of Pantomath with other OSINT automation tools.

The results from SpiderFoot are visualized in a simple graph as well, but this representation only provides a basic overview of the discovered targets. Combined with its GUI, the visualization in Maltego is very convenient and significantly improves the user experience.

SpiderFoot has the upper hand in the number of implemented modules, which is much larger than in any other tool, and Pantomath is far behind in this regard. However, it is important to note that many modules in the existing tools overlap in terms of the type of results, whereas Pantomath has a few modules incorporating many services that provide the same type of information. For example, the geolocation module queries 14 different IP geolocation services, the blacklists module uses 22 blacklists from various sources, and the threat_intel and port_discovery modules use multiple services as well. All these services are implemented as separate modules in the existing tools. If all services in Pantomath were separated, the total number of modules in Pantomath would be around 50.

Arguably, the two biggest advantages of Pantomath compared to the existing tools are the different modes of operation and the reliability estimation. All the tools can use proxy servers for queries, but only the paid version of SpiderFoot integrates the Tor network. The stealth mode provides better anonymity guarantees compared to the overt mode with no significant drawbacks. The only consideration when using the stealth mode is the use of API keys that can potentially associate requests sent to a service with the user. None of the tools provides functionality similar to the offline mode, where queries can be performed with no access to the Internet.

Even though the selection of available data in the offline mode is significantly smaller than in the overt or stealth modes, it can be extended with new sources in the future. The specific improvements to the offline mode are detailed in Section 5.3.

The reliability estimation aims to tackle possibly the biggest challenge of OSINT – validation of the acquired data. As the final evaluation of gathered data will eventually require a person with some knowledge about the context of the investigation, the goal of the reliability estimation in Pantomath was to allow the user to make more informed decisions rather than to provide any guarantees about the correctness of the information. This concept is implemented in only a few modules, but the approach can be applied to virtually any existing module. The continuous updates of the estimates also allow adjustments for the current situation and the type of data queried in Pantomath.

5.3 Future Work

The goals of Pantomath were to provide a framework that incorporates all the functionality needed for the automated collection of OSINT, to implement some noteworthy modules, and to lay the groundwork for addressing some of the main challenges. Many possible improvements and extensions could be added to the implementation. Firstly, Pantomath currently considers three identifiers as potential targets – IP address in version 4, domain name, and e-mail address. Other identifiers could be added, e.g., IP address in version 6, username, real name, phone number, or Bitcoin address. In general, any keyword that identifies an individual or an organization might be used as a target.

Parallelization of the search queries would significantly improve the performance of the tool. All modules are currently called sequentially, even though they operate independently, including their connections to the database. Another significant slowdown is caused by the timeouts used by the modules to prevent overwhelming the queried services; this waiting time could be hidden if the remaining modules continued their activity in the meantime. The tool's output is currently only partially filtered and given to the user as a whole, including many details returned by the modules. To allow for a more straightforward analysis, different detail levels could be specified, with the modules returning the results accordingly. The configuration file contains a few such options for the results produced by the crtsh and urlscan modules.
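A minimal sketch of the parallelization idea mentioned above, assuming a hypothetical module interface (one callable per module); this is not Pantomath's actual module API:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_modules_concurrently(modules, target, max_workers=8):
    """Run independent OSINT modules in parallel instead of sequentially.

    `modules` maps a module name to a callable taking the target and
    returning its results. Each module keeps its own rate limiting towards
    the services it queries, so waiting in one module no longer blocks the
    others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, target): name for name, fn in modules.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:           # a failing module must not stop the rest
                errors[name] = str(exc)
    return results, errors

# Hypothetical modules with different per-service delays:
def slow_module(target):
    time.sleep(2)                               # stands in for a polite rate limit
    return {"module": "slow", "target": target}

def fast_module(target):
    return {"module": "fast", "target": target}

print(run_modules_concurrently({"slow": slow_module, "fast": fast_module}, "muni.cz"))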

One of the most significant disadvantages of Pantomath compared to existing OSINT automation tools is the lack of a graphical user interface. The currently implemented simple command-line interface does not provide commands to change the tool's settings conveniently or update API keys; these have to be changed manually in the configuration file. Additionally, a GUI would help make the tool much more interactive. For example, the selection of modules to use could be prompted for each query separately, the decision of which targets to use for follow-up searches could be entirely in the user's hands, or the results could be easily filtered and displayed. Pantomath also does not have any visualization capabilities like SpiderFoot or Maltego. By creating a wrapper, Pantomath could be used as a module in Maltego, meaning that the state-of-the-art visualization capabilities of Maltego could be utilized.

Another significant drawback of Pantomath is the lower number of implemented modules. As discussed in detail in Chapter 3, many possible sources could be utilized. The following are the more notable ones that would fit well in Pantomath:

• BuiltWith [111] is a useful service that provides information about the technology stack of a website, relationships between different websites, redirects of a website, and so forth. These results could help with the reliability estimation in the port_discovery module. However, the price of this service is very high, starting at $295 per month.

• Traditional and dark web search engines are a powerful source of OSINT when used correctly, e.g., by utilizing Google Dorking [119] or other techniques. With a direct search of the target, the search engines could extract valuable information.

• Many services provide information about Bitcoin addresses, e.g., the balance of the address, whether the address was used by a scammer or a hacker, and much more.

• For user names and real names as targets, websites such as Pipl [84], CheckUserNames [88], or Social Searcher [95] could be utilized.

• The Wikileaks [11] website provides access to various breaches, which would help search for additional information about e-mail addresses and domains that were breached in the past.

• To collect offline data and add another source for reliability estimation in the port_discovery module, Nmap [114] or MASSCAN [115] can be used. However, actively testing open ports in bulk would require additional safeguards to avoid being blocked by the owners of the scanned IP ranges.

The selection of data in the offline mode could be improved either by implementing the existing services directly in Pantomath and collecting the data locally, or by using web crawling and state-of-the-art algorithms to extract valuable information. However, as discussed in Section 4.1, these are non-trivial tasks going against the main focus of Pantomath. To get data from online modules at least partially, many targets that might be needed in the future could be collected in bulk and saved in the database. This approach would require some prior knowledge about the potential targets. Additionally, data breaches and pastebins could be acquired, but these are generally hard to obtain and could be considered unethical or even illegal. WhoisXMLAPI [162] offers a complete download of their whois and reverse whois databases.

The reliability is estimated in three modules – geolocation, port_discovery, and threat_intel. The way it is implemented in these modules could be used as a blueprint for other modules where it makes sense. For example, the blacklists module downloads many blacklists, but these are not compared because they have different criteria for adding new values. Assuming there were multiple comparable blacklists, e.g., DNS-based blacklists, a standard type of blacklist, they could be used for reliability estimation. In general, the reliability of any piece of information can be evaluated as long as multiple sources provide this information. The port_discovery module could be extended to compare the software and the transport protocol used on given ports, the network owner, and other values besides just the open ports.

Another improvement in terms of reliability estimation would be the comparison of data between different modules. All the estimation is currently performed separately in each module, but many modules partially overlap in the data they provide. For example, the certificates acquired from the passive_ssl module are easily comparable with the certificates from the crtsh module. Likewise, data from the passive_dns module that are also returned by the threat_intel module could support the information obtained from the dns module. Part of the port_discovery module data could be verified using data collected from BuiltWith to provide a more complex reliability estimation. Both the port discovery services and BuiltWith give information on the used software, but for different targets – IP addresses and domains. However, this evaluation would require a non-trivial algorithm matching the diverse results together.
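As a sketch of such a cross-module comparison, the overlap of certificate fingerprints reported by two modules could feed into the estimate; the data format below is assumed, not the actual output of the passive_ssl and crtsh modules:

def certificate_overlap(passive_ssl_certs, crtsh_certs):
    """Compare certificate fingerprints reported by two modules for one domain.

    Both arguments are collections of certificate fingerprints (e.g. SHA-1
    hex strings). Fingerprints seen by both modules are mutually confirmed,
    and the overlap ratio can contribute to a cross-module reliability score.
    """
    a = {fp.lower() for fp in passive_ssl_certs}
    b = {fp.lower() for fp in crtsh_certs}
    confirmed = a & b
    ratio = len(confirmed) / len(a | b) if (a or b) else 0.0
    return confirmed, ratio

confirmed, ratio = certificate_overlap(
    ["1234abcd", "deadbeef"],   # hypothetical passive_ssl fingerprints
    ["1234ABCD", "feedface"],   # hypothetical crtsh fingerprints
)
print(confirmed, ratio)         # {'1234abcd'} and 1/3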

6 Conclusions

The main goal of this thesis was to implement a tool for an automated collection of Open Source Intelligence. Pantomath is a highly modular framework providing all the necessary functionality for data collection, processing, and storage, with a straightforward way to add new data sources. The selection of the implemented modules contains all essential services providing information about IP addresses, domain names, and e-mail addresses. These include port discovery services, IP geolocation websites, cyber threat intelligence feeds, blacklists, whois data, and much more. The framework can be used through a command-line interface or a simple API, with the possibility to add other interfaces such as a web interface.

Pantomath offers three modes of operation with varying anonymity guarantees. The overt mode represents the regular operation of the tool that uses all modules and sends requests directly to the implemented sources. The stealth mode routes all queries through the Tor network to provide an intermediary between the user and the Internet. The offline mode does not require an Internet connection as the target is looked up in a database of preprocessed data, allowing the users to query any targets completely anonymously. The selection of data in the offline mode is smaller than in the overt or stealth modes, but additional data can be incorporated in the future.

The reliability estimation model attempts to evaluate the reliability of the collected data, which is one of the biggest challenges of OSINT. The model calculates the reliability of each source used in a module by comparing the results they return when specific targets are queried. The reliability estimation is currently implemented in three modules, but the approach can be applied to other modules as well. The reliabilities of the sources in all three modules were estimated using datasets of various targets, and the results can serve as the base for future usage, as the estimates are updated continuously.

There are many possible extensions and additional sources that can be added to the framework. The reliability estimation model can serve as a blueprint for other modules where the results from different sources are comparable. The modes of operation explore how higher anonymity requirements affect the usability and the information one can find when the confidentiality of the targets is critical. Pantomath lags behind existing OSINT automation tools in the user interface and the number of implemented sources, but it builds the foundations for new concepts that are not yet explored in the existing tools.

Bibliography

[1] B. Schneier. Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World. W. W. Norton & Company, 2016. isbn: 978-0393352177. url: https://www.schneier.com/books/data_and_goliath/.
[2] A. Hulnick. “The Downside of Open Source Intelligence”. In: International Journal of Intelligence and CounterIntelligence 15 (Nov. 2002), pp. 565–579. doi: 10.1080/08850600290101767.
[3] S. Gibson. “Open source intelligence”. In: The RUSI Journal 149.1 (2004), pp. 16–22. doi: 10.1080/03071840408522977. url: https://doi.org/10.1080/03071840408522977.
[4] Shodan. Shodan. [online], cit. [2020-7-10]. url: https://www.shodan.io.
[5] T. Fingar. Reducing Uncertainty: Intelligence Analysis and National Security. Stanford University Press, 2011. isbn: 9780804775946. url: https://books.google.cz/books?id=wmakl6eGkwYC.
[6] L. Johnson. Handbook of Intelligence Studies. Taylor & Francis, 2007. isbn: 9781135986889. url: https://books.google.cz/books?id=U2yUAgAAQBAJ.
[7] C. Burke. Freeing knowledge, telling secrets: Open source intelligence and development. Bond University, 2007. url: https://research.bond.edu.au/en/publications/freeing-knowledge-telling-secrets-open-sourceintelligence-and-dev.
[8] C. Hobbs, M. Moran, and D. Salisbury. Open Source Intelligence in the Twenty-First Century. Palgrave Macmillan, London, 2014. url: https://link.springer.com/book/10.1057/9781137353320.
[9] K. J. Riley et al. State and Local Intelligence in the War on Terrorism. RAND Corporation, 2005. isbn: 0-8330-3859-1. url: https://www.rand.org/pubs/monographs/MG394.html.
[10] Intelligence Community Information Sharing Executive. U.S. National Intelligence: An Overview. Tech. rep. 2013.
[11] J. Assange. WikiLeaks. [online], cit. [2020-7-30]. url: https://wikileaks.org.
[12] H. Gibson. “Acquisition and Preparation of Data for OSINT Investigations”. In: Jan. 2016, pp. 69–93. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_6.


[13] C. Perez and R. Germon. “Chapter 7 - Graph Creation and Anal- ysis for Linking Actors: Application to Social Data”. In: Automat- ing Open Source Intelligence. Ed. by R. Layton and P. A. Watters. Boston: Syngress, 2016, pp. 103 –129. isbn: 978-0-12-802916-9. doi: https : / / doi . org / 10 . 1016 / B978 - 0 - 12 - 802916 - 9 . 00007 - 5. url: http: // www. sciencedirect.com /science /article/ pii/ B9780128029169000075. [14] P. A. Watters. “Chapter 2 - Named Entity Resolution in Social Me- dia”. In: Automating Open Source Intelligence. Ed. by R. Layton and P. A. Watters. Boston: Syngress, 2016, pp. 21 –36. isbn: 978-0-12- 802916-9. doi: https://doi.org/10.1016/B978- 0- 12- 802916- 9 . 00002 - 6. url: http : / / www . sciencedirect . com / science / article/pii/B9780128029169000026. [15] C. Jouis et al. “Next Generation Search Engines: Advanced Models for Information Retrieval”. In: Jan. 2012, pp. 344–370. doi: 10.4018/ 978-1-4666-0330-1. [16] T. Dokman and T. Ivanjko. “Open Source Intelligence (OSINT): is- sues and trends”. In: Jan. 2020. doi: 10.17234/INFUTURE.2019.23. [17] L. Cox. “Some Limitations of Risk = Threat x Vulnerability x Con- sequence for Risk Analysis of Terrorist Attacks”. In: Risk analysis : an official publication of the Society for Risk Analysis 28 (Nov. 2008), pp. 1749–61. doi: 10.1111/j.1539-6924.2008.01142.x. [18] J. Whang et al. “Scalable Data-Driven PageRank: Algorithms, Sys- tem Issues, and Lessons Learned”. In: Aug. 2015, pp. 438–450. isbn: 978-3-662-48095-3. doi: 10.1007/978-3-662-48096-0_34. [19] W. Song et al. “An effective query recommendation approach using semantic strategies for intelligent information retrieval”. In: Expert Systems with Applications: An International Journal 41 (Feb. 2014), pp. 366–372. doi: 10.1016/j.eswa.2013.07.052. [20] A. M. Ponder-Sutton. “Chapter 1 - The Automating of Open Source Intelligence”. In: Automating Open Source Intelligence. Ed. by R. Lay- ton and P. A. Watters. Boston: Syngress, 2016, pp. 1 –20. isbn: 978-0- 12-802916-9. doi: https://doi.org/10.1016/B978-0-12-802916- 9 . 00001 - 4. url: http : / / www . sciencedirect . com / science / article/pii/B9780128029169000014. [21] M. Kandias et al. “Which side are you on? A new Panopticon vs. privacy”. In: 2013 International Conference on Security and Cryptog- raphy (SECRYPT). 2013, pp. 1–13. isbn: 978-9-8975-8131-1.


[22] H. Bean. “Is open source intelligence an ethical issue?” In: Research in Social Problems and Public Policy 19 (Jan. 2011), pp. 385–402. doi: 10.1108/S0196-1152(2011)0000019024. [23] C. Kopp et al. “Chapter 8 - Ethical Considerations When Using On- line Datasets for Research Purposes”. In: Automating Open Source Intelligence. Ed. by R. Layton and P. A. Watters. Boston: Syngress, 2016, pp. 131 –157. isbn: 978-0-12-802916-9. doi: https://doi.org/ 10.1016/B978-0-12-802916-9.00008-7. url: http://www.scienc edirect.com/science/article/pii/B9780128029169000087. [24] J. Simola. “Privacy issues and critical infrastructure protection”. In: 2020. isbn: 9780128165942. doi: 10.1016/b978- 0- 12- 816203- 3. 00010-1. [25] A. Cavoukian. Privacy by Design – The 7 Foundational Principles. [online], cit. [2020-12-17]. 2010. url: https://www.ipc.on.ca/wp- content/uploads/Resources/7foundationalprinciples.pdf. [26] B.-J. Koops, J.-H. Hoepman, and R. Leenes. “Open-source intelli- gence and privacy by design”. In: Computer Law & Security Review 29 (Dec. 2013), 676–688. doi: 10.1016/j.clsr.2013.09.005. [27] P. Casanovas. “Cyber Warfare and Organised Crime. A Regulatory Model and Meta-Model for Open Source Intelligence (OSINT)”. In: Dec. 2017, pp. 139–167. isbn: 978-3-319-45299-9. doi: 10.1007/978- 3-319-45300-2_9. [28] J. Rajamäki and J. Simola. “How to apply privacy by design in OS- INT and big data analytics?” In: ECCWS 2019 - Proceedings of the 18th European Conference on Cyber Warfare and Security. June 2019, pp. 364–371. isbn: 9781912764280. [29] A. Gandomi and M. Haider. “Beyond the hype: Big data concepts, methods, and analytics”. In: International Journal of Information Management 35.2 (2015), pp. 137 –144. issn: 0268-4012. doi: https: //doi.org/10.1016/j.ijinfomgt.2014.10.007. url: http://www. sciencedirect.com/science/article/pii/S0268401214001066. [30] A. Powell and C. Haynes. “Social Media Data in Digital Forensics Investigations”. In: Jan. 2020, pp. 281–303. isbn: 978-3-030-23546-8. doi: 10.1007/978-3-030-23547-5_14. [31] G. Bello-Orgaz, J. J. Jung, and D. Camacho. “Social big data: Recent achievements and new challenges”. In: Information Fusion 28 (2016), pp. 45 –59. issn: 1566-2535. doi: https://doi.org/10.1016/j. inffus . 2015 . 08 . 005. url: http : / / www . sciencedirect . com / science/article/pii/S1566253515000780.


[32] G. Kalpakis et al. “OSINT and the Dark Web”. In: Jan. 2016, pp. 111–132. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_8.
[33] B. Nafziger. “Data Mining in the Dark: Darknet Intelligence Automation”. In: 2017.
[34] M. Schäfer et al. “BlackWidow: Monitoring the Dark Web for Cyber Security Information”. In: May 2019, pp. 1–21. doi: 10.23919/CYCON.2019.8756845.
[35] H. Chen. Dark Web - Exploring and Data Mining the Dark Side of the Web. Springer-Verlag New York, 2012. isbn: 978-1-4614-1557-2. url: https://www.springer.com/gp/book/9781461415565.
[36] J. Pastor-Galindo et al. “The not yet exploited goldmine of OSINT: Opportunities, open challenges and future trends”. In: IEEE Access PP (Jan. 2020), pp. 1–1. doi: 10.1109/ACCESS.2020.2965257.
[37] R. A. Best Jr. and A. Cumming. Open Source Intelligence (OSINT): Issues for Congress. Congressional Research Service, 2007. url: https://fas.org/sgp/crs/intel/RL34270.pdf.
[38] T. Day, H. Gibson, and S. Ramwell. “Fusion of OSINT and Non-OSINT Data”. In: Jan. 2016, pp. 133–152. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_9.
[39] R. Scrivens et al. “Searching for Extremist Content Online Using The Dark Crawler and Sentiment Analysis”. In: Aug. 2019, pp. 179–194. isbn: 978-1-78769-866-6. doi: 10.1108/s1521-613620190000024016.
[40] E. Susnea. “A Real-Time Social Media Monitoring System as an Open Source Intelligence (Osint) Platform for Early Warning in Crisis Situations”. In: International conference KNOWLEDGE-BASED ORGANIZATION 24 (June 2018), pp. 427–431. doi: 10.1515/kbo-2018-0127.
[41] L. Ball. “Automating social network analysis: A power tool for counter-terrorism”. In: Security Journal 29 (Feb. 2013). doi: 10.1057/sj.2013.3.
[42] M. Dawson, M. Lieble, and A. Adeboje. “Open Source Intelligence: Performing Data Mining and Link Analysis to Track Terrorist Activities”. In: Information Technology - New Generations. Ed. by S. Latifi. Cham: Springer International Publishing, 2018, pp. 159–163. isbn: 978-3-319-54978-1.
[43] S. Carruthers. Social Engineering - A Proactive Security. [online], cit. [2020-8-14]. 2018. url: https://www.mindpointgroup.com/wp-content/uploads/2018/08/Social-Engineering-Whitepaper-Part-Three-Phishing.pdf.


[44] M. Edwards et al. “Panning for gold: Automatically analysing on- line social engineering attack surfaces”. In: Computers & Security 69 (2017). Security Data Science and Cyber Threat Management, pp. 18 –34. issn: 0167-4048. doi: https://doi.org/10.1016/j.cose.2016. 12.013. url: http://www.sciencedirect.com/science/article/ pii/S0167404816301845. [45] D. Hayes and F. Cappa. “Open-source intelligence for risk assess- ment”. In: Business Horizons 61 (Mar. 2018). doi: 10 . 1016 / j . bushor.2018.02.001. [46] A. Cartagena et al. “Privacy Violating Opensource Intelligence Threat Evaluation Framework: A Security Assessment Framework For Crit- ical Infrastructure Owners”. In: Jan. 2020, pp. 0494–0499. doi: 10. 1109/CCWC47524.2020.9031172. [47] Y. Tanaka and S. Kashima. “SeedsMiner: Accurate URL Blacklist- Generation Based on Efficient OSINT Seed Collection”. In: Oct. 2019, pp. 250–255. isbn: 978-1-4503-6988-6. doi: 10.1145/3358695. 3361751. [48] D. Quick and K.-K. R. Choo. “Digital forensic intelligence: Data sub- sets and Open Source Intelligence (DFINT+OSINT): A timely and co- hesive mix”. In: Future Generation Computer Systems 78 (Dec. 2016). doi: 10.1016/j.future.2016.12.032. [49] I. Vacas, I. Medeiros, and N. Neves. “Detecting Network Threats using OSINT Knowledge-Based IDS”. In: 2018 14th European Dependable Computing Conference (EDCC). 2018, pp. 128–135. [50] S. Lee et al. “Managing Cyber Threat Intelligence in a Graph Database: Methods of Analyzing Intrusion Sets, Threat Actors, and Campaigns”. In: Jan. 2018, pp. 1–6. doi: 10.1109/PlatCon.2018. 8472752. [51] C. Best. “Web Mining for Open Source Intelligence”. In: 2008 12th International Conference Information Visualisation. 2008, pp. 321– 325. [52] F. Neri, C. Aliprandi, and F. Camillo. “Mining the Web to Monitor the Political Consensus”. In: May 2011, pp. 391–412. isbn: 978-3-7091- 0387-6. doi: 10.1007/978-3-7091-0388-3_19. [53] C. Fleisher. “Using Open Source Data in Developing Competitive and Market Intelligence”. In: European Journal of Marketing 42 (July 2008), pp. 852–866. doi: 10.1108/03090560810877196. [54] A. Magalhães and J. a. P. Magalhães. “TExtractor: An OSINT Tool to Extract and Analyse Audio/Video Content”. In: Innovation, En-


gineering and Entrepreneurship. Springer International Publishing, 2019, pp. 3–9. isbn: 978-3-319-91334-6. [55] P. Maciolek and G. Dobrowolski. “Cluo: Web-Scale Text Mining Sys- tem For Open Source Intelligence Purposes”. In: Computer Science 14 (Jan. 2013). doi: 10.7494/csci.2013.14.1.45. [56] S. Gong, J. Cho, and C. Lee. “A Reliability Comparison Method for OSINT Validity Analysis”. In: IEEE Transactions on Industrial Informatics 14.12 (2018), pp. 5428–5435. [57] D. Jurafsky and J. Martin. Speech and Language Processing: An Intro- duction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Vol. 2. Feb. 2008. [58] Google LLC. Google. [online], cit. [2020-8-27]. url: https://www. google.com. [59] Apple Inc. Apple Siri. [online], cit. [2020-8-27]. url: https://www. apple.com/siri/. [60] S. Noubours, A. Pritzkau, and U. Schade. “NLP as an essential ingre- dient of effective OSINT frameworks”. In: 2013 Military Communica- tions and Information Systems Conference. 2013, pp. 1–7. [61] R. Layton et al. “Indirect Information Linkage for OSINT through Authorship Analysis of Aliases”. In: vol. 7867. Apr. 2013. doi: 10. 1007/978-3-642-40319-4_4. [62] K. Li et al. “Security OSIF: Toward Automatic Discovery and Analy- sis of Event Based Cyber Threat Intelligence”. In: Oct. 2018, pp. 741– 747. doi: 10.1109/SmartWorld.2018.00142. [63] G. Ganino et al. “Ontology population for open-source intelligence: A GATE-based solution”. In: Software: Practice and Experience (Sept. 2018). doi: 10.1002/spe.2640. [64] W3C. Ontologies. [online], cit. [2020-9-4]. 2018. url: https://www. w3.org/standards/semanticweb/ontology. [65] L. Serrano et al. “Events Extraction and Aggregation for Open Source Intelligence: From Text to Knowledge”. In: Nov. 2013, pp. 518–523. isbn: 978-1-4799-2972-6. doi: 10.1109/ICTAI.2013.83. [66] University of Maryland. Global Terrorism Database. [online], cit. [2020-7-30]. url: https://www.start.umd.edu/gtd/. [67] E. Alpaydin. Introduction to Machine Learning. Adaptive Com- putation and Machine Learning series. MIT Press, 2020. isbn: 97802620437-93. url: https : / / books . google . cz / books ? id = tZnSDwAAQBAJ.


[68] M. Jordan and T. Mitchell. “Machine Learning: Trends, Perspec- tives, and Prospects”. In: Science (New York, N.Y.) 349 (July 2015), pp. 255–60. doi: 10.1126/science.aaa8415. [69] H. Pellet, S. Shiaeles, and S. Stavrou. “Localising social network users and profiling their movement”. In: Computers & Security 81 (2019), pp. 49 –57. issn: 0167-4048. doi: https://doi.org/10.1016/j. cose.2018.10.009. [70] Novetta. CLAVIN (Cartographic Location And Vicinity INdexer). [on- line], cit. [2020-8-16]. url: https://github.com/Novetta/CLAVIN. [71] P. Ranade et al. “Using Deep Neural Networks to Translate Multi- lingual Threat Intelligence”. In: Nov. 2018, pp. 238–243. doi: 10 . 1109/ISI.2018.8587374. [72] F. Alves, P. Ferreira, and A. Bessani. “Design of a Classification Model for a Twitter-Based Streaming Threat Monitor”. In: June 2019, pp. 9– 14. doi: 10.1109/DSN-W.2019.00010. [73] B. Mohit. “Named Entity Recognition”. In: Mar. 2014, pp. 221–245. isbn: 978-3-642-45357-1. doi: 10.1007/978-3-642-45358-8_7. [74] I. Deliu, C. Leichter, and K. Franke. “Extracting cyber threat intel- ligence from hacker forums: Support vector machines versus convolu- tional neural networks”. In: Dec. 2017, pp. 3648–3656. doi: 10.1109/ BigData.2017.8258359. [75] J. Gu et al. “Recent Advances in Convolutional Neural Networks”. In: Pattern Recognition (Dec. 2015). doi: 10.1016/j.patcog.2017.10. 013. [76] S. Mittal, A. Joshi, and T. Finin. “Cyber-All-Intel: An AI for Security related Threat Intelligence”. In: May 2019. [77] DarkSearch.io. DarkSearch. [online], cit. [2020-7-13]. url: https:// darksearch.io. [78] S. Chauhan and N. K. Panda. Hacking Web Intelligence - Open Source Intelligence and Web Reconnaissance Concepts and Techniques. Else- vier Inc., 2015. isbn: 978-0-12-801867-5. url: https://doi.org/10. 1016/C2014-0-00876-3. [79] Q. Revell, T. Smith, and R. Stacey. “Tools for OSINT-Based Inves- tigations”. In: Jan. 2016, pp. 153–165. isbn: 978-3-319-47670-4. doi: 10.1007/978-3-319-47671-1_10. [80] A. Bielska et al. Open Source Intelligence Tools and Resources Hand- book. [online], cit. [2020-8-27]. 2018. url: https://i-intelligence. eu / uploads / public - documents / OSINT _ Handbook _ June - 2018 _ Final.pdf.


[81] J. Nordine. OSINT Framework. [online], cit. [2020-7-10]. url: https: //osintframework.com. [82] M. Hoffman. Your OSINT Graphical Analyzer (YOGA). [online], cit. [2020-7-13]. url: https://yoga.osint.ninja. [83] B. Mortier. OSINT Open Source Intelligence Framework. [online], cit. [2020-7-13]. url: https://start.me/p/ZME8nR/osint. [84] Pipl. Pipl. [online], cit. [2020-7-29]. url: https://pipl.com. [85] Ancestry. Ancestry. [online], cit. [2020-7-29]. url: https://www.anc estry.com. [86] B. Sanders. MailTester. [online], cit. [2020-8-22]. url: https://mail tester.com. [87] T. Hunt. Have I Been Pwned? [online], cit. [2020-7-14]. url: https: //haveibeenpwned.com. [88] KnowEm? CheckUserNames. [online], cit. [2020-7-14]. url: https : //checkusernames.com. [89] M. Hoffman. WhatsMyName. [online], cit. [2020-7-14]. url: https: //github.com/WebBreacher/WhatsMyName. [90] sundowndev. PhoneInfoga. [online], cit. [2020-7-29]. url: https:// github.com/sundowndev/PhoneInfoga. [91] Facebook. Facebook for Developers. [online], cit. [2020-8-25]. url: ht tps://developers.facebook.com. [92] Twitter, Inc. Twitter API. [online], cit. [2020-8-25]. url: https:// developer.twitter.com/en/docs/twitter-api. [93] LinkedIn Corporation. LinkedIn Developers. [online], cit. [2020-8-25]. url: https://www.linkedin.com/developers/. [94] Tinfoleak. Tinfoleak. [online], cit. [2020-8-20]. url: https://tinfol eak.com/. [95] Social Searcher. Social Searcher. [online], cit. [2020-8-25]. url: https: //www.social-searcher.com. [96] webdevmedia. DNSlytics. [online], cit. [2020-7-14]. url: https : / / dnslytics.com. [97] IPinfo. IPinfo. [online], cit. [2020-7-14]. url: https://ipinfo.io. [98] IKnowWhatYouDownload. I Know What You Download. [online], cit. [2020-8-20]. url: https://iknowwhatyoudownload.com. [99] Sectigo Limited. crt.sh. [online], cit. [2020-7-15]. url: https://crt. sh. [100] urlscan GmbH. urlscan.io. [online], cit. [2020-7-15]. url: https:// urlscan.io. [101] Cisco Systems, Inc. SpamCop. [online], cit. [2020-7-15]. url: https: //www.spamcop.net.


[102] SURBL. SURBL. [online], cit. [2020-7-15]. url: http://www.surbl. org. [103] SORBS. SORBS. [online], cit. [2020-7-15]. url: http : / / www . us . sorbs.net. [104] abuse.ch. Abuse.ch SSL Blacklist. [online], cit. [2020-7-15]. url: http s://sslbl.abuse.ch. [105] RickAnalytics. Malware Domain Blocklist. [online], cit. [2020-7-15]. url: http://www.malwaredomains.com. [106] FireHOL. FireHOL IP Lists. [online], cit. [2020-7-15]. url: https: //iplists.firehol.org. [107] AlienVault. Open Threat Exchange. [online], cit. [2020-7-22]. url: ht tps://otx.alienvault.com. [108] AlienVault. ThreadCrowd. [online], cit. [2020-7-22]. url: https:// www.threatcrowd.org. [109] OPSWAT, Inc. MetaDefender. [online], cit. [2020-7-22]. url: https: //metadefender.opswat.com/. [110] Fortinet. FortiGuard Labs. [online], cit. [2020-7-22]. url: http : / / fortiguard.com. [111] BuiltWith R Pty Ltd. BuiltWith. [online], cit. [2020-7-15]. url: https: //builtwith.com. [112] Spyse. Spyse. [online], cit. [2020-7-10]. url: https://spyse.com. [113] Censys. Censys. [online], cit. [2020-7-22]. url: https://censys.io. [114] Nmap. Nmap. [online], cit. [2020-7-15]. url: https://github.com/ nmap/nmap. [115] R. D. Graham. MASSCAN. [online], cit. [2020-7-15]. url: https : //github.com/robertdavidgraham/masscan. [116] Google LLC. Google Custom Search. [online], cit. [2020-7-10]. url: https://developers.google.com/custom-search. [117] Microsoft Corporation. Bing Web Search API. [online], cit. [2020-7-10]. url: https://azure.microsoft.com/en-us/services/cognitive- services/bing-web-search-api/. [118] DuckDuckGo, Inc. DuckDuckGo Instant Answer API. [online], cit. [2020-7-10]. url: https://api.duckduckgo.com/api. [119] Offensive Security. Google Hacking Database. [online], cit. [2020-7-14]. url: https://www.exploit-db.com/google-hacking-database. [120] Ahmia. Ahmia. [online], cit. [2020-7-13]. url: https://ahmia.fi. [121] TorchSearch.net. Torch Search Engine. [online], cit. [2020-7-13]. url: https://torchsearch.net. [122] OnionSearchEngine.com. Onion Search Engine. [online], cit. [2020-7- 13]. url: https://onionsearchengine.com.


[123] E. Maor. Kilos: The Dark Web’s Newest – and Most Extensive – Search Engine. [online], cit. [2020-7-13]. url: https://intsights. com/blog/kilos-the-dark-webs-newest-and-most-extensive- search-engine. [124] PublicWWW. PublicWWW. [online], cit. [2020-7-14]. url: https:// publicwww.com. [125] B. Boyter. Searchcode. [online], cit. [2020-7-14]. url: https://searc hcode.com. [126] M. Fagan. Fagan Finder. [online], cit. [2020-8-31]. url: https://www. faganfinder.com/. [127] ElevenPaths. FOCA. [online], cit. [2020-7-29]. url: https://github. com/ElevenPaths/FOCA. [128] Edge-Security. Metagoofil. [online], cit. [2020-7-29]. url: http://www. edge-security.com/metagoofil.php. [129] P. Harvey. ExifTool. [online], cit. [2020-7-29]. url: https://exiftoo l.org. [130] Google LLC. Google Images. [online], cit. [2020-7-29]. url: https: //images.google.com. [131] Flickr. Flickr Map. [online], cit. [2020-8-31]. url: https://www.flic kr.com/map. [132] A. Mohawk. PasteLert. [online], cit. [2020-7-30]. url: https://www. andrewmohawk.com/pasteLert/. [133] A. Musciano. Sniff-Paste: OSINT Pastebin Harvester. [online], cit. [2020-7-30]. url: https://github.com/needmorecowbell/sniff- paste. [134] Internet Archive. Wayback Machine. [online], cit. [2020-7-31]. url: https://web.archive.org. [135] Web Scraper. Web Scraper. [online], cit. [2020-7-31]. url: https:// webscraper.io. [136] ScraperAPI. ScraperAPI. [online], cit. [2020-7-31]. url: https://www. scraperapi.com. [137] ScrapeSimple. ScrapeSimple. [online], cit. [2020-7-31]. url: https : //www.scrapesimple.com. [138] Scrapinghub. Scrapy. [online], cit. [2020-7-31]. url: https://scrapy. org. [139] Ahmia. Ahmia Crawler. [online], cit. [2020-7-23]. url: https://git hub.com/ahmia/ahmia-crawler. [140] Ahmia. Ahmia Index. [online], cit. [2020-7-23]. url: https://github. com/ahmia/ahmia-index.


[141] Intelligence X. Intelligence X. [online], cit. [2020-7-22]. url: https: //intelx.io. [142] ShadowDragon, LLC. ShadowDragon. [online], cit. [2020-7-22]. url: https://shadowdragon.io. [143] Edge-Security. theHarvester. [online], cit. [2020-7-15]. url: https : //github.com/laramies/theHarvester. [144] T. Tomes. Recon-ng. [online], cit. [2020-7-15]. url: https://github. com/lanmaster53/recon-ng. [145] T. Tomes. Recon-ng Marketplace. [online], cit. [2020-7-22]. url: http s://github.com/lanmaster53/recon-ng-marketplace. [146] M. Technologies. Maltego. [online], cit. [2020-7-16]. url: https:// www.maltego.com. [147] Maltego Technologies. Maltego Transform Hub. [online], cit. [2020-7- 24]. url: https://www.maltego.com/transform-hub/. [148] Spread Security. Open Source Intelligence with Maltego. [online], cit. [2020-12-20]. url: https://spreadsecurity.github.io/2016/09/ 03/open-source-intelligence-with-maltego.html. [149] SpiderFoot. SpiderFoot. [online], cit. [2020-7-10]. url: https://www. spiderfoot.net. [150] SpiderFoot. SpiderFoot Documentation. [online], cit. [2020-7-25]. url: https://www.spiderfoot.net/documentation/. [151] The PostgreSQL Global Development Group. PostgreSQL. [online], cit. [2020-12-6]. url: https://www.postgresql.org/. [152] OpenStreetMap. OpenStreetMap. [online], cit. [2020-12-6]. url: http s://www.openstreetmap.org/. [153] freegeoip.app. freegeoip.app. [online], cit. [2020-12-6]. url: http:// freegeoip.app. [154] ipify.org. ipify IP Geolocation API. [online], cit. [2020-12-6]. url: ht tp://geo.ipify.org. [155] ip api.com. ip-api.com. [online], cit. [2020-12-6]. url: http://ip- api.com. [156] ipgeolocation.io. ipgeolocation.io. [online], cit. [2020-12-6]. url: http: //ipgeolocation.io. [157] ipdata.co. ipdata.co. [online], cit. [2020-12-6]. url: http://ipdata. co. [158] eXTReMe digital. eXTReMe-IP-LOOKUP. [online], cit. [2020-12-6]. url: http://extreme-ip-lookup.com. [159] GEOPLUGIN, SAS. geoPlugin. [online], cit. [2020-12-6]. url: http: //geoplugin.net.


[160] ipwhois.io. ipwhois.io. [online], cit. [2020-12-6]. url: http://ipwhois. io. [161] Ipregistry. Ipregistry. [online], cit. [2020-12-6]. url: http://ipregis try.co. [162] WHOIS API, Inc. WhoisXMLAPI. [online], cit. [2020-11-16]. url: h ttps://www.whoisxmlapi.com. [163] IPLocate.io. IPLocate.io. [online], cit. [2020-12-6]. url: http://iplo cate.io. [164] Utrace.de. Utrace.de. [online], cit. [2020-12-6]. url: http://utrace. de. [165] MaxMind, Inc. MaxMind GeoIP. [online], cit. [2020-12-6]. url: https: //www.maxmind.com/en/geoip2-databases. [166] VirusTotal. VirusTotal. [online], cit. [2020-12-6]. url: http://virus total.com. [167] abuse.ch. Feodotracker. [online], cit. [2020-12-6]. url: http://feodo tracker.abuse.ch. [168] AlientVault. AlienVault reputation list. [online], cit. [2020-12-6]. url: http://reputation.alienvault.com/reputation.generic. [169] CINSscore.com. CINSscore.com. [online], cit. [2020-12-6]. url: http: //cinsscore.com. [170] Blocklist.de. Blocklist.de Lists. [online], cit. [2020-12-6]. url: http: //lists.blocklist.de/lists. [171] T. S. P. SLU. Spamhaus DROP list. [online], cit. [2020-12-6]. url: http://spamhaus.org/drop. [172] OpenPhish. OpenPhish. [online], cit. [2020-12-6]. url: http://openp hish.com. [173] ZeroDot1. ZeroDot1 CoinBlockerLists. [online], cit. [2020-12-6]. url: https://zerodot1.gitlab.io/CoinBlockerListsWeb/. [174] L. Hansliková. “Categorizing and visualizing the Dark Web”. Master’s thesis. Masarykova univerzita, Fakulta informatiky, Brno, 2020 [cit. 2020-11-16]. url: https://is.muni.cz/th/noh49/. [175] N. Galbreath. ipcat. [online], cit. [2020-11-16]. url: https://github. com/client9/ipcat/. [176] Google LLC. Google. [online], cit. [2020-11-16]. url: https://dns. google.com/. [177] Public-DNS. Public-DNS. [online], cit. [2020-11-16]. url: https:// public-dns.info/. [178] F. Denis. IP2ASN. [online], cit. [2020-11-16]. url: https://iptoasn. com/.


[179] MultiProxy.org. MultiProxy. [online], cit. [2020-11-16]. url: http:// multiproxy.org/. [180] CIRCL.LU. Passive DNS. [online], cit. [2020-11-16]. url: http:// circl.lu/services/passive-dns/. [181] CIRCL.LU. Passive SSL. [online], cit. [2020-11-16]. url: http : / / circl.lu/services/passive-ssl/. [182] The.Earth.li. The.Earth.li. [online], cit. [2020-11-16]. url: http:// the.earth.li/. [183] C. T. Terrón. PGP Public Key Server. [online], cit. [2020-11-16]. url: http://pgp.key-server.io/. [184] psbdmp. Pastebin Dump. [online], cit. [2020-11-16]. url: https:// psbdmp.cc/. [185] spyonweb.com. SpyOnWeb. [online], cit. [2020-11-16]. url: https:// spyonweb.com/. [186] Tor Project. Tor Project. [online], cit. [2020-11-16]. url: https:// torproject.org. [187] Whoxy.com. Whoxy. [online], cit. [2020-11-16]. url: https://www. whoxy.com/. [188] Whoisology. Whoisology. [online], cit. [2020-11-16]. url: https:// whoisology.com/. [189] R. Firestein. Cymon API v2. [online], cit. [2020-12-6]. url: http : //cymon.docs.apiary.io. [190] Amazon.com Inc. Amazon Global Infrastructure. [online], cit. [2020- 12-25]. url: https://aws.amazon.com/about-aws/global-infras tructure/. [191] Google LLC. Google Data Centers Locations. [online], cit. [2020-12-25]. url: https://www.google.com/about/datacenters/locations/. [192] NordVPN. NordVPN Servers. [online], cit. [2020-12-25]. url: https: //nordvpn.com/servers/.

A Appendices

Electronic attachments content

• /Pantomath/ – Directory containing the Pantomath tool

  – ./config/ – Directory containing all configuration files with settings for API keys, PostgreSQL, Tor, reliability estimates, etc.
  – ./README.md – File containing instructions for installing and running Pantomath; a more detailed description follows below
  – ./pantomath_cli – The Pantomath command-line interface
  – ./docker-compose.yml – The file used for deployment of Pantomath inside containers using Docker Compose

• /CLAVIN/ – Directory containing CLAVIN plugin for resolution of locations in the dark web dataset

• /Results/ – Directory containing results in the form of JSON files for queries of muni.cz (search depth of 1), fi.muni.cz (search depth of 0), and crocs.fi.muni.cz (search depth of 0)

• /Reliability/ – Directory containing the datasets used for reliability testing and the results of the testing

Installing and Using Pantomath

The easiest way to install and run Pantomath is to install docker-compose and build the required containers using the docker-compose.yml file in the /Pantomath/ directory. To do that, run the following sequence of commands.

Build both services:

$ docker-compose build

Run the database container in detached mode:

$ docker-compose up -d postgres

Run the Pantomath container, which opens the command-line interface inside the container:

$ docker-compose run pantomath

When finished, run the following command to stop all running components:

$ docker-compose down

There are other deployment options as well. Pantomath needs the PostgreSQL database to be running with the appropriate settings set in the configuration file. The database container can be launched as before, and Pantomath can be installed and used locally. To install Pantomath on a Debian system, install the following packages:

$ apt-get install python3 python3-pip python3-dev libc6-dev gcc libxslt1-dev build-essential libssl-dev libffi-dev postgresql-server-dev-all tor

Then install the Python libraries using the requirements.txt file:

$ pip3 install -r requirements.txt

The command-line interface can be launched using the script pantomath_cli in the /Pantomath/ directory:

$ ./pantomath_cli

When the command-line interface is up and running, the help command can be used to see all available commands with a description. The modules command shows all available modules, the arguments they take as an input, whether they provide offline data, and their description. A typical workflow starts with an update of the offline data using the update command. Once the database contains fresh data, the query command can be used with an arbitrary IP address, domain name, or e-mail address as a target:


[pantomath]$ query DOMAIN muni.cz

To specify how deep the search should be, an optional argument search_depth can be given to the command:

[pantomath]$ query IPv4 147.251.5.239 2

When all targets with the given depth are queried, the CLI asks the user whether the search should continue. The default search depth is set to 0, i.e., only the first target is searched. The modes can be changed using the overt, stealth, and offline commands, which also changes the prompt to show which mode is currently active:

[pantomath][offline]$

[pantomath][stealth]$

After the search is over, the results can either be printed to the standard output or exported to a JSON file using the print and export commands.
