Automated Collection of Open Source Intelligence

Masaryk University Faculty of Informatics Automated Collection of Open Source Intelligence Master’s Thesis Bc. Ondřej Zoder Brno, Fall 2020 Declaration Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or ex- cerpted during elaboration of this work are properly cited and listed in com- plete reference to the due source. Bc. Ondřej Zoder Advisor: RNDr. Lukáš Němec i Acknowledgements I would like to thank my advisor RNDr. Lukáš Němec for his guidance throughout the entirety of this thesis. My thanks also go to RNDr. Martin Stehlík, Ph.D. who provided many valuable suggestions and helped shape the tool that is the outcome of this thesis. Huge appreciation goes to my family for the support they have given me during all the years of my studies. I also want to thank Cedric from CIRCL.LU for providing me free access to their Passive DNS and Passive SSL databases, and Gregory from Spyse for giving me a free trial for their port discovery service, which allowed me to extend Pantomath and further test the reliability estimation model. ii Abstract With the ever-growing amount of data available on the Internet and the wide- spread adoption of social media networks, publicly accessible websites have grown into a goldmine of valuable information about individuals and companies. Open Source Intelligence, shortly OSINT, is any information obtain- able legally and ethically from publicly available sources addressing specific intelligence requirements. The relatively easy and cheap integration makes OSINT a practical solution for national security, cyber threat intelligence, and many other fields. This thesis presents a framework called Pantomath for an automated collection of OSINT that utilizes many existing tools and services. The framework is highly modular, provides all the functionality needed throughout the whole process of OSINT, offers three modes of operation for different anonymity requirements, and presents the data in a struc- tured output. The reliability of some of the collected data is estimated to allow the user to analyze the data more efficiently and precisely. The framework is compared to existing OSINT automation tools, and the most notable advantages and disadvantages are discussed. iii Keywords OSINT, open-source intelligence, OSINT automation, military intelligence, social media intelligence, threat intelligence, Pantomath iv Contents 1 Introduction 1 2 Open Source Intelligence 3 2.1 Challenges .............................5 2.1.1 Legal and Ethical Aspects . .6 2.2 Value and Use Cases .......................8 2.2.1 Military Intelligence . .9 2.2.2 Cybersecurity . 10 2.2.3 Social and Business Intelligence . 11 2.3 State-of-the-Art .......................... 12 2.3.1 Natural Language Processing . 13 2.3.2 Machine Learning . 14 3 OSINT Sources and Tools 17 3.1 Overview ............................. 17 3.2 OSINT Automation ....................... 23 3.2.1 Recon-ng . 24 3.2.2 Maltego . 25 3.2.3 SpiderFoot . 26 4 Pantomath: Tool for Automated OSINT Collection 28 4.1 Problem Statement ........................ 28 4.2 Architecture and Functionality ................. 30 4.2.1 Base Framework . 31 4.2.2 Modes of Operation . 33 4.2.3 Modules . 35 4.3 Reliability Estimation ...................... 41 4.3.1 Cyber Threat Intelligence . 44 4.3.2 Geolocation . 45 4.3.3 Port Discovery . 47 5 Evaluation and Discussion 49 5.1 Evaluation of Reliability Estimation .............. 49 5.1.1 Cyber Threat Intelligence . 49 5.1.2 Geolocation . 50 5.1.3 Port Discovery . 53 5.2 Comparison with Existing Tools ................ 54 5.3 Future Work ........................... 56 v 6 Conclusions 59 Bibliography 60 A Appendices 73 vi 1 Introduction With the exponential growth of the Internet in the last few decades, the amount of data stored around the world has become immeasurable. It is estimated that four of the biggest online companies, Amazon, Microsoft, Google, and Facebook, store at least 1.2 million terabytes of data. At first, data was thought of as a mere by-product of computing, but it has even- tually grown into a product itself [1]. Companies sell their users’ data to others that benefit from it, so collecting data of any value is essential for many. A large portion of Internet data is accessible to anyone with an In- ternet connection and often contains a lot of knowledge about individuals, companies, or governments. All this data is commonly called Open Source Intelligence, or shortly OSINT. The value of OSINT is increasingly getting recognized in many different fields. According to [2], over 80% of the knowledge used for policymaking on a national level is derived from OSINT. Cyber threat intelligence heavily utilizes OSINT and combines it with data collected by security devices to evaluate possible threats to companies’ infrastructures. All in all, publicly available sources constitute an irreplaceable source of knowledge. However, due to the immense amount of data on the Internet and its unstructured and heterogeneous nature, the collection and processing of OSINT is a challeng- ing task requiring non-trivial methods. Arguably one of the biggest drawbacks of OSINT is the lack of mechanisms for verification of the collected information [3]. To make the whole process of OSINT easier and more accessible, various tools and services that provide useful information exist. These range from simple websites that provide basic information about IP addresses to more complex tools implementing state-of-the-art algorithms, such as Shodan [4]. A framework called Pantomath for an automated collection of OSINT is presented in this thesis. The framework utilizes existing tools and services that provide valuable information about Internet identifiers, such as IP addresses or domain names. As the number of these services is enormous, Pantomath was designed to make the integration of new sources more straightforward by moving the data collection to separate modules, which can be added by merely implementing a well-defined interface. To address the user’s anonymity requirements, Pantomath offers three modes of operation with varying guarantees and drawbacks. The overt mode represents a regular operation where all sources are used, and an Internet connection is required. In the stealth mode, all requests sent to the Internet 1 1. Introduction are proxied through the Tor network. The offline mode provides the highest guarantees for the user’s anonymity, as only a database of preprocessed data is queried, and no Internet connection is needed. Pantomath also attempts to tackle possibly the biggest challenge of OSINT – the validation of the gath- ered data. A mathematical model for reliability estimation of the results is defined and used in several modules. The thesis is organized as follows. Chapter 2 introduces OSINT, discusses some of the challenges, the value it provides, the fields where OSINT is often utilized, and a few state-of-the-art techniques that improve the efficiency of OSINT collection. Chapter 3 outlines the sources that can be used to gather the data and some tools that aim to automatize this process. Pantomath, a tool for an automated collection of OSINT, is presented in Chapter 4. Chapter 5 evaluates Pantomath, compares it to tools with similar goals, and drafts the possible extensions and improvements. 2 2 Open Source Intelligence Intelligence is a process of information gathering for the purpose of providing a clear understanding of issues, allowing responsible people to make indepen- dent and impartial decisions [3]. Thomas Fingar [5] states that the primary purpose of intelligence is to reduce uncertainty about intentions, capabilities, and actions of adversaries and allies. To be of any value, intelligence must be up-to-date, accurate, relevant, and verifiable. The goal of intelligence is not only to collect data but also to identify parts of the data that are valuable for the issue at hand, link them together, and evaluate them. Open Source Intelligence (OSINT) is an intelligence based on information that can be obtained legally and ethically from publicly available sources [6]. OSINT is considered to be the oldest form of intelligence gathering, with its earliest usage going as far as the Second World War, where radio and print sources were used [7]. However, its utility increased significantly with the emergence of information technologies and the Internet in particular [8]. It is estimated that over 80% of knowledge used for policymaking on a national level is derived from OSINT [2, 9]. OSINT is a broad term, and the exact definitions can vary depending on the field of study. The Office of the Director of National Intelligence of the U.S. [10] defines it as intelligence produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement. They state that the sources of OSINT include mass media, public data, gray literature, and observation and reporting. For the purpose of this thesis, the definition will be narrowed down to intelligence based on information openly accessible over the Internet. The Internet itself is not considered as a source of OSINT, but rather a platform through which the sources are accessed. There are borderline cases of sources that might not be regarded as part of OSINT by some definitions, e.g., any private information that was made public even though it was not the intention of the owner of that information. That might occur due to some error, e.g., a misconfiguration of the system containing this information or because a third party published it. Examples of such sources are WikiLeaks [11] or any data leaks that are available on the Internet. This thesis considers this type of information as OSINT. For example, discovering a vulnerability in a system is recognized as OSINT, exploiting this vulnerability to bypass the security of the system and gain some information from the inside is not.

Automated Collection of Open Source Intelligence

Better Market Intelligence with Smart Search Anaging Uncertainty and Risk in Business Requires a 1 Mcomprehensive Market Intelligence Approach

Competitive Intelligence: Systematic Collection and Analysis of Information

Open Source Intelligence and Osint Applications

Market Intelligence Surveillance Market Intelligence, Surveillance

BIS Quarterly Review September 2007 International Banking and Financial Market Developments

Introduction to Competitive Intelligence

January 26, 2021 | for Informational Purposes Only This Is

Market Intelligence in Large Companies

Market Intelligence Utilization by Small Food Companies: an Application of the Grounded Theory Method in Exploratory Research Aaron J

Basel “IV”: What’S Next for Banks? Implications of Intermediate Results of New Regulatory Rules for European Banks

Oaklins PE Newsletter Oct2020.Pdf

Cyber Risk Toolkit