Blocklist Babel: on the Transparency and Dynamics of Open Source Blocklisting
Total Page:16
File Type:pdf, Size:1020Kb
1 Blocklist Babel: On the Transparency and Dynamics of Open Source Blocklisting Alvaro´ Feal, Pelayo Vallina, Julien Gamba, Sergio Pastrana, Antonio Nappa, Oliver Hohlfeld, Narseo Vallina-Rodriguez, and Juan Tapiador Abstract—Blocklists constitute a widely-used Internet security are protecting an estimated 3,126,410,000 user mailboxes [1]. mechanism to filter undesired network traffic based on IP/domain Similarly, MISP (Malware Intelligence Sharing Platform [2]), reputation and behavior. Many blocklists are distributed in open a popular open source threat intelligence platform, reports that source form by threat intelligence providers who aggregate and process input from their own sensors, but also from third- it is used by more than 6,000 organizations worldwide. Beyond party feeds or providers. Despite their wide adoption, many blocklists, other feeds can be used by applications to block or open-source blocklist providers lack clear documentation about restrict access. One example are feeds of Tor exit node IPs, their structure, curation process, contents, dynamics, and inter- e.g., used by Wikipedia to block users of the Tor anonymity relationships with other providers. In this paper, we perform network from editing Wikipedia due to abuses in the past [3]. a transparency and content analysis of 2,093 free and open source blocklists with the aim of exploring those questions. To Blocklists can be proprietary (i.e., commercial) or open that end, we perform a longitudinal 6-month crawling campaign source. The former are often available through rate-limited, yielding more than 13.5M unique records. This allows us to shed license-based, or pay-per-use mechanisms and are maintained light on their nature, dynamics, inter-provider relationships, and by for-profit companies specialized in threat intelligence. On transparency. Specifically, we discuss how the lack of consensus the other hand, we define open source blocklists as lists that on distribution formats, blocklist labeling taxonomy, content focus, and temporal dynamics creates a complex ecosystem that are openly and freely available for anyone to collect and use. complicates their combined crawling, aggregation and use. We The ecosystem of open source blocklists is heterogeneous, also provide observations regarding their generally low overlap as with different types of providers, from for-profit companies well as acute differences in terms of liveness (i.e., how frequently that make publicly available—often partially—some of their records get indexed and removed from the list) and the lack of data feeds, to individual users or organizations that publish documentation about their data collection processes, nature and intended purpose. We conclude the paper with recommendations their own collected and curated data. One key aspect of this in terms of transparency, accountability, and standardization. diversity is that different providers follow different methods to collect, curate, maintain, and label their blocklists. Yet Index Terms—Security Management, Functional Areas; Net- work Monitoring and measurements, Methods; Internet Connec- openness does not necessarily imply transparency. Often times, tivity and Internet Access Services, Service Management. their internal data collection and sanitization processes are not sufficiently documented, thus impeding their optimal usage and, ultimately, their widespread adoption. We elaborate on I. INTRODUCTION these issues below: Network management practices often rely on data feeds • Data sources: While some providers operate their own assigning labels to network entities for access control, e.g., sensors and honeypots to detect malicious activities, others feeds provided by threat intelligence platforms to identify and might aggregate records from other providers’ blocklists. block content regarded as malicious, untrustworthy, or simply Yet, many blocklist providers do not document their data bad reputed. Such feeds are commonly regarded as blocklists sources. This impedes blocklist consumers from assessing (or blacklists) and can contain a variety of actionable records, the limitations, representativity, applicability, and scope of including IP addresses, hostnames, URLs, TLS certificate a given feed. Understanding the nature of the records con- fingerprints, or file hashes. The effectiveness of blocklist- tained in a blocklist and the potential overlap between dif- ing to raise the bar against unwanted behavior, traffic, and ferent providers is instrumental for an optimal exploitation, content has driven a steady growth in the threat intelligence particularly for end-users combining multiple feeds into a sharing ecosystem, with a variety of products and platforms single, consolidated database. When open source blocklist streamlining the aggregation, enforcement, and distribution of providers fail to document their data sources, users might actionable data. Even if the market penetration of blocklisting end up aggregating lists compiled for different purposes or is hard to know, Spamhaus–one of the oldest threat intelligence that are indeed indexing the same entries, possibly under providers–states that as of October 6, 2020 their blocklists different tags or labels. • Data curation: Blocklist providers should make available a A.´ Feal, P. Vallina, J. Gamba and N. Vallina-Rodriguez are with IMDEA Networks Institute. clear description of the methodology followed to guarantee A.´ Feal, P. Vallina, J. Gamba, S. Pastrana, A. Nappa and J. Tapiador are that the data they provide is sound and does not include false with Universidad Carlos III de Madrid. positives or outdated records. For instance, a given IP that A. Nappa is with U.C. Berkeley O. Hohlfeld is with Brandenburg University of Technology. was part of a botnet in the past can be benign later down the N. Vallina-Rodriguez is with ICSI. road. Similarly, when new malicious indicators appear, they 2 should be included in well-maintained blocklists as soon as categorization processes can lead to problems to users that try possible. For those blocklists feeding from each other, record to improve their protection by aggregating lists from different propagation across lists might negatively contribute to the providers. This is an under-researched area that deserves a dissemination of false positives, making it harder to remove closer examination given the widespread use of open source records once they are detected as such.1 As a result, the blocklists in operational environments. update frequency of lists, including the addition or removal To address the challenges discussed above, in this paper we of records, is an important characteristic to measure their run an extensive campaign of identification and crawling of liveness and data curation processes. 2,093 open source blocklists from 69 providers (Section III). • Record and blocklist labeling: Open source blocklists Our dataset includes feeds from prominent providers (e.g., are widely heterogeneous in terms of the type of records MISP or Spamhaus) and blocklists used for different cases and contained in their feeds and the potentially harmful behavior applications: from identifying users from anonymity networks that they try to protect against (i.e., tracking, phishing, or to blocking malicious content. Our large-scale and diverse malware). Yet, many providers publish their feeds without dataset allows us to gain a unique view to investigate fun- sufficiently describing their actual purpose or application. damental questions related to the nature, dynamics, pitfalls, While some platforms rely on public taxonomies to label transparency and purpose of the open-source blocklist ecosys- records in their list, many blocklists simply provide a custom tem. Our paper contributions are as follows: label, if any. Even for those that try to label the nature • We inspect the ecosystem of open source blocklists pro- of their data and their data sources, there are substan- viders, analyzing their characteristics and how transparent tial methodological differences (and subjective perceptions) they are (Section IV). We also study the differences across across providers when defining their records. For instance, blocklists in terms of the type of records that they include labeling differences often come down to a granularity issue, (Section V). with some lists using generic labels such as “malicious,” • We analyze the dynamics of the blocklists in our dataset by while others offer a specialized classification system for studying their changes (i.e., record additions and removals) malicious types. These differences in terms of labeling and update rates over time (Section VI). We observe that methodology and strategy are problematic because (i) they 30% of the lists never change and that 7% change at result in instances where a given entry appears in blocklists least daily. In the case of those blocklists that are updated, that have different purposes according to their providers; 12% only add entries, 32% only sporadically remove them and (ii) complicate their comparison and aggregation into (possibly due to sanitization efforts), and 56% both add and more comprehensive threat intelligence feeds. remove records. • We study the relationships between blocklists in terms of The aforementioned reasons call for an empirical analysis content overlap (Section VII) and how records propagate and measurement of the soundness, freshness, dissemination, among multiple blocklists over time (Section VIII). These and nature of the records present in open source blocklists,