Open-Source Measurement of Fast-Flux Networks While
Total Page:16
File Type:pdf, Size:1020Kb
Open-source Measurement of Fast-flux Networks While Considering Domain-name Parking Leigh B. Metcalf Dan Ruef Software Engineering Institute Software Engineering Institute Carnegie Mellon University Carnegie Mellon University Jonathan M. Spring Software Engineering Institute Carnegie Mellon University Abstract 1 Introduction Background: Fast-flux is a technique malicious ac- Fast-flux service networks were first reported in tors use for resilient malware communications. In 2007, identified as “a network of compromised com- this paper, domain parking is the practice of assign- puter systems with public DNS records that are con- ing a nonsense location to an unused fully-qualified stantly changing, in some cases every few minutes” domain name (FQDN) to keep it ready for “live” [25, §1]. Criminals use the technique to “evade iden- use. Many papers use “parking” to mean typosquat- tification and to frustrate law enforcement and anti- ting for ad revenue. However, we use the original crime efforts aimed at locating and shutting down meaning, which was relevant because it is a poten- web sites” that are used for abuse or illegal purposes tially confounding behavior for detection of fast-flux. [13, p. 2]. Internet-wide fast-flux networks and the extent to Despite this long history, and a variety of pub- which domain parking confounds fast-flux detection lications on detecting fast flux, there is no main- have not been publicly measured at scale. tained, open-source tool that can detect it at Aim: Demonstrate a repeatable method for open- scale. The Honeynet Project’s own tool for the source measurement of fast-flux and domain park- purpose, Tracker (http://honeynet.org/project/ ing, and measure representative trends over 5 years. Tracker), has a defunct homepage. Tools from the Method: Our data source is a large passive-DNS col- time, such as ATLAS [20] and FluXOR [21], han- lection. We use an open-source implementation that dle on the order of 400 domains. Our tool, Analysis identifies suspicious associations between FQDNs, Pipeline, handles networks on the order of 1 million IP addresses, and ASNs as graphs. We detect park- fully-qualified domain names (FQDN, hereafter sim- ing via a simple time-series of whether a FQDN ad- ply “domain” if the usage is unambiguous). Pipeline vertises itself on IETF-reserved private IP space and simultaneously tracks other network behavior, such public IP space alternately. Whitelisting domains as network flow records. Therefore Pipeline can de- that use private IP space for encoding non-DNS re- tect when a host connects to an IP address in the sponses (e.g. blacklist distributors) is necessary. fast-flux network in near-real time, for example. Results: Fast-flux is common; usual daily values We also measure a phenomenon mentioned but are 10M IP addresses and 20M FQDNs. Domain not measured in some older fast-flux detection pa- parking, in our sense, is uncommon (94,000 unique pers: domain name parking. When a domain is FQDNs total) and does not interfere with fast- parked on an IP address, the IP address to which flux detection. Our open-source tool works well at the domain resolves is inactive or otherwise not internet-scale. controlled by the domain owner. Parking is com- Discussion: Real-time detection of fast-flux net- mon practice when a user first registers an effective works could help defenders better interrupt them. second-level domain (eSLD) – the registrar supplies With our implementation, a resolver could poten- a nonsense IP address to prevents DNS errors. How- tially block name resolutions that would add to a ever, this parking pattern is distinctive and simple. known flux network if completed, preventing even We look for other, suspicious patterns. the first connection. Parking is a poor indicator of There are multiple distinct senses of the term “do- malicious activity. main parking,” and our topic is not synonymous [Distribution Statement A] Approved for public release and unlimited distribution. with any other study of which we are aware. Domain method between our parking and fast flux measure- parking on private IP address space is, however, a ments, which primarily is the passive DNS data relatively old phenomenon; it is mentioned in some source and our open-source tool Analysis Pipeline. fast-flux identification algorithm studies as an obsta- We present our measurement method for FQDNs cle [35, 14]. This older usage of “domain parking” is that exhibit parking on private IP address space our topic of study We use domain parking on private in Section 2.1. Section 2.2 describes our fast-flux IP address space to differentiate it from the newer measurement methods. Section3 presents the full usage [2, 34] that more accurately is domain parking results. Section4 interprets and discusses these re- on routeable IP addresses for advertisement revenue sults. generation. Two of the first studies on parking domains for il- 2 Method licit ad revenue find large-scale use of 4 million to 8 million domains [2, 34]. However, from the authors’ Our measurements of domain parking of private IP description this appears to be more like typosquat- address space and fast-flux networks use different ting (as described in Szurdi et. al. [31]) than res- algorithms over the same data set and implemented olution error suppression. We are not studying ty- using the same open-source tool. We describe the posquatting or skimming ad revenue off user typos. common measurement period, data, and tool here. Domain parking of the sort we study is a strategy for The methods specific to each parking and fast-flux suppressing domain resolution errors, likely used to measurement are described in the following subsec- keep command and control infrastructure stealthy. tions. The domain name system permits a variety of dif- We measure activity during over five years of pas- ferent resiliency mechanisms for distributed archi- sive DNS data, from January 1, 2012 to June 30, tectures. Often these have legitimate uses, but ma- 2017. The data source, the Security Information licious actors are equally able to adopt successful Exchange (SIE), has been demonstrated to be rea- techniques. Fast-flux and domain parking of private sonably representative of the global Internet with IP space are both candidates for such abuse. ICANN a small North American collection bias [29]. This responded to the security concerns from fast-flux is high-volume passive DNS data, as well as be- networks in 2009 by stating there would be no policy ing representative. Each month, the unique FQDNs response [15, p. 10]. Thus, of the six mitigation op- observed range between 550 million and 1 billion. tions outlined in the original Honeynet analysis, five With the relatively small and stable zone .edu, the are unlikely or inconsistently applied because they data source is sufficiently represetative to recon- can only be enacted by ISPs or Registrars. The last, struct 93% of the zone in five weeks [27]. “passive DNS harvesting/monitoring to identify A We use our own data filtering and packing tools, or NS records advertised” as part of fast-flux net- independent from the SIE database. About 35-40 works [25, §10], is the approach we implement with GB of data is ingested daily in compressed nmsgtool Analysis Pipeline. Our detection method is inspired format [10], including source DNS server and pre- by the Mannheim score [12]. cise time range the response was valid. Unique re- We find a mixture of positive and negative results. source record sets (RRsets) are extracted for each On one hand, our measurement of fast-flux networks 24-hour day, with a cutoff of 0000 UTC. We store confirms our hypothesis that this behavior remained just the fields for rname, TTL, type, and rdata. prevalent. As one negative example, domain park- The nmsg data canonicalizes rdata, so this field ing on private IP address space is not worth much is sorted set of all rdata in a single DNS message concern, for fast-flux network detection or otherwise. for a rname,type,class triple. The rname field is Negative results are important and useful in shap- label-wise reversed, so www.example.com becomes ing future work. Publication bias has been a docu- com.example.www; this makes sorting and lookup mented concern in medical literature for 30 years [8]. easier, as the TLD is usually a more important Despite this attention, publication of negative re- key. The RRsets are then simply sorted and unique sults has generally dwindled across disciplines. The RRsets stored per day. When compressed with stan- relative publication frequency of positive results over dard tools such as bzip or gzip, this ASCII storage negative results grew by 22% from 1990 to 2007 [9]. format takes about 5 GB per day. We expect such publication bias away from negative The rationale for this storage method is similar results is a contributing factor to why there seems to that for why SiLK, a netflow analysis tool suite, to be no public measurement of this phenomenon. stores flow in time-sorted, partitioned flat files via Section2 discusses the common elements of the the file system rather than in a database [32]. We are [Distribution Statement A] Approved for public release and unlimited distribution. interested in long trends, or retrospective analysis of Algorithm 1 Analysis Pipeline command-line poorly understood past events. This is different from the use case of many passive DNS users, who are /usr/local/sbin/pipeline \ --site-config-file=/usr/local/share/ looking for keywords indicating abuse of particular silk/silk.conf \ brand names. Our storage format provides details on --alert-log-file=~/AlertLog.txt \ what domains resolved to on particular days. This --aux-alert-file=~/AuxLog.txt \ -- ipfix \ allows us to see interleaving changes in domain-IP --time-is-clock \ mappings and large gaps of inactivity that are not --configuration=~/parking.conf \ possible in a database that only stores first- and last- --name-files input_files_list seen times.