Towards Automated Detection of Shadowed Domains
Total Page:16
File Type:pdf, Size:1020Kb
Don’t Let One Rotten Apple Spoil the Whole Barrel: Towards Automated Detection of Shadowed Domains Daiping Liu Zhou Li Kun Du University of Delaware ACM Member Tsinghua University [email protected] [email protected] [email protected] Haining Wang Baojun Liu Haixin Duan University of Delaware Tsinghua University Tsinghua University [email protected] [email protected] [email protected] ABSTRACT scammers set up phishing websites on domains resembling well- Domain names have been exploited for illicit online activities for known legitimate ones [38, 75]. In the past, Internet miscreants decades. In the past, miscreants mostly registered new domains mostly registered new domains to launch attacks. To mitigate the for their attacks. However, the domains registered for malicious threats, tremendous efforts [10, 14, 33, 41, 77] have been devoted purposes can be deterred by existing reputation and blacklisting in the last decade to construct reputation and blacklisting systems systems. In response to the arms race, miscreants have recently that can fend off malicious domains before visited by users. Allof adopted a new strategy, called domain shadowing, to build their these endeavors render it less effective to register new domains attack infrastructures. Specifically, instead of registering new do- for attacks. In response, miscreants have moved forward to more mains, miscreants are beginning to compromise legitimate ones sophisticated and stealthy strategies. and spawn malicious subdomains under them. This has rendered In fact, there is a newly emerging class of attacks adopted by almost all existing countermeasures ineffective and fragile because cybercriminals to build their infrastructure for illicit online activi- subdomains inherit the trust of their apex domains, and attackers ties, domain shadowing, where instead of registering new domains, can virtually spawn an infinite number of shadowed domains. miscreants infiltrate the registrant accounts of legitimate domains In this paper, we conduct the first study to understand and detect and spawn subdomains under them for malicious purposes. Domain this emerging threat. Bootstrapped with a set of manually confirmed shadowing is becoming increasingly popular due to its superior shadowed domains, we identify a set of novel features that uniquely ability to evade detection. The shadowed domains naturally inherit characterize domain shadowing by analyzing the deviation from the trust of a legitimate parent zone, and miscreants can even set their apex domains and the correlation among different apex do- up authentic HTTPS connections with Let’s Encrypt [59]. Even mains. Building upon these features, we train a classifier and apply worse, miscreants can create an infinite number of subdomains it to detect shadowed domains on the daily feeds of VirusTotal, a under many hijacked legitimate domains and rapidly rotate among large open security scanning service. Our study highlights domain them at no cost. This makes it quite challenging to keep blacklists shadowing as an increasingly rampant threat. Moreover, while pre- up-to-date and gather useful information for meaningful analysis. viously confirmed domain shadowing campaigns are exclusively While domain shadowing has been reported in public outlets like involved in exploit kits, we reveal that they are also widely exploited blogs.cisco.com [8, 40], most previous studies only elaborate on for phishing attacks. Finally, we observe that instead of algorith- sporadic cases collected in a short time through manual analysis. mically generating subdomain names, several domain shadowing It is still unclear how serious the threat is and how to address this cases exploit the wildcard DNS records. domain shadowing problem on a larger scale. In this paper, we conduct the first comprehensive study of do- main shadowing in the wild, and we present a novel system to au- 1 INTRODUCTION tomatically detect shadowed domains by addressing the following The domain name system (DNS) serves as one of the most funda- unique challenges. Shadowed domains by design do not present sus- mental Internet components and provides critical naming services picious registration information, and thus all detectors leveraging for mapping domain names to IP addresses. Unfortunately, it has these data [30, 33, 34] can be easily bypassed. Blindly blacklisting also been constantly abused by miscreants for illicit online activities. all sibling subdomains of shadowed domains is also infeasible in For instance, botnets exploit algorithmically generated domains to practice, since it can cause large amounts of collateral damage. Last circumvent the take-down efforts of authorities [11, 65, 86], and but not least, most suspicious DNS patterns identified in previous Permission to make digital or hard copies of all or part of this work for personal or studies do not work well in domain shadowing. For instance, Kopis classroom use is granted without fee provided that copies are not made or distributed [10] analyzes the collective features of all visitors to a domain. How- for profit or commercial advantage and that copies bear this notice and the full citation ever, our study has seen many shadowed domains being visited on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, only once, rendering the collective features insignificant. Such col- to post on servers or to redistribute to lists, requires prior specific permission and/or a lective features can be applied to malicious apex domains1 because fee. Request permissions from [email protected]. CCS ’17, October 30-November 3, 2017, Dallas, TX,USA © 2017 Association for Computing Machinery. 1An apex domain is also known as a bare/base/naked/root domain that is separated ACM ISBN 978-1-4503-4946-8/17/10...$15.00 from the top level domain by a dot, e.g., foo.com, and needs to be purchased from https://doi.org/10.1145/3133956.3134049 registrars. the domain registration cost will become unaffordable if an apex is in phishing attacks. Another interesting finding is that miscreants used only a few times. also exploit the wildcard DNS records to spawn shadowed domains. To bootstrap the design of our detector, we collect a set of 26,132 Roadmap. The remainder of this paper is organized as follows. confirmed shadowed domains under 4,862 distinct zones through Section 2 introduces the background of DNS and shadowed domains. manually searching and reviewing technical reports by security pro- Section 3 presents the design and extracted features of our detector. fessionals. Comparing them with legitimate subdomains, we find In Section 4, we validate the efficacy of our detector using labeled that the shadowed ones can be characterized and distinguished by datasets. We then conduct a large-scale analysis of the shadowed two dimensions. On one hand, shadowed domains usually exhibit domains in Section 5. Section 6 discusses the limitations of our deviant behaviors and are more isolated from those known-good detection approach. Finally, we survey related work in Section 7 subdomains under the same parent zone. For instance, most le- and conclude in Section 8. gitimate domains are hosted on reputable servers, which usually strictly restrict illicit content. Due to the nature of their criminal ac- 2 BACKGROUND tivity and their demand to evade detection and possible take-down, We give a brief overview of the domain system in the beginning shadowed domains have to be hosted on cheap and cybercriminal- of this section. Then, we describe the schema regarding domain friendly servers. This deviation serves as a prominent indicator shadowing attacks and use one real-world case identified by our of potential shadowed domains. On the other hand, miscreants detection system to walk through the attack flow. tend to exploit a set of shadowed domains under different parent zones within the same campaign. This can greatly increase the 2.1 Basics of Domain Names resilience and stealthiness of their infrastructure. However, such correlation also presents suspicious synchronous characteristics. Domain name structure. A domain name is presented in the For instance, shadowed domains in the same campaign usually structure of hierarchical tree (e.g., a.example.com), with each level appear and disappear at the same time. (e.g., example.com) associated with a DNS zone. For one DNS zone, Based on these observations, we develop a novel system, called there is a single manager that oversees the changes of domains Woodpecker, to automatically detect shadowed domains by inspect- within its territory and provides authoritative name-to-address ing the deviation of subdomains to their parent zones and the cor- resolution services through the DNS server. The top of the domain relation of shadowed domains among different zones. In particular, hierarchy is the root zone, which is usually represented as a dot. So we compose 17 features characterizing the usage, hosting, activity, far, the root zone is managed by ICANN, and there are 13 logical and name patterns of subdomains, based on the passive DNS data. root servers operated by 12 organizations. Below the root level is Five classifiers (Support Vector Machine, RandomForest, Logistic the top-level domain (TLD), a label after the rightmost dot in the Regression, Naive Bayes, and Neural Network) are then trained domain name. The commonly used TLDs are divided into three using these features. We achieve a 98.5% detection rate with an ap- groups, including generic TLDs (gTLDs) like .com, country-code proximately 0.1% false positive rate with a 10-fold cross-validation TLDs (ccTLDs) like .uk, and sponsored TLDs (sTLDs) like .jobs. when using RandomForest. Next to TLD is the second-level domain (2LD) (e.g., .example.com), Woodpecker is envisioned to be deployed in several scenarios, which can be directly registered from registrars (like GoDaddy) e.g., domain registrars and upper DNS hierarchy as a complement to if not yet occupied, in most cases. One exception occurs when Kopis [10], generating more accurate indicators about the ongoing both ccTLD and gTLD appear in the domain name, like .co.uk, cybercrimes.