From Throw-Away Traffic to Bots
Total Page:16
File Type:pdf, Size:1020Kb
From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware Manos Antonakakis‡,∗, Roberto Perdisci†,∗, Yacin Nadji∗, Nikolaos Vasiloglou‡, Saeed Abu-Nimeh‡, Wenke Lee∗ and David Dagon∗ ‡Damballa Inc., †University of Georgia {manos,nvasil,sabunimeh}@damballa.com , [email protected] ∗Georgia Institute of Technology {yacin.nadji, wenke}@cc.gatech.edu, [email protected] Abstract 1 Introduction Many botnet detection systems employ a blacklist of Botnets are groups of malware-compromised ma- known command and control (C&C) domains to detect chines, or bots, that can be remotely controlled by an bots and block their traffic. Similar to signature-based attacker (the botmaster) through a command and control virus detection, such a botnet detection approach is static (C&C) communication channel. Botnets have become because the blacklist is updated only after running an ex- the main platform for cyber-criminals to send spam, steal ternal (and often manual) process of domain discovery. private information, host phishing web-pages, etc. Over As a response, botmasters have begun employing domain time, attackers have developed C&C channels with dif- generation algorithms (DGAs) to dynamically produce a ferent network structures. Most botnets today rely on large number of random domain names and select a small a centralized C&C server, whereby bots query a prede- subset for actual C&C use. That is, a C&C domain is ran- fined C&C domain name that resolves to the IP address domly generated and used for a very short period of time, of the C&C server from which commands will be re- thus rendering detection approaches that rely on static ceived. Such centralized C&C structures suffer from the domain lists ineffective. Naturally, if we know how a do- single point of failure problem because if the C&C do- main generation algorithm works, we can generate the main is identified and taken down, the botmaster loses domains ahead of time and still identify and block bot- control over the entire botnet. net C&C traffic. The existing solutions are largely based To overcome this limitation, attackers have used P2P- on reverse engineering of the bot malware executables, based C&C structures in botnets such as Nugache [35], which is not always feasible. Storm [38], and more recently Waledac [39], Zeus [2], In this paper we present a new technique to detect ran- and Alureon (a.k.a. TDL4) [12]. While P2P botnets domly generated domains without reversing. Our insight provide a more robust C&C structure that is difficult to is that most of the DGA-generated (random) domains detect and take down, they are typically harder to imple- that a bot queries would result in Non-Existent Domain ment and maintain. In an effort to combine the simplicity (NXDomain) responses, and that bots from the same bot- of centralized C&Cs with the robustness of P2P-based net (with the same DGA algorithm) would generate sim- structures, attackers have recently developed a number ilar NXDomain traffic. Our approach uses a combination of botnets that locate their C&C server through automat- of clustering and classification algorithms. The cluster- ically generated pseudo-random domains names. In or- ing algorithm clusters domains based on the similarity in der to contact the botmaster, each bot periodically exe- the make-ups of domain names as well as the groups of cutes a domain generation algorithm (DGA) that, given machines that queried these domains. The classification a random seed (e.g., the current date), produces a list of algorithm is used to assign the generated clusters to mod- candidate C&C domains. The bot then attempts to re- els of known DGAs. If a cluster cannot be assigned to a solve these domain names by sending DNS queries un- known model, then a new model is produced, indicating til one of the domains resolves to the IP address of a a new DGA variant or family. We implemented a pro- C&C server. This strategy provides a remarkable level totype system and evaluated it on real-world DNS traffic of agility because even if one or more C&C domain obtained from large ISPs in North America. We report names or IP addresses are identified and taken down, the the discovery of twelve DGAs. Half of them are variants bots will eventually get the IP address of the relocated of known (botnet) DGAs, and the other half are brand C&C server via DNS queries to the next set of automat- new DGAs that have never been reported before. ically generated domains. Notable examples of DGA- based botnets (or DGA-bots, for short) are Bobax [33], ages throw-away traffic (i.e., unsuccessful DNS resolu- Kraken [29], Sinowal (a.k.a. Torpig) [34], Srizbi [30], tions) to (1) discover the rise of new DGA-based botnets, Conficker-A/B [26], Conficker-C [23] and Murofet [31]. (2) accurately detect bot-compromised machines, and (3) A defender can attempt to reverse engineer the bot mal- identify and block the active C&C domains queried by ware, particularly its DGA algorithm, to pre-compute the discovered DGA-bots. Pleiades achieves these goals current and future candidate C&C domains in order to by monitoring the DNS traffic in local networks, without detect, block, and even take down the botnet. However, the need for a large-scale deployment of DNS analysis reverse engineering is not always feasible because the bot tools required by prior work. malware can be updated very quickly (e.g., hourly) and Furthermore, while botnet detection systems that fo- obfuscated (e.g., encrypted, and only decrypted and exe- cus on network flow analysis [13, 36, 44, 46] or require cuted by external triggers such as time). deep packet inspection [10, 14] may be capable of de- In this paper, we propose a novel detection system, tecting compromised machines within a local network, called Pleiades, to identify DGA-based bots within a they do not scale well to the overwhelming volume of monitored network without reverse engineering the bot traffic typical of large ISP environments. On the other malware. Pleiades is placed “below” the local recursive hand, Pleiades employs a lightweight DNS-based moni- DNS (RDNS) server or at the edge of a network to mon- toring approach, and can detect DGA-based malware by itor DNS query/response messages from/to the machines focusing on a small fraction of all DNS traffic in an ISP within the network. Specifically, Pleiades analyzes DNS network. This allows Pleiades to scale well to very large queries for domain names that result in Name Error re- ISP networks, where we evaluated our prototype system. sponses [19], also called NXDOMAIN responses, i.e., do- This paper makes the following contributions: main names for which no IP addresses (or other resource records) exist. In the remainder of this paper, we refer • We propose Pleiades, the first DGA-based bot- to these domain names as NXDomains. The focus on net identification system that efficiently analyzes NXDomains is motivated by the fact that modern DGA- streams of unsuccessful domain name resolutions, bots tend to query large sets of domain names among or NXDomains, in large ISP networks to automati- which relatively few successfully resolve to the IP ad- cally identify DGA-bots. dress of the C&C server. Therefore, to automatically identify DGA domain names, Pleiades searches for rela- tively large clusters of NXDomains that (i) have similar • We built a prototype implementation of Pleiades, syntactic features, and (ii) are queried by multiple po- and evaluated its DGA identification accuracy over tentially compromised machines during a given epoch. a large labeled dataset consisting of a mix of NX- The intuition is that in a large network, like the ISP net- Domains generated by four different known DGA- work where we ran our experiments, multiple hosts may based botnets and NXDomains “accidentally” gen- be compromised with the same DGA-bots. Therefore, erated by typos or mis-configurations. Our experi- each of these compromised assets will generate several ments demonstrate that Pleiades can accurately de- DNS queries resulting in NXDomains, and a subset of tect DGA-bots. these NXDomains will likely be queried by more than one compromised machine. Pleiades is able to automat- • We deployed and evaluated our Pleiades prototype ically identify and filter out “accidental”, user-generated in a large production ISP network for a period of 15 NXDomains due to typos or mis-configurations. When months. Our experiments discovered twelve new Pleiades finds a cluster of NXDomains, it applies statis- DGA-based botnets and enumerated the compro- tical learning techniques to build a model of the DGA. mised machines. Half of these new DGAs have This is used later to detect future compromised ma- never been reported before. chines running the same DGA and to detect active do- main names that “look similar” to NXDomains resulting The remainder of the paper is organized as follows. from the DGA and therefore probably point to the botnet In Section 2 we discuss related work. We provide an C&C server’s address. overview of Pleiades in Section 3. The DGA discovery Pleiades has the advantage of being able to discover process is described in Section 4. Section 5 describes the and model new DGAs without labor-intensive malware DGA classification and C&C detection processes. We reverse-engineering. This allows our system to detect elaborate on the properties of the datasets used and the new DGA-bots before any sample of the related malware way we obtained the ground truth in Section 6. The ex- family is captured and analyzed. Unlike previous work perimental results are presented in Section 7 while we on DNS traffic analysis for detecting malware-related [4] discuss the limitations of our systems in Section 8.