Locating Spambots on the Internet
BOTMAGNIFIER: Locating Spambots on the Internet

Gianluca Stringhini§, Thorsten Holz‡, Brett Stone-Gross§, Christopher Kruegel§, and Giovanni Vigna§
§University of California, Santa Barbara ‡Ruhr-University Bochum
fgianluca,bstone,chris,[email protected] [email protected]

Abstract

Unsolicited bulk email (spam) is used by cyber-criminals to lure users into scams and to spread malware infections. Most of these unwanted messages are sent by spam botnets, which are networks of compromised machines under the control of a single (malicious) entity. Often, these botnets are rented out to particular groups to carry out spam campaigns, in which similar mail messages are sent to a large group of Internet users in a short amount of time. Tracking the bot-infected hosts that participate in spam campaigns, and attributing these hosts to spam botnets that are active on the Internet, are challenging but important tasks. In particular, this information can improve blacklist-based spam defenses and guide botnet mitigation efforts.

In this paper, we present a novel technique to support the identification and tracking of bots that send spam. Our technique takes as input an initial set of IP addresses that are known to be associated with spam bots, and learns their spamming behavior. This initial set is then “magnified” by analyzing large-scale mail delivery logs to identify other hosts on the Internet whose behavior is similar to the behavior previously modeled. We implemented our technique in a tool, called BOTMAGNIFIER, and applied it to several data streams related to the delivery of email traffic. Our results show that it is possible to identify and track a substantial number of spam bots by using our magnification technique. We also perform attribution of the identified spam hosts and track the evolution and activity of well-known spamming botnets over time. Moreover, we show that our results can help to improve state-of-the-art spam blacklists.

1 Introduction

Email spam is one of the open problems in the area of IT security, and has attracted a significant amount of research over many years [11, 26, 28, 40, 42]. Unsolicited bulk email messages account for almost 90% of the world-wide email traffic [20], and a lucrative business has emerged around them [12]. The content of spam emails lures users into scams, promises to sell cheap goods and pharmaceutical products, and spreads malicious software by distributing links to websites that perform drive-by download attacks [24].

Recent studies indicate that, nowadays, about 85% of the overall spam traffic on the Internet is sent with the help of spamming botnets [20, 36]. Botnets are networks of compromised machines under the direction of a single entity, the so-called botmaster. While different botnets serve different, nefarious goals, one important purpose of botnets is the distribution of spam emails. The reason is that botnets provide two advantages for spammers. First, a botnet serves as a convenient infrastructure for sending out large quantities of messages; it is essentially a large, distributed computing system with massive bandwidth. A botmaster can send out tens of millions of emails within a few hours using thousands of infected machines. Second, a botnet allows an attacker to evade spam filtering techniques based on the sender IP addresses. The reason is that the IP addresses of some infected machines change frequently (e.g., due to the expiration of a DHCP lease, or to the change in network location in the case of an infected portable computer). Moreover, it is easy to infect machines and recruit them as new members into a botnet. This means that blacklists need to be updated constantly by tracking the IP addresses of spamming bots.

Tracking spambots is challenging. One approach to detect infected machines is to set up spam traps. These are fake email addresses (i.e., addresses not associated with real users) that are published throughout the Internet with the purpose of attracting and collecting spam messages. By extracting the sender IP addresses from the emails received by a spam trap, it is possible to obtain a list of bot-infected machines. However, this approach faces two main problems. First, it is likely that only a subset of the bots belonging to a certain botnet will send emails to the spam trap addresses. Therefore, the analysis of the messages collected by the spam trap can provide only a partial view of the activity of the botnet. Second, some botnets might only target users located in a specific country (e.g., due to the language used in the email), and thus a spam trap located in a different country would not observe those bots.

Other approaches to identify the hosts that are part of a spamming botnet are specific to particular botnets. For example, by taking control of the command & control (C&C) component of a botnet [21, 26], or by analyzing the communication protocol used by the bots to interact with other components of the infrastructure [6, 15, 32], it is possible to enumerate (a subset of) the IP addresses of the hosts that are part of a botnet. However, in these cases, the results are specific to the particular botnet that is being targeted (and, typically, the type of C&C used).

In this paper, we present a novel approach to identify and track spambot populations on the Internet. Our ambitious goal is to track the IP addresses of all active hosts that belong to every spamming botnet. By active hosts, we mean hosts that are online and that participate in spam campaigns. Comprehensive tracking of the IP addresses belonging to spamming botnets is useful for several reasons:

• Internet Service Providers can take countermeasures to prevent the bots whose IP addresses reside in their networks from sending out email messages.
• Organizations can clean up compromised machines in their networks.
• Existing blacklists and systems that analyze network-level features of emails can be improved by providing accurate information about machines that are currently sending out spam emails.
• By monitoring the number of bots that are part of different botnets, it is possible to guide and support mitigation efforts so that the C&C infrastructures of the largest, most aggressive, or fastest-growing botnets are targeted first.

Our approach to tracking spamming bots is based on the following insight: bots that belong to the same botnet share the same C&C infrastructure and the same code base. As a result, these bots will feature similar behavior when sending spam [9, 40, 41]. In contrast, bots belonging to different spamming botnets will typically use different parameters for sending spam mails (e.g., the size of the target email address list, the domains or countries that are targeted, the spam contents, or the timing of their [...]

[...] to be a set of email messages that share a substantial amount of content and structure (e.g., a spam campaign might involve the distribution of messages that promote a specific pharmaceutical scam).

Input datasets. At a high level, our approach takes two datasets as input. The first dataset contains the IP addresses of known spamming bots that are active during a certain time period (we call this time period the observation period). The IP addresses are grouped by spam campaign. That is, IP addresses in the same group sent the same type of messages. We refer to these groups of IP addresses as seed pools. The second dataset is a log of email transactions carried out on the Internet during the same time period. This log, called the transaction log, contains entries that specify that, at a certain time, IP address C attempted to send an email message to IP address S. The log does not need to be a complete log of every email transaction on the Internet (as it would be unfeasible to collect this information). However, as we will discuss later, our approach becomes more effective as this log becomes more comprehensive.

Approach. In the first step of our approach, we search the transaction log for entries in which the sender IP address is one of the IP addresses in the seed pools (i.e., the known spambots). Then, we analyze these entries and generate a number of behavioral profiles that capture the way in which the hosts in the seed pools sent emails during the observation period.

In the second step of the approach, the whole transaction log is searched for patterns of behavior that are similar to the spambot behavior previously learned from the seed pools. The hosts that behave in a similar manner are flagged as possible spamming bots, and their IP addresses are added to the corresponding magnified pool.

In the third and final step, heuristics are applied to reduce false positives and to assign spam campaigns (and the IP addresses of bots) to specific botnets (e.g., Rustock [5], Cutwail [35], or MegaD [4, 6]).

We implemented our approach in a tool, called BOTMAGNIFIER. In order to populate our seed pools, we used data from a large spam trap set up by an Internet Service Provider (ISP). Our transaction logs were constructed by running a mirror for Spamhaus, a popular DNS-based blacklist. Note that other sources of information can be used to either populate the seed pools or to build a transaction log. As we will show, BOTMAGNIFIER also works for transaction logs extracted from netflow data collected from a large ISP’s backbone routers.
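
To make the data flow of the first two steps concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions and not the BOTMAGNIFIER implementation: the function names, the `(timestamp, sender, destination)` tuple layout of the transaction log, and the `min_overlap` threshold are all hypothetical. The behavioral profile is reduced here to the set of destination servers contacted by a seed pool, and "similar behavior" to a simple overlap ratio with that set. The third step (false-positive reduction and botnet attribution) is not modeled.

```python
from collections import defaultdict


def build_profiles(seed_pools, transaction_log):
    """Step 1 (simplified): for each seed pool, collect the set of
    destination servers S contacted by the pool's known spambots."""
    profiles = {campaign: set() for campaign in seed_pools}
    for _timestamp, sender, destination in transaction_log:
        for campaign, seed_ips in seed_pools.items():
            if sender in seed_ips:
                profiles[campaign].add(destination)
    return profiles


def magnify(seed_pools, transaction_log, min_overlap=0.5):
    """Step 2 (simplified): flag hosts outside a seed pool that contacted
    at least `min_overlap` of the destinations in that pool's profile.
    The overlap criterion is an assumption made for this sketch."""
    profiles = build_profiles(seed_pools, transaction_log)

    # Group the whole transaction log by sender IP address C.
    contacted = defaultdict(set)
    for _timestamp, sender, destination in transaction_log:
        contacted[sender].add(destination)

    magnified = {}
    for campaign, targets in profiles.items():
        pool = set()
        if targets:  # a pool never seen in the log yields an empty profile
            for sender, destinations in contacted.items():
                if sender in seed_pools[campaign]:
                    continue  # already known; only new hosts are added
                overlap = len(destinations & targets) / len(targets)
                if overlap >= min_overlap:
                    pool.add(sender)
        magnified[campaign] = pool
    return magnified


# Toy data using documentation-only address ranges (RFC 5737).
seed_pools = {"campaign_a": {"192.0.2.1", "192.0.2.2"}}
transaction_log = [
    (1, "192.0.2.1", "198.51.100.10"),
    (2, "192.0.2.2", "198.51.100.11"),
    (3, "203.0.113.5", "198.51.100.10"),
    (4, "203.0.113.5", "198.51.100.11"),
    (5, "203.0.113.9", "198.51.100.99"),
]
print(magnify(seed_pools, transaction_log))
```

In this toy log, host 203.0.113.5 contacts both destinations targeted by the seed pool and is added to the magnified pool, whereas 203.0.113.9 contacts an unrelated destination and is not. A realistic profile would presumably need to be more selective than a plain overlap ratio, since popular mail servers are contacted by many legitimate senders as well.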