Spamming Botnets: Signatures and Characteristics
Total Page:16
File Type:pdf, Size:1020Kb
Spamming Botnets: Signatures and Characteristics Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten+,IvanOsipkov+ Microsoft Research, Silicon Valley +Microsoft Corporation {yxie,fangyu,kachan,rina,ghulten,ivano}@microsoft.com ABSTRACT botnet infection and their associated control process [4, 17, 6], little In this paper, we focus on characterizing spamming botnets by effort has been devoted to understanding the aggregate behaviors of leveraging both spam payload and spam server traffic properties. botnets from the perspective of large email servers that are popular Towards this goal, we developed a spam signature generation frame- targets of botnet spam attacks. work called AutoRE to detect botnet-based spam emails and botnet An important goal of this paper is to perform a large scale analy- membership. AutoRE does not require pre-classified training data sis of spamming botnet characteristics and identify trends that can or white lists. Moreover, it outputs high quality regular expression benefit future botnet detection and defense mechanisms. In our signatures that can detect botnet spam with a low false positive rate. analysis, we make use of an email dataset collected from a large Using a three-month sample of emails from Hotmail, AutoRE suc- email service provider, namely, MSN Hotmail. Our study not only cessfully identified 7,721 botnet-based spam campaigns together detects botnet membership across the Internet, but also tracks the with 340,050 unique botnet host IP addresses. sending behavior and the associated email content patterns that are Our in-depth analysis of the identified botnets revealed several directly observable from an email service provider. Information interesting findings regarding the degree of email obfuscation, prop- pertaining to botnet membership can be used to prevent future ne- erties of botnet IP addresses, sending patterns, and their correlation farious activities such as phishing and DDoS attacks. Understand- with network scanning traffic. We believe these observations are ing the email sending behavior of botnets can facilitate the devel- useful information in the design of botnet detection schemes. opment of new botnet detection techniques. Our investigation is based on a novel framework called AutoRE that identifies botnet hosts by generating botnet spam signatures Categories and Subject Descriptors from emails. AutoRE is motivated in part by the recent success C.2.3 [Computer Communication Networks]: Network Opera- of signature based worm and virus detection systems (e.g., [12, tions—network management; C.2.0 [Computer Communication 21, 16, 15, 13]). The framework is based on the premise that botnet Networks]: General—security and protection spam emails are often sent in an aggregate fashion, resulting in con- tent prevalence similar to the worm propagation case. In particular, General Terms we focus primarily on URLs embedded in email content because they form the most critical part of spam emails – URLs play an Algorithms, Measurement, Security important role in directing users to phishing Web pages or targeted product Web sites [2] 1. Keywords However, the following two observations make it challenging to derive URL signatures that distinguish botnet spam from others. Spam, botnet, regular expression, signature generation First, spam emails often contain multiple URLs, some of which are legitimate and very general (e.g., http://www.w3.org). The 1. INTRODUCTION mixture of legitimate and spam URLs in an email requires us to Botnets have been widely used for sending spam emails at a large clearly separate them. Second, spammers deliberately add random- scale [14, 4, 19, 24]. By programming a large number of distributed ness into URLs to evade detection. Therefore, sifting through poly- bots, spammers can effectively transmit thousands of spam emails morphic URLs to identify common patterns is a critical task. in a short duration. To date, detecting and blacklisting individual AutoRE addresses the first challenge by iteratively selecting spam bots is commonly regarded as difficult, due to both the transient URLs based on the distributed yet bursty property of botnets-based nature of the attack and the fact that each bot may send only a spam campaigns. AutoRE does not require labeled data or whitelists, few spam emails. Furthermore, despite the increasing awareness of a common necessity in most previous solutions. AutoRE further outputs regular expression signatures that are different from tradi- tional worm signatures that consist of either fixed strings or token conjunctions (token1.*token2.*token2). Compared with complete Permission to make digital or hard copies of all or part of this work for URL (fixed string) based signatures, regular expression signatures personal or classroom use is granted without fee provided that copies are are more robust and can detect 10 times more spam emails. Com- not made or distributed for profit or commercial advantage and that copies pared with token conjunction based signatures, regular expression bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific 1 permission and/or a fee. Based on an analysis of sampled emails sent to Hotmail, we found SIGCOMM’08, August 17–22, 2008, Seattle, Washington, USA. that 74.1% of spam emails contained at least one URL (with the Copyright 2008 ACM 978-1-60558-175-0/08/08 ...$5.00. remainder mostly geared towards campaigns for penny stocks) signatures can significantly reduce the false positive rate of detect- referred to as Command and Control (C&C) servers. Due to the ing polymorphic URLs (by 10 to 30 times in our experiments). increased use of botnets for launching large-scale network attacks, Furthermore, AutoRE uses the generated spam URL signatures several recent studies have looked at different aspects of bot activ- to group emails into spam campaigns, where a campaign refers to ities: infection process [6], communication channels between bots a targeted spam effort to a single product or service. In this paper, and C&C servers [4, 17], and propagation strategies [5, 10, 11]. we identify spam campaigns originating from botnets. Using three Identification and prevention of email spam that originate from months of sampled emails from Hotmail, AutoRE successfully de- botnets is the primary focus of this paper. In this problem space, tected 7,721 spam campaigns that originated from 340,050 distinct Ramachandran et al. [19] performed a large scale study of the network- botnet host IP addresses spanning 5,916 ASes. Below, we briefly behavior of spammers, providing strong evidence that botnets are summarize several desirable features of AutoRE: commonly used as platforms for sending spam. Anderson et al. • Low false positive rate: Using AutoRE signatures, we identi- studied interesting characteristics of Internet scam infrastructures fied 580,466 spam emails with a false positive rate of 0.002. using Spamscatter [2], a system that analyzes spam email URLs. AutoRE’s false positive rate in detecting botnet hosts is less Webb et al. studied the link between spam emails and Web spam than 0.005. using a large Web spam corpus [23]. More recently, Ramachan- • Ability to detect stealthy botnet-based spam: AutoRE detects dran et al. proposed ways to infer botnet membership and identify 16-18% of spam that bypassed well known blacklists (e.g., spammers by monitoring queries to DNSBL [18] and by clustering Spamhaus [22]) that are deployed by major mail providers. email servers based on their target email destination domains [20]. • Ability to detect frequent domain modifications: Using domain- These approaches have all provided insight into various aspects agnostic signatures, AutoRE is able to capture spam URLs of spamming activities and successfully explored opportunities for even if spammers adopt new domains. This enables us to monitoring different spam traffic. In contrast, our work focuses on filter 15 times more spam than using domain specific signa- the problem of not just detecting botnet hosts, but also correctly tures. grouping them based on spam campaigns. We hope such a collec- tive view can shed light on how botnets operate and evolve as an Another important contribution of our work is an in-depth analy- aggregate to facilitate the detection of future attacks. We adopt the sis of identified spamming botnet characteristics and their activity generic perspective of an email server receiving incoming traffic trends. Our key findings include: destined to a single domain without additional communication to (1) By comparing botnet statistics in July 2007 to those obtained other infrastructure servers or services. Any solution in such a set- in Nov 2006, we noticed that the number of spam campaigns dou- ting could potentially be adopted autonomously by email service bled, while the total number of botnet IPs increased by only 10%. providers or ISPs. This indicates that botnets are becoming an increasingly popular In a similar context, Zhuang et al. showed that the similarity of mechanism for spam delivery and that one botnet host is involved email texts can help identify botnet-based spam campaigns [25]. in multiple attacks. Li and Hsish performed a measurement study where they exam- (2) Even though a spam campaign directs email users to the ined various content features including spam URL links [14]. They same/similar set of destination Web pages, the text content of the found that spam emails with identical URLs are highly