Shades of Grey: On the Effectiveness of Reputation-Based “Blacklists”
Sushant Sinha, Michael Bailey, and Farnam Jahanian
Electrical Engineering and Computer Science Department
University of Michigan, Ann Arbor, MI 48109
{sushant, mibailey, farnam}@umich.edu

Abstract

Malicious code, or malware, executed on compromised hosts provides a platform for a wide variety of attacks against the availability of the network and the privacy and confidentiality of its users. Unfortunately, the most popular techniques for detecting and preventing malware have been shown to be significantly flawed [11], and it is widely believed that a significant fraction of the Internet consists of malware-infected machines [17]. In response, defenders have turned to coarse-grained, reputation-based techniques, such as real-time blackhole lists, for blocking large numbers of potentially malicious hosts and network blocks. In this paper, we perform a preliminary study of one type of reputation-based blacklist, namely those used to block unsolicited email, or spam. We show that, for the network studied, these blacklists exhibit non-trivial false positives and false negatives. We investigate a number of possible causes for this low accuracy and discuss the implications for other types of reputation-based blacklists.

1 Introduction

Current estimates of the number of compromised hosts on the Internet range into the hundreds of millions [17]. Malicious code, or malware, executed on these compromised hosts provides a platform for attackers to perform a wide variety of attacks against networks (e.g., denial of service attacks) and attacks that affect the privacy and confidentiality of end users (e.g., key-logging, phishing, spam) [10]. This ecosystem of malware is both varied and numerous: a recent Microsoft survey reported tens of millions of computers infected with tens of thousands of malware variants in the second half of 2007 alone.

This scale and diversity, along with an increased number of advanced evasion techniques such as polymorphism, have hampered existing detection and removal tools. The most popular of these, host-based anti-virus software, is falling woefully behind, with detection rates as low as 40% [11]. Admitting this failure to completely prevent infections, defenders have looked for new ways to defend against large numbers of persistently compromised computers and the attacks they perform. One technique becoming increasingly popular, especially in the network operations community, is that of reputation-based blacklists. In these blacklists, URLs, hosts, or networks are identified as containing compromised hosts or malicious content. Real-time feeds of these identified hosts, networks, or URLs are provided to organizations, who then use the information to block web access, emails, or all activity to and from the malicious hosts or networks. Currently a large number of organizations provide these services for spam detection (e.g., NJABL [3], SORBS [6], SpamHaus [8] and SpamCop [7]) and for intrusion detection (e.g., DShield [15]). While these techniques have gained prominence, little is known about their effectiveness or potential drawbacks.

In this paper, we present a preliminary study of the effectiveness of reputation-based blacklists. In particular, we examine the most prevalent of these systems, those used for spam detection. Using an oracle, a spam detector called SpamAssassin [1], we identify the spam received by a large academic network consisting of 7,000 unique hosts, with millions of email messages, over a period of 10 days in June of 2008. We examine the effectiveness, in terms of false positives and false negatives, of four blacklists, namely NJABL, SORBS, SpamHaus and SpamCop, and provide an investigation into the sources of the reported inaccuracy. While a preliminary study, this work offers several novel contributions:

• An investigation of email, spam, and spam tool behavior in the context of a large academic network. We found that roughly 80% of the email messages received by our network were spam. The network-level characteristics of spam were also quite different from those of the observed ham. For example, individual sources contributed significantly to the overall ham, but spam was distributed in small quantities across a large number of sources. Conversely, destinations of spam tend to be very targeted when compared to the ham. Using a small number of hand-classified email mailboxes, we also evaluated our oracle, SpamAssassin, to be quite effective, with less than 0.5% false positives and 5% false negatives at the default threshold.

• An analysis of the accuracy of four prevalent spam blacklists. We found that the blacklists studied in our network exhibited a large false negative rate: NJABL had a false negative rate of 98%, SORBS 65%, SpamCop 35% and SpamHaus roughly 36%. The false positive rates of all the blacklists were low except that of SORBS, which had an overall false positive rate of 10%.

• A preliminary study of the causes of inaccuracy and a discussion of the issues as they relate to reputation-based services. We found that while the blacklists agree significantly with each other over what is spam, a significant amount (21%) of the spam is not detected by any of these lists, indicating that the blacklists may not have visibility into a significant portion of the spam space. Second, we found that many spamming sources that went undetected sent very little spam to our network, and that 90% of the undetected sources were observed on the network for just one second. This indicates that these blacklists may not be able to detect low-volume, short-lived spammers. Finally, we found that the blacklists rarely agreed with each other on their false positives, and that many critical mail servers were blacklisted, especially by SORBS. This included 6 Google mail servers that sent a significant amount of ham to our network.

This paper is structured as follows: Section 2 presents background and related work on blacklists, and Section 3 presents our approach to evaluating blacklist effectiveness. Section 4 presents a preliminary evaluation of the blacklists, and we conclude in Section 5.

2 Related Work

Access control devices like firewalls enforce reputation that is statically decided. In recent years, more powerful dynamic reputation-based systems, in the form of blacklists, have evolved. A number of organizations support and generate dynamic blacklists. These organizations include spam blacklist providers like NJABL [3], SORBS [6], SpamHaus [8] and SpamCop [7].

Ramachandran and Feamster [13] collected spam by monitoring mails sent to an unused domain and performed a preliminary analysis of spammers. They observed that spamming sources are clustered within the IP address space and that some of these sources are short-lived. Instead of collecting spam on a single domain, we monitored all emails on an academic network, both spam and ham, using an accurate detector, SpamAssassin.

Spam blacklist providers set up a number of unused email addresses called spamtraps. These spamtraps are not advertised to real users but are infiltrated into spammer lists when spammers scrape the web looking for email addresses. Source IPs that have sent mail to more than a threshold number of spamtraps are then blacklisted.
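To make this mechanism concrete, the following minimal sketch shows one way such a trap-based listing policy could work. It is our illustration, not any provider's actual code; the threshold value and data layout are assumptions, since providers do not publish their cutoffs.

    from collections import defaultdict

    # Hypothetical spamtrap-based blacklisting: a source IP is listed once
    # it has delivered mail to more than SPAMTRAP_THRESHOLD distinct traps.
    SPAMTRAP_THRESHOLD = 3  # illustrative value only

    traps_hit = defaultdict(set)  # source IP -> spamtrap addresses it hit
    blacklist = set()

    def record_delivery(source_ip, recipient, spamtraps):
        """Update the blacklist from one observed mail delivery."""
        if recipient in spamtraps:
            traps_hit[source_ip].add(recipient)
            if len(traps_hit[source_ip]) > SPAMTRAP_THRESHOLD:
                blacklist.add(source_ip)

Counting distinct traps rather than total deliveries makes the policy harder to trip with a single mistaken address, which is presumably why a threshold over unique spamtraps is used.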
Recently, new blacklist generation techniques have been proposed. Ramachandran et al. [14] argue that blacklisting based on spamtraps is often late and incomplete, and they proposed a new method that blacklists source IPs based on their mail-sending patterns. DShield [15] aggregates intrusion detection alerts and firewall logs from a large number of organizations and then publishes a common blacklist consisting of the source IPs and network blocks that cross a certain threshold of events. Zhang et al. [20] argued that a common blacklist may contain entries that are never used in a given organization, so they proposed an approach to reduce the size of the blacklists and possibly the computational overhead of blacklist evaluation. Xie et al. [19] have shown that a large number of IP addresses are dynamically assigned and that mails from these IP addresses are mostly spam, so they recommend adding dynamic IP ranges to blacklists to reduce false negatives. While these methods may be more effective, we evaluated only production spam blacklists in our study.

A number of papers have questioned the effectiveness of blacklists. Ramachandran et al. [12] analyzed how quickly Bobax-infected hosts appeared in the SpamHaus blacklists. They found that a large fraction of these hosts were not found in the blacklist. In this paper, we present the overall consequences of such incompleteness of blacklists. Finally, there have been other innovative uses of blacklists. Venkataraman et al. [16] presented a situation where spammers […]

[…] consulted with the blacklists. A number of spam detectors can be used for our study. The two most popular open-source spam detectors are SpamAssassin [1] and DSpam [4]. DSpam requires manual training of individual mailboxes, so we used SpamAssassin in our experimental setup. SpamAssassin runs a number of individual spam detectors and assigns a score to each. The total score for a message is computed by adding the scores of all detectors that classified the message as spam; if the total score exceeds the default threshold of 5.0, the message is classified as spam. We used the default SpamAssassin configuration that came with the Gentoo Linux [2] distribution, and we configured SpamAssassin with two additional detection modules, Pyzor [5] and Razor [9], to improve its accuracy.
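The sketch below illustrates this additive scoring model. The rule names and weights are invented for illustration and are not SpamAssassin's actual rules; only the 5.0 threshold comes from the text above.

    # Additive scoring in the style of SpamAssassin: every rule that
    # matches a message contributes its weight, and the sum is compared
    # against a fixed threshold.
    RULE_WEIGHTS = {              # hypothetical rule names and weights
        "RAZOR_MATCH": 2.5,
        "PYZOR_MATCH": 2.0,
        "FORGED_RECEIVED": 1.8,
        "SUBJECT_ALL_CAPS": 1.2,
    }
    THRESHOLD = 5.0               # SpamAssassin's default spam threshold

    def classify(fired_rules):
        """Return (is_spam, total_score) for the rules that matched."""
        score = sum(RULE_WEIGHTS[rule] for rule in fired_rules)
        return score > THRESHOLD, score

    # A message matching the Razor, Pyzor, and forged-header rules scores
    # 2.5 + 2.0 + 1.8 = 6.3 > 5.0 and is therefore classified as spam.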
Blacklist lookups are done by reversing the octets of the IP address, appending the blacklist zone (e.g., combined.njabl.org), and then making a DNS lookup.
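For instance, checking 192.0.2.1 against the zone combined.njabl.org means resolving the name 1.2.0.192.combined.njabl.org. A minimal sketch of such a query using only the Python standard library follows; the IP address is an illustrative example from the TEST-NET range, not one drawn from our data.

    import socket

    def dnsbl_listed(ip, zone):
        """Return True if IPv4 address `ip` is listed in the DNSBL `zone`.

        The octets are reversed and the zone appended, so 192.0.2.1
        against combined.njabl.org becomes a lookup of
        1.2.0.192.combined.njabl.org. By convention, DNSBLs answer
        listed IPs with an A record (usually in 127.0.0.0/8) and
        NXDOMAIN otherwise.
        """
        name = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(name)
            return True   # any A record means the IP is listed
        except socket.gaierror:
            return False  # NXDOMAIN: the IP is not listed

    # e.g., dnsbl_listed("192.0.2.1", "combined.njabl.org")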