Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces

Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces Roberto Perdiscia;b, Wenke Leea, and Nick Feamstera aCollege of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA bDamballa, Inc. Atlanta, GA 30308, USA [email protected], fwenke, [email protected] Abstract be used to detect future malware variants with low false positives and false negatives. We present a novel network-level behavioral malware Network-level signatures have some attractive proper- clustering system. We focus on analyzing the structural ties compared to system-level signatures. For example, similarities among malicious HTTP traffic traces gener- enforcing system-level behavioral signatures often re- ated by executing HTTP-based malware. Our work is quires the use of virtualized environments and expensive motivated by the need to provide quality input to algo- dynamic analysis [21, 34]. On the other hand, network- rithms that automatically generate network signatures. level signatures are usually easier to deploy because we Accordingly, we define similarity metrics among HTTP can take advantage of existing network monitoring in- traces and develop our system so that the resulting clus- frastructures (e.g., intrusion detection systems and alert ters can yield high-quality malware signatures. monitoring tools), and monitor a large number of ma- We implemented a proof-of-concept version of our chines without introducing overhead at the end hosts. network-level malware clustering system and performed experiments with more than 25,000 distinct malware The vast majority of malware needs a network con- samples. Results from our evaluation, which includes nection in order to perpetrate their malicious activities real-world deployment, confirm the effectiveness of the (e.g., sending spam, exfiltrating private data, download- proposed clustering system and show that our approach ing malware updates, etc.). In this paper, we focus on network-level can aid the process of automatically extracting net- behavioral clustering of HTTP-based mal- work signatures for detecting HTTP traffic generated by ware, namely, malware that uses the HTTP protocol as malware-compromised machines. its main means of communicating with the attacker or perpetrating their malicious intents. HTTP-based malware is becoming more prevalent. 1 Introduction For example, according to [20] the majority of spam The battle against malicious software (a.k.a. malware) botnets use HTTP to communicate with their command is becoming more difficult. Today’s malware writers and control (C&C) server. Also, from our own mal- commonly use executable packing [16] and other code ware database, we found that among the malware sam- obfuscation techniques to generate a large number of ples that show network activities, about 75% of them polymorphic variants of the same malware. As a con- generate some HTTP traffic. In addition, there is evi- sequence, anti-viruses (AVs) have a hard time keeping dence that Web-based “reusable” kits (or platforms) for their signature database up-to-date, and their AV scan- remote command of malware, and in particular botnets, ners often have many false negatives [26]. are available for sale on the Internet [14] (e.g., the C&C Although it is easy to create many polymorphic vari- Web kit for Zeus bots can be currently purchased for ants of a given malware sample, different variants of the about $700 [8]). same malware will exhibit similar malicious activities, Given a large dataset of malware samples and the ma- when executed. Behavioral malware clustering groups licious HTTP traffic they generate, our network-level be- malware variants according to similarities in their ma- havioral clustering system aims at unveiling similarities licious behavior. This process is particularly useful be- (or relationships) among malware samples that may not cause once a number of different variants of the same be captured by current system-level behavioral clustering malware have been identified and grouped together, it is systems [9, 10], thus offering a new point of view and easier to write a generic behavioral signature that can valuable information to malware analysts. Unlike pre- 1 vious work on behavioral malware clustering, our work proxy (whereby the TCP port used may vary), but have is motivated by the need to provide quality input to al- strong similarities in terms of the structure and sequence gorithms that automatically generate network signatures. of the HTTP queries they perform (e.g., because they rely Accordingly, we define similarity metrics among HTTP on the same C&C Web kit). Also, we develop our behav- traffic traces and develop our clustering system so that ioral clustering algorithm so that the results can be used the resulting clusters can yield high quality malware sig- to automatically generate network signatures for detect- natures. Namely, after clustering is completed, the HTTP ing malicious network activities, as opposed to system- traffic generated by malware samples in the same cluster level signatures. can be processed by an automatic signature generation Automatic generation of network signatures has been tool, in order to extract network signatures that model the explored in various previous work [23, 24, 29, 32, 33]. HTTP behavior of all the malware variants in that cluster. Most of these studies focused mainly on worm finger- An Intrusion Detection System (IDS) located at the edge printing. Different approaches have been proposed to of a network can in turn deploy such network signatures deal with generating signatures from a dataset of network to detect malware-related outbound HTTP requests. flows related to the propagation of different worms. In The main contributions of this paper are as follows: particular, Polygraph [24] applies clustering techniques • We propose a novel network-level behavioral mal- to try to separate worm flows belonging to different ware clustering system based on the analysis of worms, before generating the signatures. However, Poly- structural similarities among malicious HTTP traf- graph’s clustering algorithm is greedy and becomes pro- fic traces generated by different malware samples. hibitively expensive when dealing with the high number • We introduce a new automated method for ana- of malicious flows generated by a large dataset of differ- lyzing the results of behavioral malware clustering ent types of malware, as we will discuss in Section 6.2. based on a comparison with family names assigned Since behavioral malware clustering aims at efficiently to the malware samples by multiple AVs. clustering large datasets of different malware samples • We show that the proposed system enables accurate (including bots, adware, spyware, etc., beside Worms), and efficient automatic generation of network-level the clustering approaches proposed for worm fingerprint- malware signatures, which can complement tradi- ing are not suitable for this task. Compared with [24] and tional AVs and other defense techniques. other previous work on worm fingerprinting, we focus • We implemented a proof-of-concept version of our on clustering of different types of HTTP-based malware malware clustering system and performed experi- (not only worms) in an efficient manner. ments with more than 25,000 malware samples. Re- BotMiner [15], an anomaly-based botnet detection sults from our evaluation, which includes real-world system, applies clustering of network flows to detect deployment, confirm the effectiveness of the pro- the presence of bot-compromised machines within en- posed clustering system. terprise networks. BotMiner uses high-level statistics for clustering network flows, and is limited to detecting bot- 2 Related Work nets. On the other hand, in this paper we focus on the System-level behavioral malware clustering has been behavioral clustering of generic malware samples based recently studied in [9, 10]. In particular, Bayer et al. [10] on structural similarities among their HTTP traffic traces, proposed a scalable malware clustering algorithm based and on modeling the network behavior of the discovered on malware behavior expressed in terms of detailed sys- malware families by extracting network-level malware tem events. However, the network information they use detection signatures. is limited to high-level features such as the names of downloaded files, the type of protocol, and the domain 3 HTTP-Based Behavioral Clustering name of the server. Our work is different because we fo- The objective of our system is to find groups of mal- cus on the malicious HTTP traffic traces generated by ware that interact with the Web in a similar way, learn executing different malware samples. We extract de- a network behavior model for each group (or family) of tailed information from the network traces, such as the malware, and then use such models to detect the pres- number and type of HTTP queries, the length and struc- ence of malware-compromised machines in a monitored tural similarities among URLs, the length of data sent network. Towards this end, we first perform behavioral and received from the HTTP server, etc. Compared with clustering of malware samples by finding structural sim- Bayer et al. [10], we do not consider the specific TCP ilarities between the sequences of HTTP requests gen- port and domain names used by the malware. We aim erated as a consequence of infection. Namely, given a (i) to group together malware variants that may contact dif- dataset of malware samples M = fm gi=1::N , we ex- ferent web servers (e.g., because they are controlled by ecute each sample m(i) in a controlled environment sim- a different attacker), and may or may not use an HTTP ilar to BotLab [20] for a time T , and we store its HTTP 2 Figure 1: Overview of our HTTP-based behavioral malware clustering system. traffic trace H(m(i)). We then partition M into clusters a large number of current and future malware sam- according to a definition of structural similarity among ples. To achieve this goal, after fine-grained cluster- the HTTP traffic traces H(m(i)); i = 1; ::; N.

Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces

Botection: Bot Detection by Building Markov Chain Models of Bots Network Behavior Bushra A

2015 Threat Report Provides a Comprehensive Overview of the Cyber Threat Landscape Facing Both Companies and Individuals

Power-Law Properties in Indonesia Internet Traffic. Why Do We Care About It

Transition Analysis of Cyber Attacks Based on Long-Term Observation—

Symantec Intelligence Report: June 2011

An Adaptive Multi-Layer Botnet Detection Technique Using Machine Learning Classiﬁers

Fortinet Threat Landscape Report Q3 2017

Download Hong Kong Security Watch Report

CONTENTS in THIS ISSUE Fighting Malware and Spam

CONTENTS in THIS ISSUE Fighting Malware and Spam

Botnet Detection Using On-Line Clustering with Pursuit Reinforcement Competitive Learning (PRCL)

The Real Face of KOOBFACE: the Largest Web 2.0 Botnet Explained