
Semi-Supervised Learning on Email Characteristics for Novel Worm Detection

Steve Martin and Anil Sewani
{steve0, anil}@cs.berkeley.edu
University of California, Berkeley

A major drawback of unsupervised learning for worm detection is the possibility of false negatives. Previous work copes with this problem by increasing the sensitivity of the unsupervised classification algorithms. This, in turn, creates many more false positives. Our focus is narrowed to worms propagating through email.

We present the following contributions. First, we examine a wide range of features calculated on email traffic to determine indicators that discriminate infected from normal email behavior. Using these features, we next present a new method that uses semi-supervised learning for adaptive virus detection that leverages system administrator feedback to improve classification. Our approach combines the strengths of sensitive novelty detection with a parametric classifier to drastically reduce the false positives.

1 Introduction

One of the most prevalent security problems in computing today is the rampant proliferation of malicious, self-propagating computer viruses known as worms. As networks become increasingly ubiquitous, these programs can infect more machines than ever before, with each new outbreak causing staggering amounts of damage. As an example, in 2001, two worms named Nimda and Code Red spread rapidly throughout the internet, with cost estimates for lost productivity in the billions of dollars [3]. Last year, the MyDoom and MS Blaster worms combined for estimated damages in the tens of billions of dollars in the United States alone [2]. These recent events serve to highlight a disturbing trend: as industry, governments, and militaries grow more reliant on computer networks, the potential for a truly crippling virus attack is becoming immense.

While protection against worms continues to be an area of intense research, traditional antivirus defenses deployed in the field have not changed significantly for many years. Recent worms such as MS Blaster and Sapphire have demonstrated that in the time it takes for an administrator to reconfigure a firewall and update antivirus scanners, widespread infection among vulnerable hosts can already occur. To make matters worse, contemporary worms spread at extremely fast rates; it has been shown that it is reasonably trivial to create self-propagating internet viruses that can infect up to a million hosts in about thirty seconds [19].

It is clear that the key to stopping a novel worm from becoming widespread is to choke off its avenues for infection as quickly as possible. One general approach is to place specific restrictions on the network activity a machine can conduct over a given length of time. This idea, called a virus throttle, works well in certain scenarios, but is also overly restrictive and inflexible to changes in network utilization [5, 8].

An alternate idea is to learn on network transmissions to attempt to discover and isolate infected machines before an epidemic can set in. Previous work has used unsupervised learning on network behavior to attempt to detect worm propagation. However, because false negatives are highly undesirable in virus detection, these systems are in general purposefully configured to be overly sensitive. This trades a reduction in false negatives for an excessive number of false positives.

To this end, we present the following contributions. First, we examine a wide range of features calculated on email traffic to determine what indicators best separate virus from normal email behavior. Using these features, we next present a new method for doing semi-supervised learning that leverages system administrator involvement to improve classification. This is done via an approach that combines the strengths of sensitive novelty detection with a parametric classifier that drastically reduces the false positives.

This paper presents the results of our work. Section 2 gives a brief overview of previous work related to this paper. Next we discuss the overall structure of our application in section 3, followed by a description of the numerical features we calculate on email traffic in section 4. Section 5 describes the parametric classifier. Section 6 gives some preliminary results and discussion, and we close with some conclusions and ideas for future research.

2 Previous Work

Current work on containing worms (including email worms) has largely been based on automatic detection of abnormal network behavior. The La Brea project, for example, attempts to slow the growth of TCP-based worms by slowing probes to unallocated addresses [21]. This is accomplished by blocking the thread making the probe. This technique, however, can be easily circumvented by running the virus asynchronously.

Williamson et al. proposed per-host TCP throttling by restricting the rate of new connections a host makes in a given time [5]. A similar approach has been proposed for constraining email virus propagation by restricting the frequency at which email can be sent to new email recipients [8]. Apart from requiring universal deployment to be effective, this technique involves the difficult task of determining the threshold used to consider a TCP connection or an email to be malicious. A fixed setting of the threshold is also likely to delay legitimate traffic occasionally.

Several previously published methods of network intrusion detection involve variants of signature detection [12]. Signature detection uses libraries of past attacks and rule-learning techniques to identify attacks on the network. Vern Paxson's work on Bro provides a strongly-typed language for implementing policies to deal with abnormal events detected on network streams at a TCP/UDP level [15]. However, we believe this technique might be inadequate because modern virus attacks can overwhelm network resources before rules can be manually updated to provide protection against new threats.

Singh et al. recently proposed a mechanism for automated fingerprinting of worms [20]. The authors implemented an efficient distributed system that detects invariant substrings in the network traffic to fingerprint novel worms. The system aggregates this information across different source and destination IP addresses to determine if a worm attack is underway. Even though the system is shown to work well for new and known worms, there are some serious limitations on its effectiveness. Polymorphic worms (e.g. those that can make use of per-session encryption) can evade such a system completely by changing the payload for every infection. The attacker can also exploit the fact that the system maintains only limited state information in order to perform efficiently. For example, as mentioned in the paper, worms seen less frequently than once every 60 seconds are never registered by the system.

Early attempts at using machine learning models for detection of viruses used Artificial Neural Networks for detecting boot sector viruses [4]. More recently, Schultz et al. used a naïve Bayes classifier trained over consecutive byte sequences in email attachments to detect malicious executables [9]. They later extended their work to include printable strings in executables as another feature to train their models [10]. Maloof et al. have further explored this area of work by using models other than naïve Bayes (including decision trees, Support Vector Machines, k-nearest-neighbors and Term Frequency Inverse Document Frequency (TFIDF) models) [6]. In general, tracking malicious email by exclusively analyzing attachments tends to ignore certain types of email worms, such as those that spread via HTML embedded scripts or links in emails. A glaring example of such a worm is MyDoom. The techniques mentioned above can also be easily fooled by using standard code obfuscation techniques.

Social network analysis of email has been used for detecting spam as well as virus emails [13, 7]. In these models, users within a network are considered as nodes of a graph, and communication between any two nodes is indicated via an edge between the nodes. Clusters of nodes linked to one another form a social network, indicating common communication patterns among users. Communication that violates these behavioral patterns is treated as suspected spam or virus email. To circumvent this technique, an email virus can trick the model by intelligently choosing its recipient list (e.g. by using recipient information from recent emails in the ‘Sent Email’ folder). Hence, such techniques alone cannot provide a good defense against email worms.

Stolfo et al. created an email data mining system that used social network analysis along with other features indicative of user behavior to identify viral propagations [16, 17]. The authors maintained ‘user cliques’ for every user in the system. User cliques are basically clusters of email accounts that participate with each other in common email communication. Other features considered by the authors included variance in number of distinct recipients, sending rate of email and the number of emails with attachments over a sliding window of emails. The social network analysis done here is vulnerable to the type of attacks mentioned earlier. The classification mechanism we describe in this paper includes all other features used by the authors, along with several others that we believe are useful in discriminating normal user behavior from viral behavior.

3 System Architecture

The overall architecture for our system consists of a modular pipeline. Each segment encapsulates a separate stage in the process of classifying email behavior as infected or normal.

Figure 1 gives an overview of our basic architecture. In the first module of our system, email is gathered into a data corpus on an ongoing basis.

[...] labels, allowing the parametric classifier to be retrained in a semi-supervised manner for increased accuracy. In our planned deployable system, a ma-

Figure 1. The basic architecture for the classification system presented in this paper.
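As a rough illustration of the two-stage idea described above (a sensitive novelty detector followed by a parametric classifier that is retrained from administrator labels), the following Python sketch may help. It is not the system described in this paper: the z-score detector, the midpoint-threshold classifier, the single sending-rate feature, and all names (`NoveltyDetector`, `ThresholdClassifier`, `classify`) are assumptions chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class NoveltyDetector:
    """Deliberately sensitive first stage (assumed z-score test)."""
    threshold: float = 1.5  # hypothetical cutoff, set low on purpose
    mu: float = 0.0
    sigma: float = 1.0

    def fit(self, normal_rates):
        # Learn what "normal" sending rates look like.
        self.mu = mean(normal_rates)
        self.sigma = pstdev(normal_rates) or 1.0  # avoid division by zero
        return self

    def is_novel(self, rate):
        return abs(rate - self.mu) / self.sigma > self.threshold


@dataclass
class ThresholdClassifier:
    """Stand-in parametric second stage: one learned decision boundary."""
    # With no labels yet, defer entirely to the novelty detector.
    boundary: float = float("-inf")
    labeled: list = field(default_factory=list)  # (rate, is_infected) pairs

    def add_feedback(self, rate, is_infected):
        # Administrator labels arrive here; retrain after each one.
        self.labeled.append((rate, is_infected))
        infected = [r for r, y in self.labeled if y]
        normal = [r for r, y in self.labeled if not y]
        if infected and normal:
            # Boundary is the midpoint between the two class means.
            self.boundary = (mean(infected) + mean(normal)) / 2

    def is_infected(self, rate):
        return rate > self.boundary


def classify(rate, detector, classifier):
    # Only points flagged as novel are passed on to the second stage.
    return detector.is_novel(rate) and classifier.is_infected(rate)


detector = NoveltyDetector().fit([1, 2, 3, 2, 1, 2, 3, 2])  # emails per minute
classifier = ThresholdClassifier()
```

In this sketch, a rate of 4 is initially reported as infected because the detector is tuned to be sensitive. Once an administrator labels 4 as normal and supplies an infected example such as 50, the boundary moves between the two and a rate of 4 is no longer reported, which is the false-positive reduction the two-stage design aims for.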