US008161548B1
(12) United States Patent
Wan
(10) Patent No.: US 8,161,548 B1
(45) Date of Patent: Apr. 17, 2012

(54) MALWARE DETECTION USING PATTERN CLASSIFICATION
(75) Inventor: Justin Wan, Nanjing (CN)
(73) Assignee: Trend Micro, Inc., Tokyo (JP)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by [...] days.
(21) Appl. No.: 11/204,567
(22) Filed: Aug. 15, 2005
(51) Int. Cl. G06F 21/00 (2006.01)
(52) U.S. Cl. 726/22; 726/23; 726/24; 726/25; 713/188
(58) Field of Classification Search [...] See application file for complete search history.
Primary Examiner: Vivek Srivastava
Assistant Examiner: Thong Truong
(74) Attorney, Agent, or Firm: Beyer Law Group LLP

(57) ABSTRACT

A malware classifier uses features of suspect software to classify the software as malicious or not. The classifier uses a pattern classification algorithm to statistically analyze computer software. The classifier takes a feature representation of the software and maps it to the classification label with the use of a trained model. The feature representation of the input computer software includes the relevant features and the values of each feature. These features include the categories of: applicable software characteristics of a particular type of malware; dynamic link library (DLL) and function name strings typically occurring in the body of the malware; and other alphanumeric strings commonly found in malware. By providing these features and their values to the classifier, the classifier is better able to identify a particular type of malware.

22 Claims, 16 Drawing Sheets
[Representative drawing: flow diagram "Classification of Software". Steps shown: load feature definition file (504); load function definition; obtain suspect software; extract features; run classification algorithm; output classification label.]
U.S. Patent Apr. 17, 2012 Sheet 2 of 16 US 8,161,548 B1
FIG. 2: File Header Example (210)
MS-DOS MZ Header
MS-DOS Stub Program
PE File Signature
PE File Header (Machine, NumberOfSections, ...)
PE File Optional Header (SizeOfCode, SizeOfImage, DataDirectory, ...)
Section Headers
(The drawing also labels DataDirectory VirtualAddress entries.)
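By way of illustration only, a few of the header fields shown in FIG. 2 could be read as in the following sketch. The function name `parse_pe_header`, the returned dictionary keys and the choice of fields are assumptions made for this example and are not taken from the specification; the byte offsets follow the publicly documented PE layout (PE signature offset at 0x3C, a 20-byte file header, SizeOfCode at offset 4 and SizeOfImage at offset 56 of the optional header).

```python
import struct


def parse_pe_header(data: bytes) -> dict:
    """Read a handful of PE header fields that could serve as features.

    Hypothetical helper for illustration; not the feature extraction
    module of the specification.
    """
    if data[:2] != b"MZ":
        raise ValueError("missing MS-DOS MZ header")
    # The 32-bit value at 0x3C is the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE file signature")
    file_header = pe_offset + 4
    machine, number_of_sections = struct.unpack_from("<HH", data, file_header)
    optional_header = file_header + 20  # the PE file header is 20 bytes long
    major_linker_version = data[optional_header + 2]
    (size_of_code,) = struct.unpack_from("<I", data, optional_header + 4)
    (size_of_image,) = struct.unpack_from("<I", data, optional_header + 56)
    # Ratio feature of the kind listed in FIG. 4 (SizeOfCode/SizeOfImage).
    ratio = size_of_code / size_of_image if size_of_image else 0.0
    return {
        "machine": machine,
        "number_of_sections": number_of_sections,
        "major_linker_version": major_linker_version,
        "size_of_code": size_of_code,
        "size_of_image": size_of_image,
        "code_to_image_ratio": ratio,
    }
```

Expressing sizes as ratios of the image size, as in this sketch, keeps the values comparable across files of very different sizes.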
FIG. 3: Function Names as Features (table 260)

Function Name String (262)    Feature Value (264)
Count AutoStart Keys          Number of auto-start registry keys found in the body of the software.
Count Binding Keys            Number of file binding registry keys found in the body of the software.
Count EXE Files               Number of strings ending with ".exe" found in the body of the software.
Call Socket Connect           1 or 0. Whether the software calls Connect.
Call CreateFile               1 or 0. Whether the software calls CreateFile.
Call CopyFile                 1 or 0. Whether the software calls CopyFile.
Call DeleteFile               1 or 0. Whether the software calls DeleteFile.
Call GetWindowsDirectory      1 or 0. Whether the software calls GetWindowsDirectory.
Call MAPISendMail             1 or 0. Whether the software calls MAPISendMail.
Call Outlook                  1 or 0. Whether the software calls Outlook.
Call OutlookExpress           1 or 0. Whether the software calls OutlookExpress.
Call Word                     1 or 0. Whether the software calls Word.
Count HTML Tags               Number of HTML tags in the body of the software.
Count Kazza                   Number of strings with "Kazza" in it.
Count MSN                     Number of strings with "MSN Messenger" in it.
Count AOL                     Number of strings with "AOL" in it.
Count Crack                   Number of strings with "Crack" in it.
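As a minimal sketch of how "Call" and "Count" features of the kind tabulated in FIG. 3 might be computed, the following example searches the raw body of a file for byte strings. The dictionaries `CALL_FEATURES` and `COUNT_FEATURES`, the function name `extract_string_features`, the case-sensitive matching and the use of a simple substring count for "Count EXE Files" are all assumptions made for illustration, not the method of the specification.

```python
# Hypothetical feature tables: feature name -> byte string searched for.
CALL_FEATURES = {
    "Call CreateFile": b"CreateFile",
    "Call CopyFile": b"CopyFile",
    "Call DeleteFile": b"DeleteFile",
}
COUNT_FEATURES = {
    "Count EXE Files": b".exe",  # approximates "strings ending with .exe"
    "Count Crack": b"Crack",
}


def extract_string_features(body: bytes) -> dict:
    """Return a feature-name -> value mapping for one file body."""
    features = {}
    for name, needle in CALL_FEATURES.items():
        # "Call" features are binary: 1 if the name string occurs, else 0.
        features[name] = 1 if needle in body else 0
    for name, needle in COUNT_FEATURES.items():
        # "Count" features are non-negative integers.
        features[name] = body.count(needle)
    return features
```

Searching the body rather than the import table means the sketch still produces values for a file whose import table has been reduced to a stub.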
FIG. 4: Worm Features Example (300)

Header characteristics (310)
Major Linker Version                 [value not legible]
SizeOfImage                          217762
SizeOfCode/SizeOfImage               0.13
SizeOfInitializedCode/SizeOfImage    0.79
ImportTableSize/SizeOfImage          0.00028
ResourceSize/SizeOfImage             0.75
Entry Point Location                 Third Section

DLL names (320)
User32.dll, Ws2_32.dll, Comctl32.dll, Advapi32.dll, ntdll.dll

Other strings
TFTP Strings
Game Names
Software\Microsoft\Windows\CurrentVersion
FIG. 7: Flow, Classification of Software
504 Load Feature Definition File
508 Load Function Definition
512 Obtain Suspect Software
516 Extract Features
520 Run Classification Algorithm
524 Output Classification Label
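The flow of steps 504 through 524 can be sketched as follows, under stated assumptions. The model is represented here as a hypothetical dictionary holding a weight per feature, a bias and a label name, and the decision is a linear function of the feature values, which is the form a support vector machine takes with a linear kernel; the specification does not fix the file format of the trained model, so these names are illustrative only.

```python
def classify(feature_names, model, extracted):
    """Map extracted feature values to a classification label.

    feature_names: ordered names from the feature definition file (504).
    model: hypothetical dict with "weights", "bias", "positive_label" (508).
    extracted: feature-name -> value mapping for the suspect software (516).
    """
    # Build the feature vector in the order the definition file prescribes;
    # a feature that was not found takes the value 0.
    vector = [float(extracted.get(name, 0)) for name in feature_names]
    # Linear decision function: w . x + b (520).
    score = sum(w * v for w, v in zip(model["weights"], vector)) + model["bias"]
    # Output the classification label (524).
    return model["positive_label"] if score > 0 else "benign"
```

Keeping the feature order in the definition file and the weights in the model separate mirrors the division between the feature definition file and the function definition in the flow.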
FIG. 9: Flow, Produce Trained Model
604 Determine Classification Labels
608 Select Features and Add to Feature Definition File
612 Collect Training Samples
616 Select Parameters
620 Run Training Application
624 Output Trained Model
628 Output Measurement Results
632 Validate Results
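Steps 612 through 628 can be illustrated with a deliberately simple stand-in. The sketch below trains a perceptron, not the support vector machine named in the specification, because a perceptron fits in a few lines while producing the same kind of output, namely a weight vector and a bias. The function names `train_perceptron` and `measure`, the label convention (+1 for malware, -1 for benign) and the `epochs` parameter are assumptions made for this example.

```python
def train_perceptron(samples, labels, epochs=20):
    """Train a linear model; a stand-in for the training application (620)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified or on the boundary
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training sample is classified correctly
            break
    return {"weights": weights, "bias": bias}  # trained model (624)


def measure(model, samples, labels):
    """Return (detection rate, false-positive rate) as measurement results (628)."""
    true_pos = false_pos = 0
    for x, y in zip(samples, labels):
        score = sum(w * v for w, v in zip(model["weights"], x)) + model["bias"]
        if score > 0:
            if y == 1:
                true_pos += 1
            else:
                false_pos += 1
    positives = labels.count(1)
    negatives = labels.count(-1)
    detection = true_pos / positives if positives else 0.0
    false_rate = false_pos / negatives if negatives else 0.0
    return detection, false_rate
```

In practice the measurement would be made on samples held out from training, so that the validation of step 632 reflects behavior on software the model has not seen.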
FIGS. 10A-10F: portions of a feature definition file (reference numerals 704 through 780). Most of the listing is not legible. Legible entries:
<feature name="match/MAPISendMail"/>
<feature name="match/WNetAddConnection2"/>
<feature name="match/Westwood\Red Alert"/>
A closing </feature-set> tag is also legible in FIG. 10E.
[...] spyware, etc., to affect a computer such that it will not behave as expected. Malicious software can delete files, slow computer performance, clog e-mail accounts, steal confidential information, cause computer crashes, allow unauthorized access and generally perform other actions that are undesirable or not expected by the user of the computer.

Current technology allows computer users to create backups of their computer systems and of their files and to restore their computer systems and files in the event of a catastrophic failure such as a loss of power, a hard drive crash or a system operation failure. Assuming that the user had performed a backup prior to the failure, it can be straightforward to restore their computer system and files to a state prior to the computer failure. Unfortunately, these prior art techniques are not effective when dealing with infection of a computer by malicious software. It is important to be able to detect such malware when it first becomes present in a computer system, or better yet, before it can be transferred to a user's computer.

Prior art techniques able to detect known malware use a predefined pattern database that compares a known pattern with suspected malware. This technique, though, is unable to handle new, unknown malware. Other prior art techniques use predefined rules or heuristics to detect unknown malware. These rules take into account some characteristics of the malware, but these rules need to be written down manually and are hard to maintain. Further, it can be very time-consuming and difficult to attempt to record all of the rules necessary to detect many different kinds of malware. Because the number of rules is often limited, this technique cannot achieve both a high detection rate and a low false-positive rate.

Given the above deficiencies in the prior art in being able to detect unknown malware efficiently, a suitable solution is desired.

SUMMARY OF THE INVENTION

To achieve the foregoing, and in accordance with the purpose of the present invention, a malware classifier is disclosed that uses features of suspect software to classify the software as malicious or not. The present invention provides the ability to detect a high percentage of unknown malware with a very low false-positive rate.

A malware classifier uses a pattern classification algorithm to statistically analyze computer software in order to categorize it by giving it a classification label. Any suspect computer software is input to the malware classifier with the resulting output being a label that identifies the software as benign, normal software or as a particular type of malicious software. The classifier takes a feature representation of the software and maps it to the classification label with the use of a trained model, or function definition.

[...] as benign software. A training application is executed that outputs a trained model for identifying the particular type of malware.

A second embodiment is a method for classifying suspect software. First, a group of features relevant to a particular type of malware are selected along with a trained model that has been trained to identify the same type of malware. The malware classifier extracts features and their values from suspect software and inputs same to a classification algorithm. The classification algorithm outputs a classification label for the suspect software, identifying it as malware or as benign.

A third embodiment is a malware classifier apparatus. The apparatus includes a feature definition file having features known to be associated with the type of malware, a model being trained to identify that malware, a feature extraction module and a pattern classification algorithm. In one specific embodiment, the classification algorithm is the support vector machine (SVM) algorithm.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention, together with further advantages thereof, may best be understood by reference to the following description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a block diagram of a malware classifier according to one embodiment of the invention.
FIG. 2 illustrates the header of a file in portable executable format.
FIG. 3 is a table illustrating the use of function names as features as well as alphanumeric strings.
FIG. 4 illustrates a list of features and their values from a real-world worm.
FIG. 5 illustrates a hyperplane used in the SVM algorithm.
FIG. 6 illustrates a situation in which the training samples are not linearly separable.
FIG. 7 is a flow diagram describing the classification of computer software.
FIG. 8 is a block diagram illustrating the creation of a trained model.
FIG. 9 is a flow diagram describing training of the classification algorithm and the creation of a trained model.
FIGS. 10A-10F show portions of a feature definition file.
FIG. 11 is an example showing a trained model output by the training application for the purposes of detecting a computer worm.
FIGS. 12A and 12B illustrate a computer system suitable for implementing embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention is applicable to all malicious software, or malware, that generally causes harm to a computer system, provides an effect that is not expected by the user, is undesirable, illegal, or otherwise causes the user to want to restore their computer system from a time prior to when it was infected by the malware. Malware can be classified based upon how it is executed, how it spreads or what it does. The below descriptions are provided as guidelines for the types of malware currently existing; these classifications are not perfect in that many groups overlap. Of course, later developed software not currently known may also fall within the definition of malware.

When computer viruses first originated common targets were executable files and the boot sectors of floppy disks; later targets were documents that contain macro scripts, and more recently, many computer viruses have embedded themselves in e-mail as attachments. With executable files the virus arranges that when the host code is executed the virus code is executed as well. Normally, the host program continues to function after it is infected by the virus. Some viruses overwrite other programs with copies of themselves, thus destroying the program. Viruses often spread across computers when the software or document to which they are attached is transferred from one computer to another. Computer worms are similar to viruses but are stand-alone software and thus do not require host files or other types of host code to spread themselves. They do modify the host operating system, however, at least to the extent that they are started as part of the boot process. In order to spread, worms either exploit some vulnerability of the target host or use some kind of social engineering to trick users into executing them.

A Trojan horse program is a harmful piece of software that is often disguised as legitimate software. Trojan horses cannot replicate themselves, unlike viruses or worms. A Trojan horse can be deliberately attached to otherwise useful software by a programmer, or can be spread by tricking users into believing that it is useful. Some Trojan horses can spread or activate other malware, such as viruses (a dropper). A wabbit is a third, uncommon type of self-replicating malware. Unlike viruses, wabbits do not infect host programs or documents. And unlike worms, wabbits do not use network functionality to spread to other computers. A simple example of a wabbit is a fork bomb.

Spyware is a piece of software that collects and sends information (such as browsing patterns or credit card numbers) about users and the results of their computer activity without explicit notification. Spyware usually works and spreads like Trojan horses. The category of spyware may also include adware that a user deems undesirable. A backdoor is a piece of software that allows access to the computer system by bypassing the normal authentication procedures. There are two groups of backdoors depending upon how they work and spread. The first group work much like a Trojan horse, i.e., they are manually inserted into another piece of software, executed via their host software and spread by the host software being installed. The second group work more like a worm in that they get executed as part of the boot process and are usually spread by worms carrying them as their payload. The term ratware has arisen to describe backdoor malware that turns computers into zombies for sending spam.

An exploit is a piece of software that attacks a particular security vulnerability. Exploits are not necessarily malicious in intent; they are often devised by security researchers as a way of demonstrating that a vulnerability exists. They are, however, a common component of malicious programs such as network worms. A root kit is software inserted onto a computer system after an attacker has gained control of the system. Root kits often include functions to hide the traces of the attack, as by deleting logged entries or by cloaking the attacker's processes. Root kits might include backdoors, allowing the attacker to easily regain access later or to exploit software to attack other systems. Because they often hook into the operating system at the kernel level to hide their presence, root kits can be very hard to detect.

Key logger software is software that copies a computer user's keystrokes to a file which it may send to a hacker at a later time. Often the key logger software will only awaken when a computer user connects to a secure web site such as a bank. It then logs the keystrokes, which may include account numbers, PINs and passwords, before they are encrypted by the secure web site. A dialer is a program that replaces the telephone number in a modem's dial-up connection with a long-distance number (often out of the country) in order to run up telephone charges on pay-per-dial numbers, or dials out at night to send key logger or other information to a hacker. Software known as URL injection software modifies a browser's behavior with respect to some or all domains. It modifies the URL submitted to the server to profit from a given scheme by the content provider of the given domain. This activity is often transparent to the user.

The present invention is suitable for use with a wide variety of types and formats of malware. The below description provides an example of the use of the invention with malware written in the portable executable (PE) format. As is known in the art, the portable executable format is an executable file format used in 32-bit and 64-bit versions of Microsoft operating systems. The portable executable format is a modified version of the UNIX COFF file format. Of course, the present invention applies to computer files in other formats as well.

Malware Classifier

A malware classifier is a software application that uses a pattern classification algorithm to statistically analyze computer software in order to categorize it by giving it a classification label. Any suspect computer software may be input to the malware classifier with the resulting output being a label that identifies the software as benign, normal software or as a particular type of malicious software. The classifier takes a feature representation of the software and maps it to the classification label with the use of a trained model, or function definition.

FIG. 1 is a block diagram of a malware classifier 100 according to one embodiment of the invention. Input to classifier 100 is computer software 110 which is suspected of being malware. A feature definition file 120 lists all relevant features of any potential computer software and the corresponding attributes for each feature. Feature extraction module 125 is computer software that extracts values for the defined features from the input computer software 110. Trained model 130 is the trained classification function in the form of a computer file that is output by a separate training application as described below. Model 130 is trained by mapping a vector of features into one of several classes by looking at many input-output examples. Pattern classification algorithm 140 is any suitable pattern classification algorithm that accepts feature values and the trained model as input and outputs a classification label 150, or class, for the input computer software 110. Classification algorithm 140 is designed to approximate the behavior of the trained model.

As alluded to above, an effective malware classifier relies upon a suitable classification algorithm, a set of features, feature normalization methods, and training samples (i.e., examples of benign software and malware).

Software Features

Current technologies for detecting malware include noting malicious behavior such as an abnormal TCP connection on a given port or the adding of a registry key that automatically loads itself when the operating system starts. Certain types of malware, however, have behaviors that can be difficult to track. A worm, for example, can create processes with different names on different machines and can behave differently on different machines, all of which make its behavior difficult to track.

But, each type of malware exhibits a certain pattern which is different from that of benign computer software. A worm, for example, is likely to call RegCreateKey and RegSetValue, to add an entry in HKLM\Software\Microsoft\CurrentVersion\Run, and to call connect or CopyFile or CreateFile in order to propagate itself. Plus, most of the effort expended by the worm involves propagating itself and damaging files, so there are not many calls to GDI functions or to Common Controls functions. Further, the header of a worm written in a portable executable format will have certain characteristics. Each of the other various types of malware (such as viruses, spyware, adware, etc.) also will have distinctive characteristics and will exhibit distinctive behavior. It is therefore realized that a known pattern classification algorithm may be used to analyze these features of computer software suspected of being malware and to output a result that classifies the computer software as benign or as a particular type of malware.

In one embodiment of the invention, a specific feature definition file is used to classify each type of malware. For example, if it is decided to implement a malware classifier that will detect computer worms then a feature definition file is constructed having specific features relevant to computer worms. On the other hand, if it is desired to detect spyware, a separate feature definition file is used including features known to be present in spyware. Of course, if the goal is to detect computer worms, then training data is supplied to the training application (described below) having examples of computer worms and benign software. The resulting trained model is tuned specifically to detect computer worms and is used in conjunction with a feature definition file containing worm features.

In an alternative embodiment, it is possible that a single feature definition file may be used to detect two or more types of malware. For example, two sets of features identifying two types of malware are combined into one large feature set. Assume features f0, f1 and f2 are for detecting malware type #1, and that features f3, f4 and f5 are for detecting malware type #2. The combined feature set f0, f1, f2, f3, f4 and f5 is used to detect malware types #1 and #2 by using a classification algorithm that combines the logic of the classification functions for detecting malware types #1 and #2.

Feature definition file 120 lists all of the relevant features and the attributes of each feature that might possibly be encountered in computer software 110. These features include the categories of applicable software characteristics; dynamic link library (DLL) and function name strings occurring in the body of the software; and other strings commonly found in malware. Other types of features may also be used.

In this embodiment, the applicable software characteristics include the fields of the header of a file in portable executable format. For example, these fields are: packed, packer, number of sections, image size, code section size, import table size, export table size, resource size, subsystem, initialized section size, uninitialized section size, image base address, and entry point location. FIG. 2 illustrates the header 210 of a file in portable executable format. Shown is relevant header information that contains suitable characteristics to use as features. Of course, header 210 is specific to a portable executable format; other file types will have other relevant header information and characteristics.

Another category of features includes dynamic link library (DLL) and function name strings occurring in the body of the software. This category enumerates DLL name strings and function name strings that might be imported by suspected malware. In this particular embodiment, the enumerated strings are those that might be used by malware in a portable executable format. Each name string is considered a feature and the value of each of these features will either be one or zero depending upon whether the name string occurs in the body of the suspect computer software. For example, consider kernel32.dll, comctl32.dll, urlmon.dll, shell32.dll, advapi32.dll, InterlockedIncrement, GetThreadLocale as features Fk, Fc, Fu, Fs, Fa, Fi and Fg accordingly. For a given suspect computer software, if only the strings "advapi32.dll" and "GetThreadLocale" are found in its body, then the values of Fa and Fg are each one while the other values are all zero. Other possible functions include RegDeleteValue, RegEnumValue, CreateThread and CreatePipe, etc.

FIG. 3 is a table 260 illustrating the use of function names as features, as well as alphanumeric strings described below. This table lists examples of those function names that are commonly associated with malware; many other function names are possible. Column 262 lists examples of function names ("CallXXX") that might appear as strings within the body of suspect computer software, as well as feature names that perform a count of particular alphanumeric strings found in the software ("CountXXX"). Column 264 lists the corresponding value for each function name that is considered a feature. While the "Call" feature names will have a value of one or zero, the "Count" feature names will have any integer value depending upon the particular data.

Because many malware programs are packed, leaving only the stub of the import table or perhaps even no import table, the malware classifier will search for the name of the dynamic link library or function in the body of the suspected malware. Adding more function names or dynamic link library names as features will likely yield better classification results.

A third category of features includes alphanumeric strings commonly found in malware. These are strings identifying registry keys, passwords, games, e-mail commands, etc. that malware typically uses. The presence of a quantity of these strings in a given computer software program indicates it is more likely than not that the software is malware. For example, a string indicating that computer software has been compressed by a tool like UPX is a good indicator that the software might be malware since benign computer software seldom uses that tool. Also, malware often steals and uses the CD keys for some of the common computer games.

Examples of these strings include auto-run registry keys such as
CurrentVersion\Run
CurrentVersion\RunServices
HKLM\Windows\Software\Microsoft\CurrentVersion\Run
and
HKCR\exefile\shell\open\command.

Other examples include commonly used passwords such as "administrator," "administrateur," "administrador," "1234," "password123," "admin123," etc.; registry keys or installation paths of games such as "Illusion Softworks\Hidden & Dangerous 2," "Electronic Arts\EA Sports" and "Westwood\Red Alert"; SMTP commands such as "MAIL FROM:" and "RCPT TO:"; peer-to-peer application names such as "KaZaA," "emule," "WinMX," "ICQ," "MSN Messenger," "Yahoo Messenger," etc.; HTML syntax such as [...]

[...]
<feature name="pe/code-size"/>
<feature name="pe/initialized-data-size"/>
<feature name="pe/uninitialized-data-size"/>
<feature name="pe/entry-point-location"/>
[...] categories="0x01000000, 0x00400000"/>
<feature name="pe/import-table-size"/>
<feature name="pe/resource-table-size"/>
<feature name="pe/count-imported-dlls" range="0,47"/>
<feature name="pe/count-imported-functions" [...] 1000"/>
<feature name="pe/shell" categories="none,upx,aspack [...]

<feature name="dll/kernel32.dll" range="0,826"/>
<feature name="dll/user32.dll" range="0,695"/>
<feature name="dll/advapi32.dll" range="0,565"/>
<feature name="dll/shell32.dll" range="0,406"/>
<feature name="dll/ole32.dll" range="0,304"/>
<feature name="dll/gdi32.dll" range="0,543"/>
<feature name="dll/oleaut32.dll" range="0,360"/>
<feature name="dll/wininet.dll" range="0,208"/>
<feature name="dll/comctl32.dll" range="0,82"/>
<feature name="dll/msvcrt.dll" range="0,779"/>
<feature name="dll/rasapi32.dll" range="0,145"/>
<feature name="dll/version.dll" range="0,16"/>
<feature name="dll/comdlg32.dll" range="0,26"/>
<feature name="dll/wsock32.dll" range="0,75"/>
<feature name="dll/mfc42.dll" range="0,6933"/>
<feature name="dll/rpcrt4.dll" range="0,471"/>
<feature name="dll/shlwapi.dll" range="0,749"/>
<feature name="dll/urlmon.dll" range="0,77"/>
<feature name="dll/ws2_32.dll" range="0,109"/>
<feature name="dll/msvbvm60.dll" range="0,634"/>
<feature name="dll/winspool.drv" range="0,167"/>
<feature name="dll/winmm.dll" range="0,198"/>
<feature name="dll/lz32.dll" range="0,13"/>

<feature name="api/GetProcAddress"/>
<feature name="api/ExitProcess"/>
<feature name="api/LoadLibraryA"/>
<feature name="api/RegCloseKey"/>
<feature name="api/GetModuleHandleA"/>
<feature name="api/CloseHandle"/>
<feature name="api/GetModuleFileNameA"/>
<feature name="api/WriteFile"/>
<feature name="api/GetLastError"/>
<feature name="api/GetCommandLineA"/>
<feature name="api/MultiByteToWideChar"/>
<feature name="api/CreateFileA"/>
<feature name="api/GetStartupInfoA"/>
<feature name="api/WideCharToMultiByte"/>
<feature name="api/SetFilePointer"/>
<feature name="api/VirtualAlloc"/>
<feature name="api/ReadFile"/>
<feature name="api/VirtualFree"/>
<feature name="api/RegQueryValueExA"/>
<feature name="api/RtlUnwind"/>
<feature name="api/GetFileType"/>
<feature name="api/GetStdHandle"/>
<feature name="api/lstrlenA"/>
<feature name="api/MessageBoxA"/>
<feature name="api/FreeLibrary"/>
<feature name="api/RegOpenKeyExA"/>
<feature name="api/CoInitialize"/>
<feature name="api/lstrcpyA"/>
<feature name="api/Sleep"/>
<feature name="api/GetCurrentProcess"/>
<feature name="api/ShellExecuteA"/>
<feature name="api/InitializeCriticalSection"/>
<feature name="api/LeaveCriticalSection"/>
<feature name="api/EnterCriticalSection"/>
<feature name="api/RegSetValueExA"/>
<feature name="api/DeleteCriticalSection"/>
<feature name="api/SetEndOfFile"/>
<feature name="api/HeapAlloc"/>
<feature name="api/HeapFree"/>
<feature name="api/GetVersionExA"/>
<feature name="api/GetCurrentThreadId"/>
<feature name="api/DeleteFileA"/>
<feature name="api/GetDC"/>
<feature name="api/RaiseException"/>
<feature name="api/UnhandledExceptionFilter"/>
<feature name="api/HeapDestroy"/>
<feature name="api/TerminateProcess"/>
<feature name="api/GetFileSize"/>
<feature name="api/GetCPInfo"/>
<feature name="api/HeapCreate"/>
<feature name="api/GetACP"/>
<feature name="api/GetVersion"/>
<feature name="api/GetEnvironmentStrings"/>
<feature name="api/SetHandleCount"/>
<feature name="api/GetStringTypeW"/>
<feature name="api/SetTimer"/>
<feature name="api/InterlockedDecrement"/>
<feature name="api/wsprintfA"/>
<feature name="api/GetOEMCP"/>
<feature name="api/ShowWindow"/>
<feature name="api/GetStringTypeA"/>
<feature name="api/LCMapStringA"/>
<feature name="api/FreeEnvironmentStringsA"/>
<feature name="api/CoCreateInstance"/>
<feature name="api/FreeEnvironmentStringsW"/>
<feature name="api/GetEnvironmentStringsW"/>
<feature name="api/RegCreateKeyExA"/>
<feature name="api/LCMapStringW"/>
<feature name="api/DispatchMessageA"/>
<feature name="api/InterlockedIncrement"/>
<feature name="api/CreateThread"/>
<feature name="api/InternetOpenA"/>
<feature name="api/CreateDirectoryA"/>
<feature name="api/SetWindowPos"/>
<feature name="api/WaitForSingleObject"/>
<feature name="api/DeleteObject"/>
<feature name="api/GetClientRect"/>
<feature name="api/FindFirstFileA"/>
<feature name="api/lstrcpynA"/>
<feature name="api/RegDeleteKeyA"/>
<feature name="api/LocalAlloc"/>
<feature name="api/FindClose"/>
<feature name="api/PostQuitMessage"/>
<feature name="api/GetWindowTextA"/>
<feature name="api/BitBlt"/>
<feature name="api/TranslateMessage"/>
<feature name="api/lstrcatA"/>
<feature name="api/EnableWindow"/>
<feature name="api/CreateWindowExA"/>
<feature name="api/LoadIconA"/>
<feature name="api/GetTempPathA"/>
<feature name="api/GetWindowRect"/>
<feature name="api/GetLocaleInfoA"/>
<feature name="api/FlushFileBuffers"/>
<feature name="api/IsWindow"/>
<feature name="api/CreateProcessA"/>
<feature name="api/DestroyWindow"/>
<feature name="api/LoadCursorA"/>
<feature name="api/SelectObject"/>
<feature name="api/SetStdHandle"/>
<feature name="api/SetWindowTextA"/>
<feature name="api/GetDlgItem"/>
<feature name="api/PostMessageA"/>
<feature name="api/lstrcmpiA"/>
<feature name="api/RegDeleteValueA"/>
<feature name="api/SetWindowLongA"/>
<feature name="api/GetTickCount"/>
<feature name="api/GetWindowsDirectoryA"/>
<feature name="api/RasDialA"/>
<feature name="api/GetMessageA"/>
<feature name="api/KillTimer"/>
<feature name="api/GetThreadLocale"/>
<feature name="api/CharNextA"/>
<feature name="api/SetFocus"/>
<feature name="api/GetStockObject"/>
[...]
<feature name="api/CreateSolidBrush"/>
<feature name="api/GetSysColor"/>
<feature name="api/GetDeviceCaps"/>
<feature name="api/FindResourceA"/>
<feature name="api/HeapSize"/>
<feature name="api/CreateCompatibleDC"/>
<feature name="api/GetEnvironmentVariableA"/>
<feature name="api/InvalidateRect"/>
<feature name="api/LoadResource"/>
<feature name="api/FillRect"/>
<feature name="api/SetLastError"/>
<feature name="api/IsBadReadPtr"/>
<feature name="api/GetParent"/>
<feature name="api/DialogBoxParamA"/>
<feature name="api/CreateMutexA"/>
<feature name="api/SystemParametersInfoA"/>
<feature name="api/SetUnhandledExceptionFilter"/>
<feature name="api/CompareStringA"/>
<feature name="api/PeekMessageA"/>
<feature name="api/InternetCloseHandle"/>
<feature name="api/IsBadWritePtr"/>
<feature name="api/RemoveDirectoryA"/>
<feature name="api/GetWindow"/>
<feature name="api/GetShortPathNameA"/>
<feature name="api/CreateFontA"/>

Dialer Feature Definition File Example

As previously mentioned, the present invention may be used to detect a wide variety of types of malware. Listed below is an example feature definition file for detecting dialer malware. One of skill in the art, upon a reading of the specification and the examples contained herein, would be able to use the invention to detect a variety of other types of malware.

Tunisia,Thailand,Taiwan,Syria,Switzerland,Sweden,Spain,South Africa,Slovenia,Slovak Republic,Singapore,Serbia,
Saudi Arabia,Russia,Romania,Qatar,Portugal,Poland,Philippines,Paraguay,
<feature name="match/CurrentVersion\Run"/>
<feature name="match/CurrentVersion\RunOnce"/>
<feature name="match/CurrentVersion\RunServices"/>
<feature name="match/CurrentVersion\RunServicesOnce"/>
<feature name="match/txtfile\shell\open\command"/>

that includes second features relevant to the classifica
benign;
an exploit, a root kit, key logger software, a dialer or URL injection software.
3. A method as recited in claim 1 wherein said type of malware is a worm, spyware or a dialer.
4. A method as recited in claim 1 wherein said character
outputting said classification label for said previously un
8. A method as recited in claim 1 wherein executing a