Learning from Big Malwares

Linhai Song (1), Heqing Huang (2), Wu Zhou (3), Wenfei Wu (1), Yiying Zhang (4)
(1) University of Wisconsin-Madison, (2) IBM T.J. Watson Research Center, (3) North Carolina State University, (4) Purdue University

Abstract

This paper calls for attention to investigating real-world malwares at large scale by examining the largest real malware repository, VirusTotal. As a first step, we analyzed two fundamental characteristics of Windows executable malwares from VirusTotal. We designed offline and online tools for this analysis. Our results show that malwares appear in bursts and that distributions of malwares are highly skewed.

1. Introduction

Malwares grow exponentially [3]. AVTest [3] reports that more than 140 million new malwares appeared in 2015. Such malwares pose increasing threats to human society every day. For example, there were almost 2 million attempts to steal money from online bank accounts with malwares that exploit vulnerabilities in the Adobe Flash player [10]. Researchers and practitioners continue to build security tools to defend against new malwares. To assist the design of these tools, it is essential to understand malwares in the real world.

Previous works on analyzing the behaviors and evolutions of malwares [8, 16] have provided insights into how malwares circumvent detection by existing antivirus techniques and how malware writers create new malwares. However, these works only studied a limited number of malwares that target certain types of security threats or antivirus engines.

Studying malwares at large scale and with high diversity, what we call big malwares, can expose new insights beyond these isolated studies.

VirusTotal [2] is a popular online service that real-world users use to analyze suspicious files and URLs. VirusTotal does not judge whether a submission is malware by itself. Instead, it applies state-of-the-art antivirus engines from more than 50 vendors to each submitted file, and generates a summary report that includes the detection results of all these engines. VirusTotal saves all user-submitted files and generated reports and provides open access to them.

The VirusTotal repository provides a valuable resource to gain insights into the behavior of malwares. First, it contains a huge number of real-world files. For example, more than 40 million suspicious files were submitted in November 2015 (Figure 1). Files on VirusTotal have been submitted by real-world users from all over the world since 2004 and involve various security threats. This amount of diverse data makes VirusTotal a good representative of malwares in the real world.

Second, because VirusTotal applies a host of state-of-the-art antivirus engines to all submitted files, it captures how these engines evolve over time.

Third, VirusTotal provides rich metadata. Besides reporting whether or not a submitted file has malware, VirusTotal also captures the exact type of malware, which antivirus engines detected the malware, when and by whom the file was submitted, and other useful metadata. There are also active malware researchers and engineers who comment and vote on each submitted file, providing valuable human input.

The VirusTotal repository exposes many new research opportunities. For example, studying malwares over a long time period and across all countries in the world can provide high-level insight into how malwares evolve over time and over geo-locations. Studying how antivirus engines change over time together with how malwares change can reveal the effectiveness of responses to new security threats. Finally, it is interesting to investigate whether we can build online malware prediction tools by applying machine learning techniques to the VirusTotal data. All these insights could in turn help future antivirus researchers and engineers design better malware defense mechanisms.

Unfortunately, there has been little work looking at this valuable repository. In industry, many antivirus vendors use VirusTotal to identify false negatives and false positives in their products. However, they only use VirusTotal to examine suspicious files separately, and do not consider correlations among different suspicious files. In academia, only recently did researchers begin to pay attention to mining the VirusTotal repository. Graziano et al. [7] leveraged VirusTotal user ID information to identify malware writers who use VirusTotal as a test platform.

We propose to investigate the VirusTotal repository along two directions: offline study of its rich data and metadata, and online analysis and prediction of malwares. Our study can provide a better understanding of malware distribution and appearance, and can help malware researchers focus their effort when designing antivirus techniques.

As a first step, we collected one month of VirusTotal repository data with more than 40 million suspicious files and conducted an early-stage empirical study on them. VirusTotal supports downloading both data files and their metadata. In this study, we focus on the metadata and find that metadata analysis alone already yields a fair number of valuable conclusions. Studying other data could potentially provide more insights, and we leave it for future work.

[Figure 1: The number of files and malwares. The number of suspicious files, PE files, and malwares submitted to VirusTotal in November 2015. Series: Submissions, PE Submissions, PE Malwares. x-axis: date (11-02-15 to 11-30-15); y-axis: VirusTotal reports (in million).]

[Figure 2: File types for all submissions in November 2015. File types and their distributions for all submissions on VirusTotal in November 2015. Categories shown: PE, unknown, Web page, Android, ZIP, PDF, Image, Text, Office, Audio+Video, Java, Other.]

[Figure 3: Size distribution for malwares on VirusTotal in November 2015. The number of malwares falling into each log2-based size category in November 2015. x-axis: File size (log2 based); y-axis: # of malwares (in million).]

We centered our analysis around the concept of malware families. A malware family is a set of malwares that have common behaviors. We focus our initial attention on all Windows executable (PE) malwares detected by Microsoft antivirus engines, because our previous experience shows that Microsoft engines can generate precise results on PE files.

We focus our analysis on two dimensions: temporal properties of malwares and malware distributions. In each dimension, we investigate both offline and online analysis mechanisms.

We began our temporal analysis with general malware characteristics, including the submission frequencies and generation rates of malware families. We then study the burstiness of malwares, that is, how closely malwares belonging to the same family occur in time. To understand malware burstiness, we designed a new cache-based mechanism that can support both offline and online analysis. Moreover, our LRU cache can predict malware family occurrences in the near future, allowing antivirus vendors to better focus their efforts.

We focus our malware distribution study on detecting skewness of malware families and identifying hot families, that is, families that have many malwares. We observe that the distribution of malware families is highly skewed. To assist online analysis of malware family distribution, we apply a frequent item mining algorithm [12] to identify hot malware families. This solution can answer queries about historical hot malware families in near-real time using constant memory.

In summary, this paper makes the following contributions:

• We are the first to collect mass data from the VirusTotal repository for large-scale malware study. We collected and preprocessed files submitted to VirusTotal in November 2015 (Section 2).
• We studied the burstiness of malwares using a new cache-based malware prediction tool. This tool achieves greater than 90% prediction precision (Section 3).
• We observe that family distributions of malwares are highly skewed, i.e., some families have a huge number of malwares while others include only a few. Leveraging this observation, we built a hot malware family mining tool that identifies hot malware families (Section 4).
• We identify several key research opportunities on top of the VirusTotal repository (Section 5).

2. Data Collection and Basic Properties

This section discusses how we collected and preprocessed data from VirusTotal. We will present the analysis of these data in the next two sections.

Metadata Field     Explanation
name               submitted file name
timestamp          timestamp when the submission was made
source country     the country where the submission was made
source id          user ID that made the submission
size               file size
type               file type
tags               labels with more specific information for each type
first seen         when the same type of file was first submitted
last seen          when the same type of file was last submitted
hashes             hash value of the submitted file
total              number of engines that analyzed the file
positives          number of engines that flagged the file as malicious
positives delta    changes in positives across different engine scans
report             detailed detection report from each AV engine

Table 1: VirusTotal Metadata. Fields for each submission retrieved from the VirusTotal private API and their explanations. The same file can be submitted multiple times by different users.

We downloaded the metadata and malware reports of all submitted files in November 2015 using the APIs VirusTotal provides, resulting in a total of 43 million reports.
Table 1 shows the metadata fields and their meaning.

127 64-bit malwares in our sample set, and all other mal-
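To make the Table 1 fields concrete, the following is a minimal sketch of how one submission record might be held in memory. It is not the authors' tooling: the class name, the underscored attribute names, the assumption that size is in bytes, and the sample values are all hypothetical, and only a subset of the Table 1 fields is modeled.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """A subset of the Table 1 metadata fields (attribute names are assumed)."""
    name: str            # submitted file name
    timestamp: str       # when the submission was made
    source_country: str  # country where the submission was made
    size: int            # file size (assumed here to be in bytes)
    file_type: str       # file type, e.g. a PE executable
    total: int           # number of engines that analyzed the file
    positives: int       # number of engines that flagged the file

    def detection_ratio(self) -> float:
        """Fraction of engines that flagged the file as malicious."""
        return self.positives / self.total if self.total else 0.0


# Hypothetical sample record; the values are made up for illustration.
sample = Submission(
    name="sample.exe",
    timestamp="2015-11-02 10:00:00",
    source_country="US",
    size=4096,
    file_type="PE",
    total=55,
    positives=11,
)
ratio = sample.detection_ratio()
```

A ratio such as positives over total is one simple way to summarize a report; the paper itself relies on the labels of specific engines rather than on this ratio.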
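The introduction mentions a cache-based mechanism for studying burstiness. The authors' design is described in Section 3 and is not reproduced here; the sketch below only illustrates the general idea under stated assumptions: a plain LRU cache keyed by family label, where a submission counts as a "hit" if its family is already cached. The function name and the sample stream are hypothetical.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hit_rate(families: Iterable[str], capacity: int) -> float:
    """Replay a time-ordered stream of family labels through an LRU cache
    and return the fraction of submissions whose family was already cached."""
    cache: "OrderedDict[str, bool]" = OrderedDict()
    hits = 0
    total = 0
    for family in families:
        total += 1
        if family in cache:
            hits += 1
            cache.move_to_end(family)      # mark as most recently used
        else:
            cache[family] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


# Hypothetical stream: family "a" arrives in a burst, then "b", then "c".
bursty = ["a", "a", "a", "b", "b", "c", "a"]
bursty_rate = lru_hit_rate(bursty, capacity=2)

# A stream with no temporal locality at this cache size yields no hits.
alternating_rate = lru_hit_rate(["a", "b", "a", "b"], capacity=1)
```

Under this reading, a high hit rate with a small cache indicates that malwares of the same family cluster in time, which is what would make recently seen families useful as predictions.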
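The introduction also mentions mining hot families with a frequent item mining algorithm [12] in constant memory. The specific algorithm from [12] is not shown in this section, so the sketch below substitutes the Misra-Gries summary, a standard technique with the same flavor: it keeps at most k - 1 counters and guarantees that any item occurring more than n/k times in a stream of n items is retained. The function name and the family labels are hypothetical.

```python
from typing import Dict, Iterable


def misra_gries(stream: Iterable[str], k: int) -> Dict[str, int]:
    """Return candidate frequent items using at most k - 1 counters.
    Stored counts are lower bounds on the true frequencies."""
    counters: Dict[str, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter, drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical skewed stream: one hot family and several rare ones.
stream = ["family_a"] * 6 + ["family_b", "family_c", "family_d", "family_e"]
hot = misra_gries(stream, k=3)
```

Because the summary can retain items that are not actually frequent, a second pass or a threshold on the stored counts is typically used to confirm candidates when exact answers are needed.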
