Learning from Big Malwares

Linhai Song (1), Heqing Huang (2), Wu Zhou (3), Wenfei Wu (1), Yiying Zhang (4)
(1) University of Wisconsin-Madison, (2) IBM T.J. Watson Research Center, (3) North Carolina State University, (4) Purdue University

Abstract

This paper calls for attention to investigating real-world malwares at large scale by examining the largest real malware repository, VirusTotal. As a first step, we analyzed two fundamental characteristics of Windows executable malwares from VirusTotal. We designed offline and online tools for this analysis. Our results show that malwares appear in bursts and that distributions of malwares are highly skewed.

1. Introduction

Malwares grow exponentially [3]. AV-TEST [3] reports that more than 140 million new malwares appeared in 2015. Such malwares pose increasing threats to human society every day. For example, there were almost 2 million attempts to steal money from online bank accounts with malwares that exploit vulnerabilities in the Adobe Flash player [10].

Researchers and practitioners continue to build security tools to defend against new malwares. To assist the design of these tools, it is essential to understand malwares in the real world.

Previous works on analyzing the behaviors and evolution of malwares [8, 16] have provided insights into how malwares circumvent detection by existing antivirus techniques and how malware writers create new malwares. However, these works only studied a limited number of malwares that target certain types of security threats or antivirus engines.

Studying malwares at large scale and with high diversity, what we call big malwares, can expose new insights beyond these isolated studies.

VirusTotal [2] is a popular online service that real-world users use to analyze suspicious files and URLs. VirusTotal does not itself judge whether a submission is malware. Instead, it applies state-of-the-art antivirus engines from more than 50 vendors to each submitted file, and generates a summary report that includes the detection results of all these engines. VirusTotal saves all user-submitted files and generated reports and provides open access to them.

The VirusTotal repository provides a valuable resource to gain insights into the behavior of malwares. First, it contains a huge number of real-world files. For example, more than 40 million suspicious files were submitted in November 2015 (Figure 1). Files on VirusTotal have been submitted by real-world users from all over the world since 2004 and involve various security threats. This amount of diverse data makes VirusTotal a good representative of malwares in the real world.

Second, because VirusTotal applies a host of state-of-the-art antivirus engines to all submitted files, it captures how these engines evolve over time.

Third, VirusTotal provides rich metadata. Besides reporting whether or not a submitted file contains malware, VirusTotal also captures the exact type of malware, which antivirus engines detected the malware, when and by whom the file was submitted, and other useful metadata. There are also active malware researchers and engineers who comment and vote on each submitted file, providing valuable human input.

The VirusTotal repository exposes many new research opportunities. For example, studying malwares over a long time period and across all countries in the world can provide high-level insight into how malwares evolve over time and over geo-locations. Studying how antivirus engines change over time together with how malwares change can reveal the effectiveness of responses to new security threats. Finally, it is interesting to investigate whether we can build online malware prediction tools by applying machine learning techniques to the VirusTotal data. All these insights could in turn help future antivirus researchers and engineers design better malware defense mechanisms.

Unfortunately, there has been little work looking at this valuable repository. In industry, many antivirus vendors use VirusTotal to identify false negatives and false positives in their products. However, they only use VirusTotal to examine suspicious files separately, and do not consider correlations among different suspicious files. In academia, only recently did researchers begin to pay attention to mining the VirusTotal repository. Graziano et al. [7] leveraged VirusTotal user ID information to identify malware writers who use VirusTotal as a test platform.

We propose to investigate the VirusTotal repository along two directions: offline study of its rich data and metadata, and online analysis and prediction of malwares. Our study can provide a better understanding of malware distribution and appearance, and can help malware researchers focus their effort when designing antivirus techniques.

As a first step, we collected one month of VirusTotal repository data with more than 40 million suspicious files and conducted an early-stage empirical study on them. VirusTotal supports downloading both the data files and their metadata. In this study, we focus on the metadata and find that analyzing metadata alone already yields a fair number of valuable conclusions. Studying other data could potentially provide more insights, and we leave it for future work.

We centered our analysis around the concept of malware families. A malware family is a set of malwares that have common behaviors. We focus our initial attention on Windows executable (PE) malwares detected by Microsoft antivirus engines, because our previous experience shows that all of these engines can generate precise results on PE files.

We focus our analysis on two dimensions: temporal properties of malwares and malware distributions. In each dimension, we investigate both offline and online analysis mechanisms.

We began our temporal analysis with general malware characteristics, including the generation rates of malware families and the submission frequencies of malwares. We then study the burstiness of malwares: how closely malwares belonging to the same family occur in time. To understand burstiness, we designed a new cache-based mechanism that can support both offline and online analysis. Moreover, our LRU cache can predict malware family occurrences in the near future, allowing antivirus vendors to better focus their efforts.

We focus our malware distribution study on detecting the skewness of malware families and identifying hot families, that is, families that have many malwares. We observe that the distribution of malware families is highly skewed. To assist online analysis of malware family distribution, we apply a frequent item mining algorithm [12]. This solution can identify hot malware families, including historical ones, in nearly real time using constant memory.

In summary, this paper makes the following contributions:

• We are the first to collect mass data from the VirusTotal repository for large-scale malware study. We collected and preprocessed files submitted to VirusTotal in November 2015 (Section 2).

• We studied the burstiness of malwares using a new cache-based malware prediction tool. This tool achieves greater than 90% prediction precision (Section 3).

• We observe that distributions of malware families are highly skewed, i.e., some families have a huge number of malwares while others include only a few. Leveraging this observation, we built a malware family mining tool that identifies hot malware families (Section 4).

• We identify several key research opportunities on top of the VirusTotal repository (Section 5).

Figure 1: The number of files, PE files, and malwares. The number of suspicious files, PE files, and malwares submitted to VirusTotal in November 2015. (Line chart; x-axis: submission date, 11-02-15 to 11-30-15; y-axis: VirusTotal reports, in millions; series: Submissions, PE Submissions, PE Malwares.)

Figure 2: File types for all submissions in November 2015. File types and their distributions for all submissions on VirusTotal in November 2015. (Pie chart; PE 40%, unknown 11%, Android malware 7%; the remaining categories, Web page, ZIP, Office, PDF, Image, Text, Java, Audio+Video, and Other, each account for 3% to 7%.)

Table 1: VirusTotal metadata. Metadata retrieved from the VirusTotal private API and their explanation. The same file can be submitted multiple times by different users.

Metadata field     Explanation
name               submitted file name
timestamp          timestamp when the submission was made
source country     the country where the submission was made
source id          user ID that made the submission
size               file size
type               file type
tags               labels with more specific information for each file type
first seen         when the same file was first submitted
last seen          when the same file was last submitted
hashes             hash value of the submitted file
total              number of engines that analyzed the file
positives          number of engines that flagged the file as malicious
positives delta    changes in positives across different engine scans
report             detailed detection report from each AV engine

Figure 3: Size distribution for malwares on VirusTotal in November 2015. The number of malwares falling into each log2-based size category in November 2015. (x-axis: file size, log2-based; y-axis: number of malwares, in millions.)

2. Data Collection and Basic Properties

This section discusses how we collected and preprocessed data from VirusTotal. We will present the analysis of these data in the next two sections.

We downloaded the metadata and malware reports of all files submitted in November 2015 using the APIs VirusTotal provides.

The download resulted in a total of 43 million reports. Table 1 shows the metadata fields and their meaning.

Figure 2 shows file type distributions for all submissions. Portable Executable (PE) files have the largest number of submissions. For around 11.5% of submissions, VirusTotal cannot determine the file type. Malwares on Android have the third largest number of submissions. Other popular file types include web pages, compressed files, PDFs, images, and so on.

In this paper, we only focus on PE files and leave the analysis of other types of malicious files for future work. We filter all downloaded metadata by the type field. If the type field is either "Win32EXE" or "Win32DLL", we consider the record a PE file. Antivirus engines may disagree with each other, so we rely only on the Microsoft antivirus engine to judge whether a PE file is malware, because Microsoft has a very good reputation in detecting PE malwares. Within the 43 million reports, 4.7 million are PE malwares. The numbers of reports, PE reports, and PE malwares submitted each day are shown in Figure 1.

Malwares can be classified into related clusters according to their dynamic behaviors and static features. These clusters are referred to as malware families in our paper. The Microsoft antivirus engine classifies malwares under different names [5]. We use this naming mechanism to decide which family a malware belongs to. Specifically, if two malwares have the same name under the Microsoft engine, we consider them to be from the same family.

One caveat of VirusTotal is that its APIs may return redundant reports for the same submitted file. We use the combination of sha256 hash value and timestamp to detect and remove redundant reports.

After removing redundant reports, we find that most malwares were submitted only once to VirusTotal in November 2015. Out of the total 4.7 million PE malware submissions, 4 million are distinct. On average, each PE malware was submitted 1.17 times to VirusTotal in November 2015. This observation contradicts the common belief that most malwares are encountered by more than one user. We suspect that the reason behind this low degree of repeated submissions by different users in one month is that VirusTotal users tend to check whether their suspicious files have already been submitted recently and avoid submitting redundant files.

Figure 3 shows the file size distribution for VirusTotal malwares. The smallest malware is only 704 bytes, and the largest one is more than 502 MB. 95.3% of malwares fall into the range from 16 KB to 2 MB. VirusTotal does not provide tags that directly distinguish 64-bit malwares from 32-bit malwares. We sampled 10000 malwares and downloaded their executable binaries from VirusTotal. We applied the Linux command file to each sampled malware binary. 64-bit malwares are labeled PE32+ by the file command. In total, there are only 127 64-bit malwares in our sample set, and all other malwares are 32-bit.

As with all previous empirical studies, our findings, experimental results, and conclusions need to be considered with our methodology in mind.

We only use one month of data as our first step in studying VirusTotal. We leave investigating longer periods of time for future work. The VirusTotal APIs only approximately track which submission reports have been sent to each downloader, and there is no guarantee that all submission reports on VirusTotal can be downloaded successfully. Thus, it is possible that we missed some malwares submitted to VirusTotal. Also, we only leverage Microsoft antivirus engines to decide whether or not a submission is malicious, and it is possible that Microsoft antivirus engines cannot make this decision precisely. How to get a precise label for a PE file is out of the scope of this paper. Although there is a huge number of malwares on VirusTotal, we believe that there are malwares never submitted to VirusTotal, and there are malwares submitted much later than when they appear in the real world. Since there are no conceivable ways to study these malwares, we believe that the malwares in our study provide a representative malware sample of the real world.

3. Malware Temporal Properties

This section presents our study of the temporal properties of VirusTotal malwares and answers two fundamental questions: how many malware families appear every day, and whether or not malwares occur in bursts. To answer the second question, we design a new caching mechanism that can be used for both offline and online malware prediction. This cache-based malware prediction technique can predict which malware families will appear in the near future with high precision.

We first study how many new malware families appear every day. Figure 4 shows the number of new malware families appearing on each day in November 2015. Since we do not include data before November 1, there are more new malware families in the first few days. After that, the number of new malware families becomes stable, falling into a range between 100 and 400. In total, there are 11302 malware families in this time period.

Observation 1: 100-400 new malware families appear each day.

Next, we investigate whether malwares exhibit temporal locality. Temporal locality is an important metric that can guide the prediction of near-future malwares.

Specifically, we analyze how bursty malwares in the same family appear. Inspired by the previous use of cache mechanisms in predicting bugs [11], we design a new cache mechanism to study the burstiness of malwares.

This caching mechanism can be used to analyze historical malware reports offline as well as to analyze online malwares and predict near-future malware occurrences.
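Before the cache can be studied, the downloaded reports have to be turned into a time-ordered stream of malware families. The following is a minimal sketch of the Section 2 preprocessing (type filter, removal of redundant reports by sha256 and timestamp, Microsoft name as family). It is an illustration, not the authors' tool: the dictionary keys `type`, `sha256`, `timestamp`, and `microsoft_label` are hypothetical names chosen for this sketch and are not taken from the VirusTotal API.

```python
PE_TYPES = {"Win32EXE", "Win32DLL"}  # type values quoted in Section 2


def preprocess(reports):
    """Yield (timestamp, family) pairs for PE malwares in time order.

    Each report is a plain dict. The keys used here are hypothetical
    names for this sketch, not real VirusTotal field names.
    """
    seen = set()
    for report in sorted(reports, key=lambda r: r["timestamp"]):
        if report["type"] not in PE_TYPES:
            continue  # keep PE files only
        label = report.get("microsoft_label")
        if not label:
            continue  # the Microsoft engine did not flag this file
        key = (report["sha256"], report["timestamp"])
        if key in seen:
            continue  # redundant report for the same submission
        seen.add(key)
        # Two malwares with the same Microsoft name share a family.
        yield report["timestamp"], label


if __name__ == "__main__":
    sample = [
        {"type": "Win32EXE", "sha256": "a", "timestamp": 2, "microsoft_label": "X"},
        {"type": "Win32EXE", "sha256": "a", "timestamp": 2, "microsoft_label": "X"},
        {"type": "PDF", "sha256": "b", "timestamp": 1, "microsoft_label": "Y"},
    ]
    print(list(preprocess(sample)))
```

The generator keeps only the set of already-seen (sha256, timestamp) keys in memory, so its output can be fed directly into a streaming analysis.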

Figure 4: New malware families on VirusTotal in November 2015. The number of new malware families we observed every day in November 2015. (x-axis: date, 11-02-15 to 11-30-15; y-axis: number of new malware families appeared.)

Figure 5: Relation between cache hit rate and cache size. Cache hit rate under different values of cache size from 10 to 1000. (x-axis: the number of cache entries; y-axis: cache hit rate, %.)

Figure 6: Cache hit rate in November 2015. Cache hit rate every day in November 2015 if we only update cache content at the end of every day. (x-axis: date; y-axis: cache hit rate, %.)

Figure 7: Skewness of malware submissions from different countries in November 2015. Cumulative distribution of malwares submitted from different countries in November 2015. (x-axis: % of countries; y-axis: CDF of malwares.)

Figure 8: Skewness of malware families appearing in November 2015. Cumulative distribution of malwares in each malware family in November 2015. (x-axis: % of hottest malware families; y-axis: CDF of malwares.)

Figure 9: Relation between φ and degree of overlap. How the degree of overlap changes with the change of φ from $10^{-1}$ to $10^{-3}$. (x-axis: 1/φ; y-axis: degree of overlap, %.)

We view malware submission reports as a stream of inputs and feed the stream into a cache. Each input (a report) is associated with an address (the family it belongs to) and a time (the submission time of the file). Thus, a cache hit means that the new report belongs to a family that is currently in the cache, i.e., the new input has the same address as an entry in the cache. Therefore, a high cache hit rate implies that the occurrence of malwares exhibits temporal locality.

As with general cache mechanisms such as CPU caches and file system buffer caches, there are several parameters to tune in a cache. The cache block size, or cache line size, controls the granularity of cache management, i.e., which entries are inserted into and evicted from the cache together. Cache prefetching loads spatially close entries into the cache in advance. The cache replacement policy controls which entries to evict when the cache is full. In our evaluation, we fix the cache block size to one, use no prefetching, and use a simple LRU replacement policy. We leave the exploration of the effect of other cache configurations for future work.

Our malware family cache works as follows. We start with an empty cache. For a new file submission, if the malware family is already in the cache, we move the hit cache entry to the front of our cache entry list. If the new malware family is not in the cache, we create a new cache entry and add it to the front of the cache entry list. If the cache is full, we evict the entry at the end of the list.

We implemented this malware family cache using Python and conducted experiments on an AWS c4.4xlarge virtual machine, which contains 16 virtual CPUs and 30 GB memory.

We conducted two experiments. The first experiment explores how the cache hit rate changes with the number of cache entries. We change the cache size from 10 to 1000. As shown in Figure 5, the hit rate grows from 69.31% to 98.14%. When using more than 80 cache entries, which is less than 1% of the total number of malware families, the cache hit rate rises above 90%, and when using more than 230 cache entries, which is less than 3% of the total number of malware families, the cache hit rate rises above 95%.
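The malware family cache described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lru_hit_rate` is hypothetical, and the sketch keeps the most recently used family at the end of an `OrderedDict` rather than at the front of a list, which is equivalent for LRU purposes.

```python
from collections import OrderedDict


def lru_hit_rate(families, cache_size):
    """Replay a stream of family names through an LRU cache.

    Block size is one entry, there is no prefetching, and the
    replacement policy is LRU, matching the configuration in the text.
    Returns the fraction of inputs that hit the cache.
    """
    cache = OrderedDict()  # most recently used family sits at the end
    hits = 0
    total = 0
    for family in families:
        total += 1
        if family in cache:
            hits += 1
            cache.move_to_end(family)  # refresh the hit entry
        else:
            cache[family] = True  # create a new entry
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


if __name__ == "__main__":
    stream = ["a", "a", "b", "a", "c", "b"]
    for size in (1, 2, 3):
        print(size, lru_hit_rate(stream, size))
```

Sweeping `cache_size` over a range such as 10 to 1000 on the real family stream would reproduce the kind of curve summarized in Figure 5.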

The high cache hit rate and the small cache size confirm that malwares occur in bursts.

Observation 2: The occurrence of malwares in each family has strong temporal locality.

To support online malware occurrence prediction, it is essential to lower the performance overhead of running our cache mechanism. To this end, we lower the cache content update frequency from once per malware report to once per day. That is, we keep the cache content unchanged while counting cache hits and cache misses during each day, and update the cache content at the end of each day. In this second experiment, we fix the cache size to 200. Figure 6 shows the prediction rate measured on each following day's data. On most days the cache hit rate is above 70%, showing that even when lowering the performance overhead, i.e., updating cache content once a day, our cache mechanism still achieves a good estimation of malware occurrences. We do not need to combine the malwares encountered at each client site and send out updated cache content frequently. Cache-based malware prediction can therefore be used in an online scenario.

We also study which malware families cause more cache misses and have bad prediction results. We count cache misses for each malware family, and consider a family in which more than half of the malwares cause cache misses to be a family with bad prediction results. We find that whether a family has bad prediction results is related to the number of malwares it contains. When the cache size is 100, the largest number of malwares contained in one family with bad prediction results is 2307. When the cache size is 1000, this number drops to 126.

4. Malware Distribution

This section presents our study of how malwares are distributed across countries and malware families.

Apart from around 0.6 million malwares for which VirusTotal does not provide submission countries, all malwares were submitted from 164 countries. The top 5 countries by number of submissions are Canada, USA, China, France, and Germany. As shown in Figure 7, malware submissions are highly skewed among different countries.

We count how many malwares fall into different malware families. As shown in Figure 8, only a small number of malware families are hot, i.e., contain a large number of malwares. The distributions of malware families are highly skewed too.

Observation 3: Distributions of malwares are highly skewed across countries and malware families.

The highly skewed distribution of malware families makes it easy to precisely identify hot malware families. While the simple bin-counting mechanism works well on our one-month testing data, with a total of 11302 distinct malware families, applying the same mechanism to a large dataset can cause significant memory space and performance overhead.

To reduce these overheads and better support online analysis of huge datasets, we need to seek alternative mechanisms. The skewness of malware families indicates that we can apply a frequent item mining algorithm to identify hot malware families.

Frequent item mining algorithms take two configuration parameters, φ and ε, where φ > ε. The goal of frequent item mining algorithms is to provide nearly real-time analysis of massive data streams using constant memory. Assuming the length of the input stream is N, the output of a frequent item mining algorithm includes all items that appear more than $\lfloor \phi N \rfloor$ times and does not include any item that appears less than $\lfloor \varepsilon N \rfloor$ times.

The frequent item mining algorithm we use is a space-efficient algorithm [12], called space saving, that was proposed for streams in Internet advertising and has already been applied in other areas, such as mining hot calling contexts in profilers [6]. The space saving algorithm tracks M = 1/ε pairs of (f, c), where f is short for malware family and c is short for counter. The content of these pairs represents the (φ,ε)-HMF (Hot Malware Family). The M pairs are initialized with the first M encountered malware families and their frequencies. When a new malware submission arrives, if its malware family is already being monitored, the related counter is increased by 1. If the malware family is not being monitored, we replace the malware family of the pair with the lowest counter value with the incoming malware family and increase its counter value by 1. When querying the HMF, all malware families whose counter values are larger than $\lfloor \phi N \rfloor$ are returned.

We implement the space saving algorithm using Python and conduct experiments on the same system as in Section 3. Following previous work on frequent item mining [6], we measure the following metrics using the malware submission data we collected:

1. Degree of overlap measures the percentage of malwares covered in the (φ,ε)-HMF. It is defined as follows:

\[ \mathrm{overlap}((\phi,\varepsilon)\text{-HMF}) = \frac{1}{N} \sum_{f \in (\phi,\varepsilon)\text{-HMF}} w(f) \]

where $w(f)$ represents the real frequency of malware family $f$.

2. MaxUncover is short for maximum frequency of uncovered malware families. It is defined as follows:

\[ \mathrm{maxUncover}((\phi,\varepsilon)\text{-HMF}) = \max_{f \notin (\phi,\varepsilon)\text{-HMF}} \frac{w(f)}{H(f)} \]

where $H(f)$ is the maximum frequency of all malware families.

3. False positives are defined as malware families returned when querying the HMF but whose real frequencies are less than $\lfloor \phi N \rfloor$. The space saving algorithm is designed to guarantee that there will be no false negatives.

4. MaxError measures the relative error of counter values compared to their real frequencies. It is defined as follows:

\[ \mathrm{maxError}((\phi,\varepsilon)\text{-HMF}) = \max_{f \in (\phi,\varepsilon)\text{-HMF}} \frac{|c(f) - w(f)|}{w(f)} \]
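The space saving procedure described in this section can be sketched as follows. This is a minimal illustration, not the authors' tool: the function names are hypothetical, the number of monitored pairs M = 1/ε is passed directly as `capacity` to avoid floating-point rounding, and the pair with the lowest counter is found by a linear scan for brevity, where a production implementation would use a more efficient structure.

```python
def space_saving(stream, capacity):
    """One pass of the space saving algorithm over a stream of families.

    `capacity` is M, the number of monitored (family, counter) pairs.
    Returns the monitored counters and the stream length N.
    """
    counters = {}
    n = 0
    for family in stream:
        n += 1
        if family in counters:
            counters[family] += 1  # already monitored
        elif len(counters) < capacity:
            counters[family] = 1  # still filling the first M pairs
        else:
            # Replace the pair with the lowest counter; the newcomer
            # inherits that counter plus one (a conservative overestimate).
            victim = min(counters, key=counters.get)
            counters[family] = counters.pop(victim) + 1
    return counters, n


def query_hot_families(counters, n, phi):
    """Return families whose counter exceeds floor(phi * N)."""
    threshold = int(phi * n)  # floor for non-negative values
    return {f for f, c in counters.items() if c > threshold}


if __name__ == "__main__":
    counters, n = space_saving(["a", "b", "a", "c", "a", "d"], capacity=3)
    print(counters, n, query_hot_families(counters, n, phi=0.4))
```

Because an evicted counter is inherited rather than reset, a monitored counter can only overestimate the real frequency. This is why the approach cannot miss a truly hot family, and also why the counters of recently admitted families can be far from their real frequencies.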

Figure 10: Relation between φ and degree of maxUncover. How the degree of maxUncover changes with the change of φ from $10^{-1}$ to $10^{-3}$. (x-axis: 1/φ; y-axis: degree of maxUncover, %.)

Figure 11: Relation between φ and maxError. How maxError changes with the change of φ from $10^{-1}$ to $10^{-3}$. (x-axis: 1/φ; y-axis: maxError.)

Figure 12: False positives. False positives in (φ,ε)-HMF as a function of ε. The value of φ is fixed to $10^{-2}$. The value of φ/ε changes from 2 to 15. (x-axis: φ/ε; y-axis: false positives, %.)
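The four metrics used in this evaluation (degree of overlap, maxUncover, false positives, and maxError) can be computed from the monitored counters and the exact per-family frequencies. The sketch below is one reading of the definitions, not the authors' code: it assumes that the (φ,ε)-HMF is the full set of monitored pairs, that the families "returned when querying" are those with counters above ⌊φN⌋, and that H(f) is the frequency of the single hottest family. The function name and the returned keys are hypothetical.

```python
import math


def hmf_metrics(counters, true_counts, phi):
    """Compute the four evaluation metrics for a set of monitored counters.

    `counters` maps monitored families to space saving counter values;
    `true_counts` maps every family to its exact frequency w(f).
    """
    n = sum(true_counts.values())
    threshold = math.floor(phi * n)
    monitored = set(counters)
    returned = {f for f, c in counters.items() if c > threshold}

    # Degree of overlap: share of all malwares in monitored families.
    overlap = sum(true_counts[f] for f in monitored) / n

    # maxUncover: hottest unmonitored family relative to the hottest family.
    hottest = max(true_counts.values())
    uncovered = [w for f, w in true_counts.items() if f not in monitored]
    max_uncover = max(uncovered) / hottest if uncovered else 0.0

    # False positives: returned families that are not really hot.
    false_positives = {f for f in returned if true_counts[f] < threshold}

    # maxError: largest relative counter error among monitored families.
    max_error = max(
        (abs(counters[f] - true_counts[f]) / true_counts[f] for f in monitored),
        default=0.0,
    )
    return {
        "overlap": overlap,
        "max_uncover": max_uncover,
        "false_positives": false_positives,
        "max_error": max_error,
    }


if __name__ == "__main__":
    counters = {"a": 3, "c": 1, "d": 2}
    true_counts = {"a": 3, "b": 1, "c": 1, "d": 1}
    print(hmf_metrics(counters, true_counts, phi=0.4))
```

The exact frequencies in `true_counts` require the full bin-counting pass that the streaming algorithm is meant to avoid, so this computation is only suitable for offline evaluation.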

There are two configuration parameters in the space saving algorithm, φ and ε, and the number of monitored (f, c) pairs is directly controlled by ε. Following previous experience in applying the space saving algorithm [6], we set ε = φ/5 as the default, unless we explicitly state otherwise.

We first evaluate how the degree of overlap changes with φ. The degree of overlap describes how many malwares are monitored in the (φ,ε)-HMF, and the larger it is, the better. As shown in Figure 9, after we change φ from 1/10 to 1/100, the degree of overlap increases from 79.37% to 95.51%. The degree of overlap further increases to 99.70% after we change φ to 1/1000.

We then study how maxUncover changes with φ. MaxUncover describes the malwares not monitored in the (φ,ε)-HMF, and the lower it is, the better. As shown in Figure 10, maxUncover decreases by an order of magnitude each time we decrease φ by an order of magnitude.

Figure 11 shows how maxError changes with φ. maxError, the largest relative error of the counters in the (φ,ε)-HMF with respect to the real family frequencies, describes how precise the counters are, and the lower it is, the better. maxError drops from 2177 to 998 after we change the value of φ from 1/10 to 1/100, and becomes 10 after we change φ to 1/1000. The large maxError values are due to the fact that the space saving algorithm conservatively assumes that the frequency of a new malware family is one larger than the smallest counter value of all monitored malware families.

In the last experiment, we fix φ to $10^{-2}$ and change φ/ε from 2 to 15 to evaluate how false positives change. As shown in Figure 12, the space saving algorithm consistently reports 0 false positives in our experiments; by design, it also produces no false negatives.

5. Research Opportunities

Although our initial investigation of VirusTotal is successful, there are many more research opportunities beyond our initial study.

Studying correlations among metadata fields. We currently only leverage timestamps and Microsoft detection reports. There are many other metadata fields. Conducting data mining on these fields, especially on their correlations, can enable many other "big malwares" applications. For example, future research can explore correlations between the features of malwares and their detection rates to understand which features are ignored by existing antivirus engines.

Utilizing other antivirus vendors' reports. We currently only leverage reports from Microsoft, but these reports may not always be accurate. There are more than 40 antivirus vendors' reports provided on VirusTotal. Some vendors' reports may be better than others, or better than others under some special conditions. We leave efforts to evaluate all those reports systematically and to better combine reports from different vendors for future work.

Studying other types of malicious files. Besides PE files, there are other types of malicious files on VirusTotal, such as malicious apps, URLs, and binary files on non-Windows OSes. How these malicious files behave remains an open question.

Using other analysis granularity. Currently, we use the malware family as the prediction granularity. One can predict malwares using finer granularities. For example, ssdeep values are also provided in VirusTotal metadata, and these values can be used to cluster malwares. One promising approach is to cluster malwares in each family first, and then use the cluster as the prediction granularity.

Leveraging other information on VirusTotal. Besides the static information discussed in Section 2, VirusTotal also hosts some behavior data for each malware sample. Mining these data can help us understand which behaviors are more prominent in malwares, and which vulnerabilities are more likely to be used by malwares, both of which can be used as indicators to detect new maliciousness.

Training machine learning models by using VirusTotal data. VirusTotal provides a huge set of labeled malwares. It is interesting to consider leveraging VirusTotal data to train a machine learning model, and applying the trained model to conduct malware detection and classification. However, we need to figure out a feature set before training the model. Which information provided by VirusTotal can be included in the feature set remains an open question. If we need features beyond the information provided by VirusTotal, whether the feature extraction scales with the size of VirusTotal data also remains an open issue. Neural networks take binaries as inputs and can extract features automatically. However, neural networks require all binary inputs to have the same size. It is easier to resize images. If we want to apply neural networks to malwares, how could we resize malware?

Improving antivirus products. Finally, there are opportunities to improve existing antivirus products. For example, many endpoint antivirus software products, like ClamAV, are built on a database of malware signatures. When such antivirus software scans suspicious files, all signatures in the database are checked. It is possible to explore how to automatically extract malware signatures by leveraging VirusTotal data, instead of extracting them manually. And if we can precisely predict which malwares will appear in the near future, we could reduce the size of the signatures sent to the client side and also reduce the time to check the signature database.

6. Related works

Many research efforts [1, 4, 9, 13, 14] have been made to explore how to leverage "big code" repositories, such as GitHub, BitBucket, and CodePlex, and these works inspire us to explore how to leverage data on VirusTotal. SLANG [13] can fill incomplete programs with call invocations by using statistical models trained on sequences of API calls extracted from large code bases. JSNICE [14] can predict identifier types and obfuscated identifier names for JavaScript programs. JSNICE translates programs into dependence graphs and learns a CRF model by using a large training set. All predictions are made by optimizing a score function based on the learned CRF model. Karaivanov et al. [9] apply phrase-based statistical translation approaches to translate C# programs to Java. To sum up, all these techniques are built on source code repositories, and their goals are to improve the development stage. In contrast, VirusTotal is a repository containing binary malwares, and the goal of conducting data mining on VirusTotal data is to improve antivirus techniques.

Existing works [7, 15] mine VirusTotal data to identify malware development cases, where malware writers use VirusTotal as a testing platform and try to develop malwares that cannot be detected by antivirus engines. These techniques utilize submission ID information, which is different from the information we use. We believe other information on VirusTotal could also be leveraged in the future.

7. Conclusion

VirusTotal provides a fruitful opportunity to understand real-world malwares at large scale. Unfortunately, it has been largely overlooked by the research community. In this paper, we conduct an empirical study on PE malwares on VirusTotal and analyze their temporal and family distribution characteristics. We expect our work to deepen our understanding of the data on VirusTotal and to bring more attention to it.

References

[1] learnbigcode. URL: http://learnbigcode.github.io/.
[2] VirusTotal. URL: https://www.virustotal.com/.
[3] AV-TEST. Malware Statistics. URL: https://www.av-test.org/en/statistics/malware/.
[4] P. Bielik, V. Raychev, and M. Vechev. Programming with "Big Code": Lessons, Techniques and Applications. In SNAPL, 2015.
[5] M. M. P. Center. Naming malware. URL: https://www.microsoft.com/security/portal/mmpc/shared/malwarenaming.aspx.
[6] D. C. D'Elia, C. Demetrescu, and I. Finocchi. Mining hot calling contexts in small space. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, pages 516–527, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0663-8. URL http://doi.acm.org/10.1145/1993498.1993559.
[7] M. Graziano, D. Canali, L. Bilge, A. Lanzi, and D. Balzarotti. Needles in a haystack: Mining information from public dynamic analysis sandboxes for malware intelligence. In Proceedings of the 24th USENIX Conference on Security Symposium, SEC'15, pages 1057–1072, Berkeley, CA, USA, 2015. USENIX Association. ISBN 978-1-931971-232. URL http://dl.acm.org/citation.cfm?id=2831143.2831210.
[8] A. Gupta, P. Kuppili, A. Akella, and P. Barford. An empirical study of malware evolution. In Proceedings of the First International Conference on COMmunication Systems And NETworks, COMSNETS'09, pages 356–365, Piscataway, NJ, USA, 2009. IEEE Press. ISBN 978-1-4244-2912-7. URL http://dl.acm.org/citation.cfm?id=1702135.1702182.
[9] S. Karaivanov, V. Raychev, and M. Vechev. Phrase-based statistical translation of programming languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, Onward! 2014, pages 173–184, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-3210-1. URL http://doi.acm.org/10.1145/2661136.2661148.
[10] Kaspersky. Kaspersky Security Bulletin 2015. URL: https://securelist.com/analysis/kaspersky-security-bulletin/73038/kaspersky-security-bulletin-2015-overall-statistics-for-2015/.
[11] S. Kim, T. Zimmermann, E. J. Whitehead Jr., and A. Zeller. Predicting faults from cached history. In Proceedings of the 29th International Conference on Software Engineering, ICSE '07, pages 489–498, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2828-7. URL http://dx.doi.org/10.1109/ICSE.2007.66.
[12] A. Metwally, D. Agrawal, and A. E. Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst., 31(3):1095–1133, Sept. 2006. ISSN 0362-5915. URL http://doi.acm.org/10.1145/1166074.1166084.
[13] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 419–428, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2784-8. URL http://doi.acm.org/10.1145/2594291.2594321.
[14] V. Raychev, M. Vechev, and A. Krause. Predicting program properties from "big code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '15, pages 111–124, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3300-9. URL http://doi.acm.org/10.1145/2676726.2677009.
[15] K. Zetter. A Site Meant to Protect You Is Helping Hackers Attack You. URL: https://www.wired.com/2014/09/how-hackers-use-virustotal/.
[16] Y. Zhou and X. Jiang. Dissecting android malware: Characterization and evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy, SP '12, pages 95–109, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4681-0. URL http://dx.doi.org/10.1109/SP.2012.16.
