Malpedia: a Collaborative Effort to Inventorize the Malware Landscape
Total Page:16
File Type:pdf, Size:1020Kb
The Journal on Cybercrime & Digital Investigations, Vol. 3, No. 1, Dec. 2017 Botconf 2017 Proceedings Malpedia: A Collaborative Effort to Inventorize the Malware Landscape Daniel Plohmann1, Martin Clauß1, Steffen Enders2, Elmar Padilla1 1Fraunhofer FKIE, 2TU Dortmund It is published in the Journal on Cybercrime & Digital Investigations by CECyF, https://journal.cecyf.fr/ojs cb It is shared under the CC BY license http://creativecommons.org/licenses/by/4.0/. Abstract 1 Introduction For more than a decade now, a perpetual in- flux of new malware samples can be observed. It is a well-known fact that the number of registered To analyze this flood effectively, static analy- unique malware samples observed since 2005/2006 sis is still one of the most important methods. has drastically increased, and currently sits at almost Thus, it would be highly desirable to have an 700 million samples as e.g. tracked by AV Test [2]. open, freely accessible, curated, and cleanly la- This is a direct consequence of the common use of beled corpus of unpacked malware samples for so called packers, auxiliary programs that contain rou- research on static analysis methods. In this pa- tines such as compression or encryption algorithms per, we introduce Malpedia, a collaboration plat- used to alter the appearance of the respective payload form for curating a malware corpus. Addition- ally, we provide a baseline for a cleanly labeled code in order to protect it against detection or analy- malware corpus consisting of 607 families di- sis. In fact, it can be assumed that the actual number vided into 1792 samples. This corpus offers of malware families or versions is significantly below a plethora of possibilities for researchers, in- these figures. Nevertheless, these observations can cluding using it as a testbed for evaluations on be taken as a symbol for the massive gain in impor- detection and analysis methods, quality assur- tance of malware over the last decade, be it as a tool ance for classification, and contextualization of for digital crime or in the context of state-sponsored new malware. activities. To ensure the quality of our corpus, we The overwhelming number of samples and its fast- adapted the requirements by Rossow et al. [1], paced growth has strong implications on malware re- derive specific requirements for the context of static malware analysis, and evaluate our cor- search. They lead to a high demand for advanced pus against them. methods to detect, analyze, and contextualize mal- Based on our corpus, we show that looking ware. While researching such methods, it is of ut- beyond packers dramatically reduces the size termost importance to have representative data for needed for a corpus to be representative, as the evaluation available. Rossow et al. [1] have shown number of distinct malware families and ver- that in past academic studies there has been a lack sions after unpacking is orders of magnitude of comprehensiveness and representativeness in the smaller than the number of unique packed sam- data used. This renders academic research less ef- ples. Additionally, we perform a comprehen- fective. sive study of the Windows malware in the cor- Ideally, researchers would have access to an inde- pus, scrutinizing its structural features. This analysis clearly illustrates that Malpedia offers pendent, pooled resource that provides them with con- a wealth of information, readily available for in- fidently labeled, unpacked reference samples for mal- depth investigations. ware families and versions, alongside available meta information such as pointers to analysis reports or de- Keywords: malware corpus, malware analysis tection capabilities such as accurate YARA rules. Our Daniel Plohmann, Martin Clauß, Steffen Enders, Elmar Padilla. Malpedia: A Collaborative Effort to Inventorize the Malware Landscape 1 The Journal on Cybercrime & Digital Investigations, Vol. 3, No. 1, Dec. 2017 Botconf 2017 Proceedings evaluation shows that there is in fact huge redundancy of the platform that we have created to share the cor- in packed samples versus unpacked reference sam- pus with the community and with which we want to ples for families and versions. Thus, such a corpus curate it in the future. In Section5, we use the cur- could likely be distilled down to a couple thousand in- rent status of the data set to conduct a comparative stead of many hundred million files. study across various structural features, showcasing To address the above-mentioned situation, in this the usefulness of the Malpedia corpus for static anal- paper we introduce an independent platform called ysis. Section6 summarizes related work while Sec- Malpedia. It allows malware researchers to contribute tion7 concludes the paper. to a centrally curated, free corpus. Initially, it of- fers access to a corpus we created since 2015, span- ning around 1800 cleanly labeled samples represent- 2 Goals and Requirements for a ing more than 600 families for several platforms, in- cluding Android, macOS/iOS, Linux/ELF and Microsoft Malware Corpus focusing on Windows. Static Analysis To show the usefulness of Malpedia and the corpus in its current stage, we conduct a comparative study on several characteristics across the families of Win- In this section, we first define general goals that a mal- dows malware archived at the time of writing. This ware corpus focusing on static analysis should fulfil. study leads to the following core results: We then focus on requirements to ensure that a • Packers mostly serve as an initial barrier against corpus is created towards these goals. For this rea- detection. son, we recapitulate the guidelines to consider when • The number of unique unpacked samples is planning and executing malware experiments, as de- orders of magnitude smaller than the one of fined by Rossow et al. [1] in 2012. After checking their packed samples. applicability for this special case, we rephrase and ex- • Unpacked samples can be conveniently treated tend these aspects and combine them into our own set with methods of static analysis. of requirements. • The information extracted even with just cursory methods already gives an interesting insight in preferences and choices of malware authors. 2.1 Goals We implement our comparative analysis in such a way, that it can be continually executed over the grow- The primary goal of any malware corpus should be to ing data set in the future and provide updated results contain representative data. Ideally, it should cover via the web portal found at https://malpedia.caad. longer periods of time within the malware landscape, fkie.fraunhofer.de. in order to allow comprehensive measurements, re- In summary, we make the following contributions flecting changes over time. In that sense, it should also with our paper: naturally be updated as necessary to keep up with the • We define a set of requirements that malware frequent developments as encountered when dealing corpora intended for static analysis should fol- with malware. low. Furthermore, the corpus should provide the data in • We introduce a vetted collaboration platform an easily accessible format. We believe that providing with the mission to curate a reference mal- some verified unpacked format of the samples adds a ware corpus, providing unpacked and dumped huge benefit over just providing their original state (be samples to specifically address static analysis it packed or not). As the corpus described here is in- needs. tended to be specifically suited for static analysis, this • We provide a malware corpus for free to the re- allows to directly orientate analyses from a viewpoint search community. behind the classical packer-barrier. Just as important, • We perform a comprehensive, quantitative static a malware corpus should provide rich meta informa- analysis of structural features across 446 fami- tion, at the very least accurate labels, in order to better lies of Windows malware. judge and classify findings derived from it. Note that • We show that it is viable to assume that pack- the packer-barrier has a strong presence in essentially ers serve mostly as an initial barrier and beyond every malware feed or data set currently available, and this, the number of distinguishable families and meta information is usually limited to detection labels versions is magnitudes smaller than the number as provided by anti-virus software. of unique samples encountered in the wild. Another goal should be that such a malware corpus The remainder of this document is structured as is a resource equally useful for practitioners to ensure follows. Section2 specifies requirements which a mal- it is being reviewed for relevance from different per- ware corpus optimized for static analysis should fol- spectives. By providing vendor-neutrality in its cover- low. Section3 presents the way in which we have age, it could also serve as a consensual ground-truth implemented these requirements to build our corpus, among malware researchers, both for naming in the Malpedia, and provides an outline of its current sta- concrete case and as a source for the verification of tus. We use Section4 to give insight into the design identification measures in general. 2 Daniel Plohmann, Martin Clauß, Steffen Enders, Elmar Padilla. Malpedia: A Collaborative Effort to Inventorize the Malware Landscape The Journal on Cybercrime & Digital Investigations, Vol. 3, No. 1, Dec. 2017 Botconf 2017 Proceedings 2.2 Adoption of Prudent Practices Fourth and finally, safety and with that the contain- ment policy is mostly of importance for the further dis- Rossow et al. [1] defined their set of guidelines in a way semination of the resulting corpus (cp. 2.3.6). that they would be applicable for a wide range of ex- periments involving malware. The guidelines address evaluations of techniques for detection, classification, 2.3 Definition of Requirements and behavioral analysis, based on dynamic, static, or combined analysis methods. Having reviewed the prudent practices for reasonably This paper focuses specifically on the creation applicable components, we now formulate a set of re- of a malware corpus optimized for static analysis.