Arxiv:1710.10225V2 [Cs.CR] 14 Jul 2020 to Counteract Such Attacks, Researchers Proposed So- Such As Gradient-Descent)
Total Page:16
File Type:pdf, Size:1020Kb
Adversarial Detection of Flash Malware: Limitations and Open Issues Davide Maiorcaa,1,∗, Ambra Demontisa,1, Battista Biggioa,b, Fabio Rolia,b, Giorgio Giacintoa aDepartment of Electrical and Electronic Engineering, University of Cagliari, Piazza dArmi 09123, Cagliari, Italy bPluribus One, Italy Abstract During the past four years, Flash malware has become one of the most insidious threats to detect, with almost 600 critical vulnerabilities targeting Adobe Flash Player disclosed in the wild. Research has shown that machine learning can be successfully used to detect Flash malware by leveraging static analysis to extract information from the structure of the file or its bytecode. However, the robustness of Flash malware detectors against well-crafted evasion attempts - also known as adversarial examples - has never been investigated. In this paper, we propose a security evaluation of a novel, representative Flash detector that embeds a combination of the prominent, static features employed by state-of-the-art tools. In particular, we discuss how to craft adversarial Flash malware examples, showing that it suffices to manipulate the corresponding source malware samples slightly to evade detection. We then empirically demonstrate that popular defense techniques proposed to mitigate evasion attempts, including re-training on adversarial examples, may not always be sufficient to ensure robustness. We argue that this occurs when the feature vectors extracted from adversarial examples become indistinguishable from those of benign data, meaning that the given feature representation is intrinsically vulnerable. In this respect, we are the first to formally define and quantitatively characterize this vulnerability, highlighting when an attack can be countered by solely improving the security of the learning algorithm, or when it requires also considering additional features. We conclude the paper by suggesting alternative research directions to improve the security of learning-based Flash malware detectors. Keywords: Adobe Flash, Malware Detection, Secure Machine Learning, Adversarial Training, Computer Security 1. Introduction namely, well-crafted manipulations of the input samples at test time [8, 9, 10] (also recently referred to as adversarial Malware detection is still a critical priority for re- examples in the context of deep learning [11, 12, 13]). The searchers and industry, as countless of polymorphic attacks efficacy of such attacks depends on the information that are continually being released [1]. In particular, a very in- the attacker possesses about the system, and on her ability sidious threat is represented by infection vectors, i.e., files to perform enough changes to the extracted features. that exploit vulnerabilities of third-party applications to The robustness of machine learning approaches against drop or execute malicious executables. Such vectors are adversarial attacks has been assessed in various case stud- typically documents (e.g., PDF, Office) or multimedia files ies. In particular, some works employed code obfuscation (e.g. Flash), and attackers leverage their structure to con- through off-the-shelf tools or custom techniques [14, 15, ceal scripting codes that exploit the target vulnerabilities. 16, 17, 18] to perform evasion. Other approaches directly For example, pages of PDF documents can hide malicious targeted the learning algorithms by employing black-box JavaScript codes or even additional executables [2, 3]. and white-box evasion techniques (e.g., by using algorithms arXiv:1710.10225v2 [cs.CR] 14 Jul 2020 To counteract such attacks, researchers proposed so- such as gradient-descent). Popular case-studies in which lutions based on machine-learning algorithms applied to these techniques have been systematically studied are the information extracted by employing static or dynamic detection of malicious PDF files [3, 19, 20, 8] and the de- analysis of the embedded code, with excellent results for tection of Android malware [21, 22, 23]. document-based infection vectors (e.g. [4, 5, 6, 7]). How- ever, research also showed that such detection approaches Multimedia malware, in particular in the Flash (also have several robustness issues against evasion attacks, known as Adobe Small Web Format - SWF) format, has been studied considerably less from the perspective of ad- versarial attacks (in particular, against white- and black- ∗Corresponding author box attacks). One of the reasons for such a lack of study is Email addresses: [email protected] (Davide Maiorca), that, albeit Flash malware caused a massive uproar start- [email protected] (Ambra Demontis), ing from 2015 (due to the significant increment of vulnera- [email protected] (Battista Biggio), [email protected] (Fabio Roli) bilities for Adobe Flash Player), its technology is going to 1Those authors contributed equally to this manuscript. be discontinued in 2020. Nevertheless, this technology is Preprint submitted to Elsevier July 15, 2020 still active from the perspective of malware-based attacks. evasion attacks entirely mimic the feature values of benign Flash-based attacks represent excellent examples of highly data, thus becoming indistinguishable for the learning al- obfuscated and evasive infection vectors. Critical vulner- gorithm. We define this vulnerability in formal terms, and abilities are still released for the Flash platform, and very quantitatively evaluate it by defining a specific metric that dangerous attacks are still possible, such as the one that measures the extent to which the attack samples converge targeted North Korea in 2018 [24]. towards benign data. We believe not only that it will take years before Flash Our findings highlight a crucial problem that must be completely stops being used, but also that analyzing Flash considered when designing secure machine-learning sys- attacks constitutes an excellent test-bench for machine tems, i.e., that of evaluating in advance the vulnerability learning-based detection. The main reason for this claim is of the given features. Indeed, vulnerable information may that most infection vectors (including the most used ones compromise the whole system even if the employed deci- at the moment - e.g., PDF, Office documents, etc.) share sion function is robust. In this respect, we sketch possible a similar organization, composed of an external structure research directions that may lead one to design more se- and a code-based content. Research on such infection vec- cure machine learning-based malware detectors. tors showed that the analysis of their structure and/or Finally, we note that we publicly released FlashBuster, content led to encouraging results in their detection, and together with the features employed in the experiments for the Flash format is no exception in this [2, 25, 26]. Hence, this paper.2 the lessons learned from the study of adversarial attacks The rest of this paper is structured as follows. Section 2 on formats like Flash can provide precious insights into the provides the basics to understand the SWF format and an robustness of systems to detect other file formats. example of ActionScript code. Section 3 describes the re- In this paper, we provide an in-depth analysis of lated work in the field. Section 4 describes the architecture Flash-based learning systems, focusing on their robustness of FlashBuster. Section 5 describes the threat model in re- against adversarial attacks. Our analysis aims to give in- lation to adversarial environments. Section 6 describes the teresting insights into how to design more secure systems evasion attacks employed in this paper. Section 7 discusses for detecting infection vectors, and to point out possible the vulnerabilities that affect learning-based systems, and vulnerabilities in the feature representations and the clas- introduces a quantitative measure of feature and learning sification algorithms. vulnerabilities. Section 8 provides the experimental eval- To this end, we first propose a representative system uation. Section 9 provides a discussion on the attained for Flash malware detection, named FlashBuster. It is a results. Section 10 closes the paper by sketching possible static machine-learning system that employs information future work. extracted by both the structure and the content of SWF files. Such an approach allows for a more comprehensive 2. ShockWave Flash File Format assessment of the extracted static information by repre- senting and combining the contents employed by previous Small Web Format (SWF) is a file type that efficiently state-of-the-art systems [25, 26]. We show that Flash- delivers multimedia contents, and it is processed by Adobe Buster could detect the majority of malware samples in Software such as Adobe Flash Player.3 the wild, by obtaining comparable performances to other SWF files are composed of three essential elements: (i) a systems at state of the art, and demonstrate that it can header that describes important file properties such as the predict previously unseen attacks. We also tested Flash- presence of compression, the version of the SWF format, Buster against popular obfuscation techniques, showing and the number of video frames; (ii) a list of tags, i.e., that our approach could also be employed in the presence data structures that establish and control the operations of obfuscated malware. performed by the reader on the file data; (iii) a unique tag We then evaluate FlashBuster robustness by simulat- called End that terminates the file. ing evasion attacks that leverage the knowledge that the