Measuring Relative Accuracy of Malware Detectors in the Absence of Ground Truth
Total Page:16
File Type:pdf, Size:1020Kb
Milcom 2018 Track 3 - Cyber Security and Trusted Computing Measuring Relative Accuracy of Malware Detectors in the Absence of Ground Truth John Charlton Pang Du Jin-Hee Cho Shouhuai Xu Depart. of Computer Science Depart. of Statistics Depart. of Computer Science Depart. of Computer Science UT San Antonio Virginia Tech Virginia Tech UT San Antonio Abstract—In this paper, we measure the relative accuracy of assumptions that were made in [4]. Although the relative accu- malware detectors in the absence of ground truth regarding the racy of malware detectors is weaker than the absolute accuracy, quality of malware detectors (i.e., the detection accuracy) or the it is still useful in regards to comparing malware detectors. class of sample files, (i.e., malicious or benign). In particular, we are interested in measuring the ordinal scale of malware Therefore, this paper aims at answering the following research detectors in the absence of the ground truth of their actual question: “How can we rank the accuracy of malware detectors detection quality. To this end, we propose an algorithm to in the absence of ground truth?” estimate the relative accuracy of the malware detectors. Based This work has the following key contributions: (i) This on synthetic data with known ground truth, we characterize study offers a method to formulate how to estimate the when the proposed algorithm leads to accurately estimating the relative accuracy of the malware detectors. We show the relative accuracy of malware detectors. This method can be measured relative accuracy of real-world malware detectors using used when one needs to choose one malware detector over our proposed algorithm based on a real dataset consisting of 10.7 others; and (ii) The proposed algorithm measuring the relative million files and 62 malware detectors, obtained from VirusTotal. detection accuracy of a malware detector is validated based on a real world malware dataset consisting of 62 detectors, given Index Terms—Malware detection, security metrics, security measurement, ground truth, estimation, statistical estimators. synthetic data with known ground truth values. The rest of the paper is organized as follows. Section II pro- vides the overview of the related state-of-the-art approaches. I. INTRODUCTION Section III presents a problem statement and our proposed Measuring security metrics is a vital but challenging open methodology to solve the given problem. Section IV describes research problem that has not been well addressed. The major the experimental setup and results, with the discussion of key problems in measuring security metrics are two-fold: (1) what findings. Section V concludes the paper and suggests future to measure, which questions how to define new, useful security research directions. metrics; and (2) how to measure, which asks how to devise II. RELATED WORK new methods to measure security metrics. In this work, we are interested in answering the latter question, how to measure a In practice, obtaining ground truth is highly challenging and security metric where the ground truth does not exist for the almost not feasible due to high noise, uncertain data, and/or detection accuracy of a malware detector as well as for the the inherent nature of imperfect estimation. For example, class (i.e., malicious or benign) of the files. when using machine learning techniques to train cyber defense When measuring the quality of malware detectors, many models, we need to know the ground truth labels of training methods have been used based on certain heuristics such samples. For a small set of samples, we can use human experts as using the labels of a few malware detectors as ground to grade them and derive their ground truth. However, even truth [1, 2, 3]. These heuristic-based approaches are trou- in this case, the perfect evaluation is not guaranteed because blesome because of the well-known fact that each malware humans are also error-prone and can make mistakes due to detector has a different quality of detection accuracy. Although their inherent cognitive limitations. Accordingly, for a large some methods have been proposed to answer how to measure set of samples, it is obviously not feasible for human experts security metrics [4, 5], measuring the relative accuracy of to derive the ground truth. Therefore, it is true that many third- malware detectors has not been addressed in existing works. party datasets (e.g., blacklisted websites [6, 7, 8, 9]) are not In particular, this work is inspired by our prior work [4] necessarily trustworthy due to these inherent limitations. which measured the quality of malware detectors assuming Several studies have been conducted in order to investigate that a voting-based estimation of detection accuracy is true. the accuracy of malware detectors when there exists no ground Unlike [4], this work aims to estimate the relative accuracy truth of their accuracy [4, 5, 10, 11]. These studies used of malware detectors, which are obtained without making the different assumptions in order to estimate the accuracy of the malware detectors. Kantchelian et al. [5] used the na¨ıve This work is done when Jin-Hee Cho was with US Army Research Bayesian method and treated the unknown ground truth labels Laboratory. as hidden variables, while the Expectation-Maximization (EM) 978-1-5386-7185-6/18/$31.00 ©2018 IEEE 450 Milcom 2018 Track 3 - Cyber Security and Trusted Computing method [10, 11] is used to estimate the accuracy of malware more interested in knowing which detector is more accurate detectors using well-known metrics such as false-positive rate, than others, leading to generating the ranks of examined false-negative rate, or accuracy. In [5], the authors assumed the detectors. Based on our proposed methodology, we obtain homogeneity of false positives in all detectors, an independent their respective relative accuracy as T1 = 100%, T2 = 90%, and decision of each detector, and low false positives but high false T3 = 70%, which gives the performance of relative accuracy of negatives assumed for all detectors, and so forth. However, the detectors: D1 > D2 > D3. However, the relative accuracy these assumptions should be removed in order to reflect real does not approximate the true accuracy. Moreover, when we world applications. Xu et al. [4] used a frequentist approach consider a set of files scanned by detectors, D1, D2, D3, to design a statistical metric estimator to measure the quality and D4, letting the true accuracy of D1-D3 remain the same metrics of malware detectors with only two of the four as- while the true accuracy of D4 is 95%, then the resulting sumptions made in [5]. All the above works [4, 5, 10, 11] make relative accuracy may be T1 = 80%, T2 = 60%, T3 = 50%, and certain assumptions. In contrast, the present paper investigates T4 = 100%. This leads to the relative accuracy of the detectors the relative accuracy of malware detectors without making being D4 > D1 > D2 > D3. This is because the detector with those assumptions. the highest relative accuracy is always normalized to have a The paper falls into the study of security metrics, which is relative accuracy of 100% and the relative accuracy is always an integral part of the Cybersecurity Dynamics framework [12, measured over a set of detectors. 13] and indeed one of the most fundamental open problems B. Methodology that have yet to be adequately tackled [14]. Recent advance- The basic underlying idea of estimating the relative accuracy ment in security metrics includes [14, 15, 16, 17, 18, 19]. For of malware detectors is to measure the similarity between each example, the effectiveness of firewalls and DMZs is studied in pair of detectors. To do so, we iteratively create the relative [18] and the effectiveness of enforcing network-wide software accuracy of the malware detectors, given that the initial relative diversity is investigated in [17]. Orthogonal to security metrics accuracy of each detector is set to 1, assuming that each research are the first-principle modeling of cybersecurity from detector is equally accurate. a holistic perspective [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] 1) Similarity Matrix: To measure the relative accuracy of and cybersecurity data analytics [31, 32, 33, 34, 35]. detectors, the concept of a similarity matrix is introduced to III. PROBLEM STATEMENT AND METHODOLOGY collectively represent the similarity between malware detectors according to their decisions in labeling files as benign or A. Definitions of Relative Accuracy of Malware Detectors malicious. In this matrix, denoted by S = (Sik), the i-th row Suppose that there are m files, denoted by F1;:::;Fj;:::;Fm corresponds to detector Di and the k-th column corresponds and n malware detectors, denoted by D1;:::;Di;:::;Dn. The to detector Dk, where 1 ≤ i;k ≤ n. Element Sik denotes the input dataset is represented by a matrix V = (Vi j)1≤i≤n;1≤ j≤m, similarity between detectors Di and Dk in terms of their which is defined where capabilities in detecting malware. Naturally, we require (i) 8 Sik = Ski because the similarity should be symmetric; and (ii) >1 if Di detects Fj as malicious, < Sii = 1 for any 1 ≤ i ≤ n. Intuitively, the similarity between Di Vi j = 0 if Di detects Fj as benign, and Dk, Sik, is defined by the ratio of the number of decisions :> −1 if Di did not scan Fj. where Di and Dk agree with each other over the total number of files scanned by both detectors, D and D . Vector V = (V ;:::;V ;:::;V ), where 1 ≤ i ≤ n, represents i k i i1 i j im To clearly define S in a modular fashion, two auxiliary the outcome of using detector D to label m files. ik i matrices are defined: the agreement matrix, denoted by A = With respect to a set of n detectors, the relative accuracy of (A ) , and the count matrix, denoted by C = (C ) .