Identification of Security Related Bug Reports Via Text Mining Using

2018 IEEE International Conference on Software Quality, Reliability and Security Identification of Security related Bug Reports via Text Mining using Supervised and Unsupervised Classification Katerina Goseva-Popstojanova and Jacob Tyo Lane Department of Computer Science and Electrical Engineering West Virginia University, Morgantown, WV, USA Email: [email protected] Abstract—While many prior works used text mining for been used in the past to identify duplicates [1], classify automating different tasks related to software bug reports, few the severity levels [2], assign bugs to the most suitable works considered the security aspects. This paper is focused on development team [3], classify different types of bugs (i.e., automated classification of software bug reports to security and not-security related, using both supervised and unsupervised standard, function, GUI, and logic) [4], and topic modeling approaches. For both approaches, three types of feature vectors to extract trends in testing and operational failures [5]. Only are used. For supervised learning, we experiment with multiple several related works were focused on using text-based pre- classifiers and training sets with different sizes. Furthermore, diction models to automatically classify software bug reports we propose a novel unsupervised approach based on anomaly to security related and non-security related [6], [7], and detection. The evaluation is based on three NASA datasets. The results showed that supervised classification is affected [8]. Prediction models used in these works were based on more by the learning algorithms than by feature vectors and supervised machine learning algorithms that require labeled training only on 25% of the data provides as good results bug reports for training. Each of these works used only as training on 90% of the data. The supervised learning one type of feature vector and ten fold cross validation for slightly outperforms the unsupervised learning, at the expense prediction; none experimented with the size of the training of labeling the training set. In general, datasets with more security information lead to better performance. set and its effect on the classification performance. In this paper we propose both a supervised approach Keywords -software vulnerability; security bug reports; clas- and unsupervised approach that can be used by security sification; supervised learning; unsupervised learning; anomaly detection. engineers to quickly and accurately identify security bug reports. Specifically, for both approaches we use three types of feature vectors: Binary Bag-of-Words Frequency (BF), Term I. INTRODUCTION Frequency (TF), and Term Frequency-Inverse Document Issue tracking systems are used by software projects to Frequency (TF-IDF). For the supervised approach, we ex- record and follow the progress of every issue that developers, periment with multiple algorithms (i.e., Bayesian Network, testing personnel and/or software system users identify. k-Nearest Neighbor, Naive Bayes, Naive Bayes Multinomial, Issues may belong to multiple categories, such as software Random Forest, and Support Vector Machine), each in bugs, improvements, and new functionality. In this paper, combination with the three types of feature vectors. Unlike we are focused on software bugs reports (as a subset of the related works [6],[7], and [8], we use training sets with software issues) with a goal to automatically identify those different sizes to determine the smallest size of the training software bugs reports that are security related, that is, are set that produces good classification results. This aspect of related to security vulnerabilities that could be exploited our work has a practical value because the manual labeling by attackers to compromise any aspect of cybersecurity of the bug reports in the training set is a tedious and time (i.e., confidentiality, integrity, availability, authentication, consuming process. Furthermore, we propose, for the first authorization, and non-repudiation). time, an unsupervised approach for identification of security As the numbers of software vulnerabilities and cybersecu- bug reports. This novel approach is based on the concept rity threats increase, it is becoming more difficult to classify of anomaly detection and does not require a labeled training bug reports manually. In addition to the high level of human set. Specifically, we approached the classification problem as effort needed, manual classification requires bug reporters one-class classification, and classified bug reports similar to to have security domain knowledge, which is not always the descriptions of vulnerability classes from the Common the case. Therefore, there is a strong need for effective Weakness and Enumeration (CWE) view CWE-888 [9], [10] automated approaches that would reduce the amount of as security related. human effort and expertise required for identification of We evaluate the proposed supervised and unsupervised security related bug reports in large issue tracking systems. approaches on data extracted from the issue tracking systems Software bug reports contain title, description and other of two NASA missions. These data were organized in textual fields, and therefore text mining can be used for three datasets: Ground mission IV&V issues, Flight mission automating different tasks related to software bug reports. IV&V issues, and Flight mission Developers issues. We used For example, text mining of software bug reports have these three datasets in our previous work [11] to study the 978-1-5386-7757-5/18/$31.00 ©2018 IEEE 344 DOI 10.1109/QRS.2018.00047 profiles of the security related bugs reports based on the proposed supervised and unsupervised learning approaches, manual classification of each bug report to one of the twenty and the metrics used for evaluation of the performance. The one primary vulnerability classes from CWE-888 [10]. In datasets and the manual labeling process used as ground this paper we use the manual classification from our previous truth for evaluation of the learning performance are de- work [11] as labels for the training sets in the case of scribed in section IV. The results of the supervised and supervised learning and as ground truth for evaluation of unsupervised learning and their comparison are detailed in both the supervised and unsupervised learning approaches. section V, followed by the description of the threats to Specifically, we address the following research questions: validity in section VI. The paper is concluded in section VII. RQ1: Can supervised machine learning algorithms be used to II. RELATED WORK successfully classify software bug reports as security related or non-security related? Issue tracking systems contain unstructured text, and RQ1a: Do some feature vectors lead to better classifica- therefore text mining can be used to automatically process tion performance than other? data from such systems. Multiple papers applied text mining RQ1b: Do some learning algorithms perform consistently approaches on bug reports, and were focused on different better than other? aspects such as identification of duplicates [1], classification RQ1c: How much data must be set aside for training in of severity level [2], assignment of bugs to the most suitable order to produce good classification results? development team [3], classification of issues to bugs and RQ2: Can unsupervised machine learning be used to classify other activities [12], [13], classification to different types software issues as security related or non-security re- of bugs (i.e., standard, function, GUI, and logic) [4], and lated? topic modeling to extract trends in testing and operational RQ3: How does the performance of supervised and unsu- failures [5]. None of these works considered security aspects pervised machine learning algorithms compare when of software bugs. classifying software bug reports? Several works treated the source code as textual document and used text mining to classify the software units (e.g., The main findings of our work include: files or components) as vulnerable [14], [15]. Hovsepyan • Multiple learning systems, consisting of different com- et al. extracted feature vectors that contained the term binations of feature vectors and supervised learning frequencies (TF) from the source code and used SVM to algorithms, performed well. The level of performance, classify which files contain vulnerabilities [14]. The dataset however, does depend on the dataset. used in that work was the source code of the K9 mail client – Feature vectors do not affect significantly the clas- for Android mobile device applications. The static code sification performance. analysis tool Fortify [16] was used to label the source code – Some learning algorithms performed better than vulnerabilities and the following classification performance others, but the best performing algorithm was metrics were reported: recall of 88%, precision of 85%, different depending not only on the feature vector, and accuracy of 87%. Note that these performance metrics but also on the dataset. In general, the Naive Bayes were not with respect to the true class, but were based on algorithm performed consistently well, among or comparison with the labels assigned by Fortify. However, it close to the best performing algorithms across all is known that static code analysis tools do not detect 100% feature vectors and datasets. of vulnerabilities and have a very high false positive rate – The supervised classification was just as good with [17]. In

Identification of Security Related Bug Reports Via Text Mining Using

BUGS in the SYSTEM a Primer on the Software Vulnerability Ecosystem and Its Policy Implications

How to Analyze the Cyber Threat from Drones

The CLASP Application Security Process

Threats and Vulnerabilities in Federation Protocols and Products

Fuzzing with Code Fragments

Paul Youn from Isec Partners

Designing New Operating Primitives to Improve Fuzzing Performance

Empirical Analysis and Automated Classification of Security Bug Reports

Functional and Non Functional Requirements

Software Bug Bounties and Legal Risks to Security Researchers Robin Hamper

Detecting Lacking-Recheck Bugs in OS Kernels

Identifying Software and Protocol Vulnerabilities in WPA2 Implementations Through Fuzzing