Data-driven Insights from Vulnerability Discovery Metrics International Workshop on Data-Driven Decisions, Experimentation and Evolution (DDrEE) May 27, 2019
Nuthan Munaiah, PhD Candidate, [email protected]
Andrew Meneely, Associate Professor, [email protected]
Department of Software Engineering, Rochester Institute of Technology, Rochester, NY
Motivation
| Security matters.
| “… payouts for … exploits range from $5,000 to $1,500,000 …” [1]
| Exploits as (cyber)weapons [2]
| Developers must defend against innovative attacks
| Security as an integral part of the development lifecycle
| Leverage processes, tools, and techniques
| Inculcate an attacker mindset
[1] ZERODIUM – How to Sell Your 0Day Exploit to ZERODIUM
[2] S. Collins and S. McCombie. 2012. Stuxnet: The Emergence of a New Cyber Weapon and its Implications. Journal of Policing, Intelligence and Counter Terrorism
Motivation 3 | Metrics empirically validated using historical vulnerabilities | Metrics can ... | ... help discover vulnerabilities | ... reveal engineering failures that may have led to vulnerabilities | Numerous metrics exist [1] but their use has been limited [2] | Challenges: Granularity, effectiveness, actionability, and usability
[1] Morrison, P., Moye, D., Pandita, R., & Williams, L. 2018. Mapping the Field of Software Life Cycle Security Metrics. Information and Software Technology
[2] Morrison, P., Herzig, K., Murphy, B., & Williams, L. 2015. Challenges with Applying Vulnerability Prediction Models. Symposium and Bootcamp on the Science of Security
Motivation 4
[Diagram: for a Project, Metrics produce Measurements that feed a Model classifying files as Vulnerable or Neutral]
Motivation 5
| Interpretation
| What are the metrics telling us?
| What should we ask developers to do?
| Example
| Dependency [1]: Why does a particular #include make a file more likely to be vulnerable?
[1] Neuhaus et al. Predicting Vulnerable Software Components. CCS '07
[2] Zimmermann et al. Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista. ICST '10
Motivation 6
[Diagram repeated: Metrics → Measurements → Model → Vulnerable/Neutral]
Motivation 7
[Diagram: the pipeline extended with Feedback Generation — Metrics → Measurements → Model, whose output, together with Security Template(s), generates Feedback delivered to Developers and Reviewers as review Comments]
Motivation 8
Vision, Goal, and Questions
Vision
Assist developers in engineering secure software by providing a technique that generates scientific, interpretable, and actionable feedback on security as the software evolves
Questions | When to show the feedback? | What should the feedback contain? | Where should the feedback be shown?
Vision, Goal, and Questions 10 Goal
Propose an approach to generate natural language feedback on security through the interpretation of vulnerability discovery metrics
Research Questions | Generalizability Are vulnerability discovery metrics similarly distributed across projects? | Thresholds Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?
Vision, Goal, and Questions 11 Dataset Metrics that we collected and analyzed in our study
Dataset
| We collected ten metrics empirically validated in the literature
| Continuous-/discrete-valued: Churn, Collaboration Centrality, Complexity, Contribution Centrality, Nesting, Source Lines of Code, # Inputs, # Outputs, # Paths
| Boolean-valued: Known Offender
| Implemented as Docker containers to ease dissemination
| Churn and Collaboration Centrality used as exemplars
Dataset | Metrics 14
| Churn is a measure of the amount of change to a file
$ git log --no-merges --no-renames --numstat --pretty=%H
...
9fc98e1974efa18497673aed79346e79227a84c5
1       0       chrome/test/data/webui/settings/cr_settings_browsertest.js
...
| Collaboration centrality is a measure of the diversity of perspective among a file's contributors
[Figure: bipartite contribution graph linking developers (Alex, Ahmed, Cody, Dan, Colin, Congyue) to the files they changed (foo.c, bar.c, baz.cc, scr.c, maz.hh); collaboration centrality captures a file's position in this network]
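The churn measurement illustrated above can be prototyped by summing the added and deleted line counts that `git log --numstat` reports per file. A minimal sketch, parsing a sample abbreviated from the slide (the function name is ours):

```python
from collections import defaultdict

# Abbreviated sample of `git log --no-merges --no-renames --numstat --pretty=%H`
# output, taken from the slide: one commit hash, one numstat row.
sample = (
    "9fc98e1974efa18497673aed79346e79227a84c5\n"
    "1\t0\tchrome/test/data/webui/settings/cr_settings_browsertest.js\n"
)

def churn_from_numstat(log_text):
    """Sum added + deleted lines per file across all commits in the log."""
    churn = defaultdict(int)
    for line in log_text.splitlines():
        parts = line.split("\t")
        # numstat rows are "added<TAB>deleted<TAB>path"; binary files report
        # "-" for both counts, so skip anything non-numeric.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            churn[parts[2]] += int(parts[0]) + int(parts[1])
    return dict(churn)

print(churn_from_numstat(sample))
# {'chrome/test/data/webui/settings/cr_settings_browsertest.js': 1}
```

In practice the log text would come from running the `git log` command shown above in the repository being measured.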
Dataset | Metrics 15
Projects from which the metrics were collected in our study
Dataset
Web Browser (C/C++): Google Chrome, Mozilla Firefox
Operating System (C/C++): Linux, OpenBSD
Application Server (Java): Apache Tomcat, WildFly
| Large, mature, open-source projects with prolific histories
Dataset | Projects 17 Summary of the dataset used in the analysis
Dataset
Domain              Project  Language  # Files  SLOC*
Web Browser         Chrome   C/C++     930,265  9,054,450
Web Browser         Firefox  C/C++     509,221  6,977,203
Operating System    Linux    C/C++     110,299  13,101,179
Operating System    OpenBSD  C/C++     142,337  9,147,222
Application Server  Tomcat   Java      6,038    326,748
Application Server  WildFly  Java      38,166   524,240
* Across all programming languages
Dataset | Summary 19 Results Generalizability Are vulnerability discovery metrics similarly distributed across projects?
Results Are vulnerability discovery metrics similarly distributed across projects?
| Role of domain and language | Analyses | Violin Plots | Kruskal–Wallis, Mann–Whitney–Wilcoxon, and Cliff's δ | We use churn and collaboration as exemplars
Results | Generalizability 22 | Churn appears similarly distributed but collaboration does not | We must, however, quantify the assessment
Results | Generalizability 23 | Kruskal–Wallis: the null hypothesis of similarly-distributed metrics is rejected for every metric (α = 2.78E-03)
Cliff's δ (Effect)
Dimension  X                 Y                   Churn       Collaboration
Domain     Web Browser       Operating System    0.2148 (S)  0.1881 (S)
Domain     Web Browser       Application Server  0.1928 (S)  0.3062 (S)
Domain     Operating System  Application Server  0.0667 (N)  0.3110 (S)
Language   C/C++             Java                0.1497 (S)  0.3078 (S)
Project    Chrome            Firefox             0.0610 (N)  0.1043 (N)
Project    Linux             OpenBSD             0.2056 (S)  0.9915 (L)
Project    Tomcat            WildFly             0.1153 (N)  0.9955 (L)
Effect: (N) δ < 0.147, (S) 0.147 ≤ δ < 0.33, (M) 0.33 ≤ δ < 0.474, (L) δ ≥ 0.474
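Cliff's δ, the effect size used in the table, is straightforward to compute directly from the two samples; a minimal sketch, with the magnitude bins following the legend above:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def effect(delta):
    """Map |δ| to the conventional magnitude bins (Romano et al.)."""
    d = abs(delta)
    if d < 0.147:
        return "N"  # negligible
    if d < 0.33:
        return "S"  # small
    if d < 0.474:
        return "M"  # medium
    return "L"      # large

# Identical samples have zero delta; fully separated samples have |δ| = 1
print(cliffs_delta([1, 2, 3], [1, 2, 3]))  # 0.0
print(cliffs_delta([4, 5, 6], [1, 2, 3]))  # 1.0
```

This quadratic-time sketch is fine for illustration; for the large file-level samples in the study a rank-based formulation would be preferable.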
Results | Generalizability 24 Generalizability
Are vulnerability discovery metrics similarly distributed across projects?
Although the Kruskal–Wallis test rejects identical distributions, the effect sizes are mostly negligible to small: all metrics, except collaboration, are generalizable (i.e. have practically similar distributions) across the projects considered in our study, irrespective of domain and language.
Results | Generalizability 25 Thresholds Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?
Results Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?
| When does a metric foreshadow a problem (i.e. vulnerability)? | Unsupervised approach proposed by Alves et al. [1] | Delineate risk levels (low, medium, high, and critical) using thresholds | Analyses | % Historically-vulnerable Files Covered | Odds and Change in Odds Alves, T. L., Ypma, C., & Visser, J. 2010. Deriving Metric Thresholds from Benchmark Data. International Conference on Software Maintenance. [1]
Results | Thresholds 27 | Deriving Metric Thresholds from Benchmark Data [1] | Unsupervised approach based solely on the distribution of metric values | Six-step process to determine thresholds (in [1], files are weighted by their size)
Project  File    Churn  Weight
Chrome   foo.c   8      20
Chrome   bar.c   15     5
Firefox  baz.cc  20     20
Firefox  cat.h   8      50
Firefox  maz.hh  15     3
Firefox  mox.cc  25     19
Alves, T. L., Ypma, C., & Visser, J. 2010. Deriving Metric Thresholds from Benchmark Data. International Conference on Software Maintenance. [1]
Results | Thresholds 28
Project  File    Churn  Weight  % Weight  Project Total
Chrome   foo.c   8      20      80.00%    25
Chrome   bar.c   15     5       20.00%
Firefox  baz.cc  20     20      21.74%    92
Firefox  cat.h   8      50      54.35%
Firefox  maz.hh  15     3       3.26%
Firefox  mox.cc  25     19      20.65%
Results | Thresholds 29
Churn  Mean % Weight                   Cumulative
8      (80.00 + 54.35) / 2 = 67.17%    67.17%
15     (20.00 + 3.26) / 2 = 11.63%     78.80%
20     21.74 / 2 = 10.87%              89.67%
25     20.65 / 2 = 10.33%              100.00%
Results | Thresholds 30
| Risk Levels (percentiles of the cumulative weight distribution)
| Low: metric < 70%
| Medium: 70% ≤ metric < 80%
| High: 80% ≤ metric < 90%
| Critical: metric ≥ 90%
Churn  Cumulative  Risk Level  Range
8      67.17%      Low         metric < 15
15     78.80%      Medium      15 ≤ metric < 20
20     89.67%      High        20 ≤ metric < 25
25     100.00%     Critical    metric ≥ 25
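The six-step derivation walked through above can be reproduced end to end on the toy benchmark from the slides; this sketch (our own naming and structure, not the authors' implementation) recovers the thresholds 15, 20, and 25:

```python
from collections import defaultdict

# Toy benchmark from the slides: (project, file, churn, weight)
rows = [
    ("Chrome",  "foo.c",  8,  20),
    ("Chrome",  "bar.c",  15, 5),
    ("Firefox", "baz.cc", 20, 20),
    ("Firefox", "cat.h",  8,  50),
    ("Firefox", "maz.hh", 15, 3),
    ("Firefox", "mox.cc", 25, 19),
]

def alves_thresholds(rows, percentiles=(70, 80, 90)):
    # Steps 1-2: each file's weight ratio within its own project
    totals = defaultdict(float)
    for proj, _, _, w in rows:
        totals[proj] += w
    # Steps 3-4: sum weight ratios at each metric value per project,
    # then average across projects
    per_value = defaultdict(float)
    nprojects = len(totals)
    for proj, _, value, w in rows:
        per_value[value] += (w / totals[proj]) * 100 / nprojects
    # Step 5: cumulative distribution over ascending metric values
    cumulative, acc = [], 0.0
    for value in sorted(per_value):
        acc += per_value[value]
        cumulative.append((value, acc))
    # Step 6: threshold = smallest value whose cumulative % reaches p
    return {p: next(v for v, c in cumulative if c >= p) for p in percentiles}

print(alves_thresholds(rows))  # {70: 15, 80: 20, 90: 25}
```

The returned values delimit the Medium, High, and Critical risk levels shown in the table above.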
Results | Thresholds 31 | Thresholds | 3,403 at 70% | 5,682 at 80% | 12,005 at 90%
Results | Thresholds 32 | Known offender metric from Chrome | % Historically-vulnerable Files Covered
% Historically-vulnerable files covered:
Risk Level  Range               % Covered
Medium      3,403 ≤ m < 5,682   7.68%
High        5,682 ≤ m < 12,005  8.39%
Critical    m ≥ 12,005          5.85%
Aggregate                       21.92%
| Odds and Change in Odds (∆)
Risk Level  Odds      ∆Low     ∆Medium  ∆High
Medium      4.57E-02  13.9862  —        —
High        7.96E-02  24.3699  1.7424   —
Critical    1.12E-01  34.3548  —        1.4097
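The change-in-odds figures are ratios of the per-level odds, where the odds of a level are p / (1 − p) for p the proportion of vulnerable files in that level. A quick sketch checking two entries against the table (using the slide's reported odds, which are rounded to three significant digits, so the ratios differ slightly):

```python
def odds(p):
    """Odds corresponding to a probability/proportion p."""
    return p / (1.0 - p)

def change_in_odds(level_odds, base_odds):
    """Multiplicative change in odds moving between risk levels."""
    return level_odds / base_odds

# Reproduce two ∆ entries of the table from the reported per-level odds
medium, high, critical = 4.57e-2, 7.96e-2, 1.12e-1
print(round(change_in_odds(high, medium), 4))    # ≈ 1.7418 (table: 1.7424)
print(round(change_in_odds(critical, high), 4))  # ≈ 1.407  (table: 1.4097)
```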
Results | Thresholds 33 Thresholds
Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?
On average, the non-trivial risk levels delineated by thresholds of generalizable vulnerability discovery metrics captured 23.85% of the historically vulnerable files in Chrome, supporting the effectiveness of the thresholds in classifying risk from vulnerabilities.
Results | Thresholds 34 Insights https://chromium-review.googlesource.com/c/chromium/src/+/1552152
We believe all files modified in this change are at a higher risk of having undiscovered vulnerabilities. These suspect files are presented below along with the evidence that was used to support the assessment.
• chrome/browser/about_flags.cc - The file has been changed a lot (churn at 97th percentile) by many developers who also changed many other files (contribution at 100th percentile) with these developers belonging to disparate developer groups (collaboration at 72nd percentile) and is hard to test exhaustively (nesting at 75th percentile). • chrome/browser/android/chrome_feature_list.cc - The file has been changed by many developers who also changed many other files (contribution at 99th percentile) with these developers belonging to disparate developer groups (collaboration at 93rd percentile). • chrome/browser/android/chrome_feature_list.h - The file has been changed by many developers who also changed many other files (contribution at 90th percentile). • chrome/browser/flag_descriptions.cc - The file has been changed a lot (churn at 93rd percentile) by many developers who also changed many other files (contribution at 99th percentile). • chrome/browser/flag_descriptions.h - The file has been changed a lot (churn at 91st percentile) by many developers who also changed many other files (contribution at 99th percentile).
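The feedback above suggests a template-driven generator: each metric whose percentile for a file crosses a cutoff contributes a natural-language fragment to that file's bullet. A hypothetical sketch — the template wording and the 70th-percentile cutoff are our illustrative assumptions, not the authors' actual templates:

```python
# Hypothetical templates paraphrasing the feedback shown above; the exact
# phrasing and the cutoff are illustrative assumptions.
TEMPLATES = {
    "churn": "has been changed a lot (churn percentile: {p})",
    "contribution": ("has been changed by many developers who also changed "
                     "many other files (contribution percentile: {p})"),
    "collaboration": ("was changed by developers from disparate groups "
                      "(collaboration percentile: {p})"),
    "nesting": "is hard to test exhaustively (nesting percentile: {p})",
}

def feedback(path, percentiles, cutoff=70):
    """Build one bullet of review feedback from a file's metric percentiles."""
    reasons = [TEMPLATES[m].format(p=p)
               for m, p in percentiles.items()
               if m in TEMPLATES and p >= cutoff]
    if not reasons:
        return None  # nothing suspicious to report for this file
    return "- {} - The file {}.".format(path, ", and ".join(reasons))

print(feedback("chrome/browser/flag_descriptions.h",
               {"churn": 91, "contribution": 99}))
```

Composing one such bullet per suspect file yields a review comment in the shape shown above.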
Insights 36
Summary
| Humanize existing vulnerability discovery knowledge
[Diagram: Literature → Metrics → Feedback → Conversation → Secure Software]
| Feedback is central to realizing our vision
| When to show the feedback? Threshold-delineated risk to identify suspect files
| Ongoing Work
| What should the feedback contain?
| Where should the feedback be shown?
https://samaritan.github.io/ · GitHub: /samaritan
Summary 40