Data-Driven Insights from Vulnerability Discovery Metrics
International Workshop on Data-Driven Decisions, Experimentation and Evolution (DDrEE)
May 27, 2019

Nuthan Munaiah, PhD Candidate
Andrew Meneely, Associate Professor
Department of Software Engineering, Rochester Institute of Technology, Rochester, NY

Motivation

| Security matters.
| "… payouts for … exploits range from $5,000 to $1,500,000 …" [1]
| Exploits as (cyber)weapons [2]
| Developers must defend against innovative attacks
| Security as an integral part of the development lifecycle
  | Leverage processes, tools, and techniques
  | Inculcate an attacker mindset

[1] ZERODIUM. How to Sell Your 0Day Exploit to ZERODIUM.
[2] S. Collins and S. McCombie. 2012. Stuxnet: The Emergence of a New Cyber Weapon and its Implications. Journal of Policing, Intelligence and Counter Terrorism.
Motivation 3

| Metrics empirically validated using historical vulnerabilities
| Metrics can ...
  | ... help discover vulnerabilities
  | ... reveal engineering failures that may have led to vulnerabilities
| Numerous metrics exist [1] but their use has been limited [2]
| Challenges: granularity, effectiveness, actionability, and usability

[1] Morrison, P., Moye, D., Pandita, R., & Williams, L. 2018. Mapping the Field of Software Life Cycle Security Metrics. Information and Software Technology.
[2] Morrison, P., Herzig, K., Murphy, B., & Williams, L. 2015. Challenges with Applying Vulnerability Prediction Models. Symposium and Bootcamp on the Science of Security.
Motivation 4

[Figure: the conventional pipeline, in which metrics are measured from a project and fed to a model that classifies files as vulnerable or neutral.]
Motivation 5

| Interpretation
  | What are the metrics telling us?
  | What should we ask developers to do?
| Example
  | Dependency [1]: Why does #include<foo.h> make foo.c vulnerable?
  | Churn [2]: Why does high churn make foo.c vulnerable?
| Metrics as more than mere explanatory variables in a model
| Metrics as agents of feedback

[1] Neuhaus et al. Predicting Vulnerable Software Components. CCS '07.
[2] Zimmermann et al. Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista. ICST '10.
Motivation 6

[Figure: the same pipeline repeated: project, metrics, measurements, model, vulnerable/neutral classification.]
Motivation 7

[Figure: the envisioned pipeline, in which metrics and measurements from the project feed a model, and feedback templates drive feedback generation that delivers security feedback to developers and reviewers as review comments.]
Motivation 8

Vision, Goal, and Questions

Vision
Assist developers in engineering secure software by providing a technique that generates scientific, interpretable, and actionable feedback on security as the software evolves.

Questions
| When to show the feedback?
| What should the feedback contain?
| Where should the feedback be shown?
Vision, Goal, and Questions 10

Goal
Propose an approach to generate natural language feedback on security through the interpretation of vulnerability discovery metrics.

Research Questions
| Generalizability: Are vulnerability discovery metrics similarly distributed across projects?
| Thresholds: Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?
Vision, Goal, and Questions 11

Dataset

Metrics that we collected and analyzed in our study
| We collected ten empirically validated metrics from the literature
  | Continuous-/discrete-valued: Churn, Collaboration Centrality, Complexity, Contribution Centrality, Nesting, Source Lines of Code, # Inputs, # Outputs, # Paths
  | Boolean-valued: Known Offender
| Implemented as Docker containers to ease dissemination
| Churn and Collaboration Centrality used as exemplars
Dataset | Metrics 14

| Churn is a measure of the amount of change to a file (a sketch of aggregating the output below into per-file churn appears further down)

  $ git log --no-merges --no-renames --numstat --pretty=%H
  ...
  9fc98e1974efa18497673aed79346e79227a84c5
  1       0       chrome/test/data/webui/settings/cr_settings_browsertest.js
  ...

| Collaboration centrality is a measure of diversity in perspective
  [Figure: a contribution network linking developers (Alex, Ahmed, Cody, Dan, Colin, Congyue) to the files they changed (baz.cc, maz.hh, scr.c, foo.c, bar.c).]
Dataset | Metrics 15

Projects from which the metrics were collected in our study
| Web Browser (C/C++): Google Chrome, Mozilla Firefox
| Operating System (C/C++): Linux, OpenBSD
| Application Server (Java): Apache Tomcat, WildFly
| Large, mature, open source, and prolific history
Dataset | Projects 17

Summary of the dataset used in the analysis

  Domain              Language  Project  # Files  SLOC*
  Web Browser         C/C++     Chrome   930,265  9,054,450
                                Firefox  509,221  6,977,203
  Operating System    C/C++     Linux    110,299  13,101,179
                                OpenBSD  142,337  9,147,222
  Application Server  Java      Tomcat   6,038    326,748
                                WildFly  38,166   524,240

  * Across all programming languages
Dataset | Summary 19

Results

Generalizability: Are vulnerability discovery metrics similarly distributed across projects?

| Role of domain and language
| Analyses
  | Violin Plots
  | Kruskal–Wallis, Mann–Whitney–Wilcoxon, and Cliff's δ (a sketch of these tests appears further down)
| We use churn and collaboration as exemplars
Results | Generalizability 22

| Churn appears similarly distributed but collaboration does not
| We must, however, quantify the assessment
Results | Generalizability 23

| Kruskal–Wallis: no similarly distributed metric (α = 2.78E-03)

  Cliff's δ (Effect)
  Dimension  X                   Y                   Churn       Collaboration
  Domain     Web Browser         Operating System    0.2148 (S)  0.1881 (S)
             Web Browser         Application Server  0.1928 (S)  0.3062 (S)
             Application Server  Operating System    0.0667 (N)  0.3110 (S)
  Language   C/C++               Java                0.1497 (S)  0.3078 (S)
  Project    Chrome              Firefox             0.0610 (N)  0.1043 (N)
             Linux               OpenBSD             0.2056 (S)  0.9915 (L)
             Tomcat              WildFly             0.1153 (N)  0.9955 (L)

  Effect: (N) δ < 0.147, (S) 0.147 ≤ δ < 0.33, (L) δ > 0.474
Results | Generalizability 24

Generalizability: Are vulnerability discovery metrics similarly distributed across projects?
All metrics, except collaboration, are generalizable (i.e., have similar distributions) across the projects considered in our study, irrespective of domain and language.
Results | Generalizability 25

Thresholds: Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?

| When does a metric foreshadow a problem (i.e., a vulnerability)?
| Unsupervised approach proposed by Alves et al. [1]
| Delineate risk levels (low, medium, high, and critical) using thresholds
| Analyses
  | % Historically-vulnerable Files Covered
  | Odds and Change in Odds

[1] Alves, T. L., Ypma, C., & Visser, J. 2010. Deriving Metric Thresholds from Benchmark Data. International Conference on Software Maintenance.
Results | Thresholds 27
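As a companion to the git log --numstat excerpt on the Dataset | Metrics slide above, here is a minimal sketch of how per-file churn could be aggregated from that output. It assumes churn is the sum of added and deleted lines per file across non-merge, non-rename commits; the slides do not spell out the exact aggregation, and the collect_churn helper is hypothetical.

```python
import subprocess
from collections import defaultdict

def collect_churn(repo_path):
    """Aggregate per-file churn from `git log --numstat` output.

    Assumption: churn is the sum of added and deleted lines across all
    non-merge, non-rename commits; the study's definition may differ.
    """
    output = subprocess.run(
        ["git", "-C", repo_path, "log", "--no-merges", "--no-renames",
         "--numstat", "--pretty=%H"],
        capture_output=True, text=True, check=True,
    ).stdout

    churn = defaultdict(int)
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # commit hashes and blank lines are not numstat rows
        added, deleted, path = fields
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        churn[path] += int(added) + int(deleted)
    return churn

if __name__ == "__main__":
    # Print the ten highest-churn files of the repository in the current directory.
    for path, value in sorted(collect_churn(".").items(), key=lambda kv: -kv[1])[:10]:
        print(f"{value:8d}  {path}")
```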
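The generalizability analysis above relies on Kruskal–Wallis, Mann–Whitney–Wilcoxon, and Cliff's δ. The sketch below shows one way those comparisons could be run with SciPy; the group names, sample values, the α carried over from the slide, and the naive pairwise Cliff's δ are illustrative assumptions, not the study's data or code.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (naive O(n*m))."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def effect_label(delta):
    """Conventional magnitude labels for Cliff's delta."""
    d = abs(delta)
    if d < 0.147:
        return "N"  # negligible
    if d < 0.33:
        return "S"  # small
    if d < 0.474:
        return "M"  # medium
    return "L"      # large

def compare_groups(groups, alpha=2.78e-03):
    """groups: dict mapping a group (e.g., a project) to its metric values."""
    stat, p = kruskal(*groups.values())
    verdict = "no evidence of difference" if p >= alpha else "distributions differ"
    print(f"Kruskal-Wallis: H = {stat:.2f}, p = {p:.3g} ({verdict} at alpha = {alpha})")
    for (name_x, xs), (name_y, ys) in combinations(groups.items(), 2):
        _, p_pair = mannwhitneyu(xs, ys, alternative="two-sided")
        delta = cliffs_delta(xs, ys)
        print(f"{name_x} vs {name_y}: p = {p_pair:.3g}, "
              f"delta = {delta:.4f} ({effect_label(delta)})")

if __name__ == "__main__":
    # Hypothetical churn values standing in for real per-file measurements.
    compare_groups({
        "Chrome":  [3, 10, 42, 7, 120, 15],
        "Firefox": [5, 9, 40, 8, 100, 12],
        "Linux":   [2, 30, 55, 11, 90, 60],
    })
```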
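The slides that follow walk through the threshold derivation on a toy churn example. The sketch below condenses that walkthrough into code: normalize file weights within each project, average the per-value percentages across projects, accumulate them, and read off the metric values at the 70/80/90 percent marks to delineate risk levels. Anything beyond what the walkthrough shows, such as the choice of weight, is an assumption rather than a faithful reimplementation of Alves et al.'s six-step process.

```python
from collections import defaultdict

# (project, file, metric value, weight); the toy numbers from the walkthrough below.
measurements = [
    ("Chrome", "foo.c", 8, 20), ("Chrome", "bar.c", 15, 5),
    ("Firefox", "baz.cc", 20, 20), ("Firefox", "cat.h", 8, 50),
    ("Firefox", "maz.hh", 15, 3), ("Firefox", "mox.cc", 25, 19),
]

def derive_thresholds(rows, cuts=(70.0, 80.0, 90.0)):
    # Normalize weights within each project so large projects do not dominate.
    totals = defaultdict(float)
    for project, _, _, weight in rows:
        totals[project] += weight
    per_project = defaultdict(lambda: defaultdict(float))  # project -> value -> % weight
    for project, _, value, weight in rows:
        per_project[project][value] += 100.0 * weight / totals[project]

    # Average the per-value percentages across projects.
    projects = list(per_project)
    values = sorted({v for by_value in per_project.values() for v in by_value})
    mean_pct = {v: sum(per_project[p].get(v, 0.0) for p in projects) / len(projects)
                for v in values}

    # Accumulate and take the smallest value at or above each percentile cut.
    thresholds, cumulative = {}, 0.0
    for v in values:
        cumulative += mean_pct[v]
        for cut in cuts:
            if cut not in thresholds and cumulative >= cut:
                thresholds[cut] = v
    return thresholds

def risk_level(value, thresholds):
    """Map a metric value to a risk level using the derived thresholds."""
    if value >= thresholds[90.0]:
        return "critical"
    if value >= thresholds[80.0]:
        return "high"
    if value >= thresholds[70.0]:
        return "medium"
    return "low"

if __name__ == "__main__":
    thresholds = derive_thresholds(measurements)
    print(thresholds)                  # {70.0: 15, 80.0: 20, 90.0: 25}
    print(risk_level(18, thresholds))  # medium
```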
Deriving Metric Thresholds from Benchmark Data [1]
| Unsupervised approach based solely on the distribution of metric values
| Six-step process to determine thresholds

  Project  File    Churn  Weight
  Chrome   foo.c   8      20
  Chrome   bar.c   15     5
  Firefox  baz.cc  20     20
  Firefox  cat.h   8      50
  Firefox  maz.hh  15     3
  Firefox  mox.cc  25     19

[1] Alves, T. L., Ypma, C., & Visser, J. 2010. Deriving Metric Thresholds from Benchmark Data. International Conference on Software Maintenance.
Results | Thresholds 28

Weights are first normalized within each project:

  Project  File    Churn  Weight  Project Total  % Weight
  Chrome   foo.c   8      20      25             80.00%
  Chrome   bar.c   15     5       25             20.00%
  Firefox  baz.cc  20     20      92             21.74%
  Firefox  cat.h   8      50      92             54.35%
  Firefox  maz.hh  15     3       92             3.26%
  Firefox  mox.cc  25     19      92             20.65%
Results | Thresholds 29

The per-value percentages are then averaged across projects and accumulated:

  Churn  % Weight (Chrome)  % Weight (Firefox)  Mean % Weight                   Cumulative
  8      80.00%             54.35%              (54.35 + 80.00) / 2 = 67.17%    67.17%
  15     20.00%             3.26%               (3.26 + 20.00) / 2 = 11.63%     78.80%
  20                        21.74%              21.74 / 2 = 10.87%              89.67%
  25                        20.65%              20.65 / 2 = 10.33%              100.00%
Results | Thresholds 30

| Risk Levels
  | Low       metric < 70%
  | Medium    70% ≤ metric < 80%
  | High      80% ≤ metric < 90%
  | Critical  metric ≥ 90%

  Churn  Cumulative  Risk      Range
  8      67.17%      Low       metric < 15
  15     78.80%      Medium    15 ≤ metric < 20
  20     89.67%      High      20 ≤ metric < 25
  25     100.00%     Critical  metric ≥ 25
Results | Thresholds 31

| Thresholds
  | 3,403 at 70%
  | 5,682 at 80%
  | 12,005 at 90%
Results | Thresholds 32

| Known offender metric from Chrome
| % Historically-vulnerable Files Covered

  Risk level  Range               % Covered
  Medium      3,403 ≤ m < 5,682   7.68%
  High        5,682 ≤ m < 12,005  8.39%
  Critical    m ≥ 12,005          5.85%
  Aggregate                       21.92%

| Odds and Change in Odds (∆) (a sketch of this computation appears at the end of the document)

  Risk level  Odds      ∆Low     ∆Medium  ∆High
  Medium      4.57E-02  13.9862
  High        7.96E-02  24.3699  1.7424
  Critical    1.12E-01  34.3548           1.4097
Results | Thresholds 33

Thresholds: Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?
On average, non-trivial risk levels delineated by thresholds of generalizable vulnerability discovery metrics captured 23.85% of the historically vulnerable files in Chrome, providing support for the effectiveness of the thresholds in classifying risk from vulnerabilities.
Results | Thresholds 34

Insights
https://chromium-review.googlesource.com/c/chromium/src/+/1552152

We believe all files modified in this change are at a higher risk of having undiscovered vulnerabilities. These suspect files are presented below along with the evidence that was used to support the assessment.
• chrome/browser/about_flags.cc - The file has been changed a lot (churn at 97th percentile) by many developers who also changed many other files (contribution at 100th percentile) with these developers belonging to disparate developer groups (collaboration at 72nd percentile) and is hard to test exhaustively (nesting at 75th percentile).
• chrome/browser/android/chrome_feature_list.cc - The file has been changed by many developers who also changed
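The odds table on the Results | Thresholds 33 slide reports numbers without the underlying formula. The sketch below illustrates one computation that is consistent with the relationships among those numbers, assuming the odds of a risk level are the ratio of historically vulnerable to neutral files in that level and that a change in odds ∆X divides a level's odds by the odds of level X; both this interpretation and the file counts used here are assumptions, not the study's data.

```python
def odds(vulnerable, neutral):
    """Odds of vulnerability within one risk level: vulnerable files / neutral files."""
    return vulnerable / neutral

def change_in_odds(level_odds, baseline_odds):
    """Change in odds (delta) of one risk level relative to a baseline level."""
    return level_odds / baseline_odds

if __name__ == "__main__":
    # Hypothetical (vulnerable, neutral) file counts per risk level.
    counts = {"low": (50, 15000), "medium": (40, 900),
              "high": (35, 450), "critical": (30, 270)}
    level_odds = {name: odds(v, n) for name, (v, n) in counts.items()}
    for name in ("medium", "high", "critical"):
        delta_low = change_in_odds(level_odds[name], level_odds["low"])
        print(f"{name}: odds = {level_odds[name]:.2e}, delta vs. low = {delta_low:.4f}")
```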