Data-driven Insights from Vulnerability Discovery Metrics International Workshop on Data-Driven Decisions, Experimentation and Evolution (DDrEE) May 27, 2019

Nuthan Munaiah, PhD Candidate ([email protected])
Andrew Meneely, Associate Professor ([email protected])
Department of Software Engineering, Rochester Institute of Technology, Rochester, NY

Motivation | Security matters. | "… payouts for … exploits range from $5,000 to $1,500,000 …" [1] | Exploits as (cyber)weapons [2] | Developers must defend against innovative attacks | Security as an integral part of the development lifecycle | Leverage processes, tools, and techniques | Inculcate an attacker mindset

[1] ZERODIUM – How to Sell Your 0Day Exploit to ZERODIUM
[2] S. Collins and S. McCombie. 2012. Stuxnet: The Emergence of a New Cyber Weapon and its Implications. Journal of Policing, Intelligence and Counter Terrorism

Motivation 3 | Metrics empirically validated using historical vulnerabilities | Metrics can ... | ... help discover vulnerabilities | ... reveal engineering failures that may have led to vulnerabilities | Numerous metrics exist [1] but their use has been limited [2] | Challenges: Granularity, effectiveness, actionability, and usability

[1] Morrison, P., Moye, D., Pandita, R., & Williams, L. 2018. Mapping the Field of Software Life Cycle Security Metrics. Information and Software Technology
[2] Morrison, P., Herzig, K., Murphy, B., & Williams, L. 2015. Challenges with Applying Vulnerability Prediction Models. Symposium and Bootcamp on the Science of Security

Motivation 4

[Figure: Project → Metrics → Measurements → Model → Vulnerable / Neutral]

Motivation 5 | Interpretation | What are the metrics telling us? | What should we ask developers to do? | Example | Dependency [1]: Why does #include make foo.c vulnerable? | Churn [2]: Why does high churn make foo.c vulnerable? | Metrics as more than mere explanatory variables in a model | Metrics as agents of feedback

Neuhaus et al. Predicting Vulnerable Software Components. CCS’07 [1] Zimmermann et al. Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista. ICST’10 [2]

Motivation 6

[Figure: Project → Metrics → Measurements → Model → Vulnerable / Neutral]

Motivation 7

[Figure: Project → Metrics → Measurements → Model → Feedback Generation (using security template(s)); feedback is delivered as comments to the developer and reviewers]

Motivation 8

Vision, Goal, and Questions

Vision

Assist developers in engineering secure software by providing a technique that generates scientific, interpretable, and actionable feedback on security as the software evolves

Questions | When to show the feedback? | What should the feedback contain? | Where should the feedback be shown?

Vision, Goal, and Questions 10 Goal

Propose an approach to generate natural language feedback on security through the interpretation of vulnerability discovery metrics

Research Questions | Generalizability Are vulnerability discovery metrics similarly distributed across projects? | Thresholds Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?

Vision, Goal, and Questions 11 Dataset Metrics that we collected and analyzed in our study

Dataset | We collected ten empirically-validated metrics from the literature | Continuous-/discrete-valued: Churn, Collaboration Centrality, Complexity, Contribution Centrality, Nesting, Source Lines of Code, # Inputs, # Outputs, # Paths | Boolean-valued: Known Offender | Implemented as Docker containers to ease dissemination | Churn and Collaboration Centrality used as exemplars

Dataset | Metrics 14 | Churn is a measure of the amount of change to a file

$ git log --no-merges --no-renames --numstat --pretty=%H ...
9fc98e1974efa18497673aed79346e79227a84c5
1	0	chrome/test/data/webui/settings/cr_settings_browsertest.js
...

| Collaboration centrality is a measure of the diversity in perspective among a file's contributors

[Figure: collaboration network linking developers (Alex, Ahmed, Cody, Dan, Colin, Congyue) to the files they changed (foo.c, bar.c, baz.cc, scr.c, maz.hh)]
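The churn measurement illustrated by the `git log` output above can be reproduced from plain git output. Below is a minimal sketch, assuming churn is the sum of lines added and deleted per file across a repository's history; `parse_numstat` is a hypothetical helper name, not part of the talk's metric containers. In practice the text would come from `git log --no-merges --no-renames --numstat --pretty=%H`.

```python
import re
from collections import defaultdict

def parse_numstat(log_text):
    """Sum churn (lines added + deleted) per file from `git log --numstat` text."""
    churn = defaultdict(int)
    for line in log_text.splitlines():
        # numstat lines look like "<added>\t<deleted>\t<path>"; commit hashes
        # (from --pretty=%H) and binary-file markers ("-\t-\t...") are skipped
        match = re.match(r"^(\d+)\t(\d+)\t(.+)$", line)
        if match:
            added, deleted, path = match.groups()
            churn[path] += int(added) + int(deleted)
    return dict(churn)
```

Running this over a project's full log yields one churn value per file, which is the shape of input the threshold analysis later in the talk operates on.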

Dataset | Metrics 15

Projects from which the metrics were collected in our study

Dataset

Web Browser (C/C++): Google Chrome, Mozilla Firefox
Operating System (C/C++): Linux, OpenBSD
Application Server (Java): Apache Tomcat, WildFly

| Large, mature, open-source projects with a prolific history

Dataset | Projects 17 Summary of the dataset used in the analysis

Dataset

Domain              Project  Language  # Files  SLOC*
Web Browser         Chrome   C/C++     930,265  9,054,450
Web Browser         Firefox  C/C++     509,221  6,977,203
Operating System    Linux    C/C++     110,299  13,101,179
Operating System    OpenBSD  C/C++     142,337  9,147,222
Application Server  Tomcat   Java      6,038    326,748
Application Server  WildFly  Java      38,166   524,240

* Across all programming languages

Dataset | Summary 19

Results

Generalizability: Are vulnerability discovery metrics similarly distributed across projects?

Results Are vulnerability discovery metrics similarly distributed across projects?

| Role of domain and language | Analyses | Violin Plots | Kruskal–Wallis, Mann–Whitney–Wilcoxon, and Cliff's δ | We use churn and collaboration as exemplars

Results | Generalizability 22 | Churn appears similarly distributed but collaboration does not | We must, however, quantify the assessment

Results | Generalizability 23 | Kruskal–Wallis: No similarly distributed metric (α = 2.78E-03)

Cliff's δ (Effect)

Dimension  X                   Y                   Churn       Collaboration
Domain     Web Browser         Operating System    0.2148 (S)  0.1881 (S)
Domain     Web Browser         Application Server  0.1928 (S)  0.3062 (S)
Domain     Application Server  Operating System    0.0667 (N)  0.3110 (S)
Language   C/C++               Java                0.1497 (S)  0.3078 (S)
Project    Chrome              Firefox             0.0610 (N)  0.1043 (N)
Project    Linux               OpenBSD             0.2056 (S)  0.9915 (L)
Project    Tomcat              WildFly             0.1153 (N)  0.9955 (L)

Effect: (N) δ < 0.147; (S) 0.147 ≤ δ < 0.33; (M) 0.33 ≤ δ < 0.474; (L) δ ≥ 0.474
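For readers unfamiliar with the effect size used in the table above, here is a minimal sketch of Cliff's δ and the conventional magnitude bands matching the slide's legend (the band cutoffs follow the thresholds commonly attributed to Romano et al.; the pairwise implementation is illustrative, not the study's actual analysis code):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all (x, y) pairs.

    O(n*m) pairwise version for clarity; fine for small samples.
    """
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def magnitude(delta):
    """Map |delta| to the (N)/(S)/(M)/(L) bands used in the table."""
    d = abs(delta)
    if d < 0.147:
        return "N"  # negligible
    if d < 0.33:
        return "S"  # small
    if d < 0.474:
        return "M"  # medium
    return "L"      # large
```

For example, `magnitude(0.9915)` returns "L", matching the Linux vs. OpenBSD collaboration entry.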

Results | Generalizability 24 Generalizability

Are vulnerability discovery metrics similarly distributed across projects?

All metrics, except collaboration, are generalizable (i.e. have similar distributions) across the projects considered in our study irrespective of domain and language.

Results | Generalizability 25 Thresholds Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?

Results Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?

| When does a metric indicate a problem (i.e., a vulnerability)? | Unsupervised approach proposed by Alves et al. [1] | Delineate risk levels (low, medium, high, and critical) using thresholds | Analyses | % Historically-vulnerable Files Covered | Odds and Change in Odds

[1] Alves, T. L., Ypma, C., & Visser, J. 2010. Deriving Metric Thresholds from Benchmark Data. International Conference on Software Maintenance

Results | Thresholds 27 | Deriving Metric Thresholds from Benchmark Data [1] | Unsupervised approach based solely on distribution of metric values | Six step process to determine thresholds

Project  File    Churn  Weight
Chrome   foo.c   8      20
Chrome   bar.c   15     5
Firefox  baz.cc  20     20
Firefox  cat.h   8      50
Firefox  maz.hh  15     3
Firefox  mox.cc  25     19


Results | Thresholds 28

Project  File    Churn  Weight  Total  % Weight
Chrome   foo.c   8      20      25     80.00%
Chrome   bar.c   15     5              20.00%
Firefox  baz.cc  20     20      92     21.74%
Firefox  cat.h   8      50             54.35%
Firefox  maz.hh  15     3              3.26%
Firefox  mox.cc  25     19             20.65%

Results | Thresholds 29

Churn  % Weight (Chrome)  % Weight (Firefox)  Mean % Weight                 Cumulative
8      80.00%             54.35%              (80.00 + 54.35) / 2 = 67.17%  67.17%
15     20.00%             3.26%               (20.00 + 3.26) / 2 = 11.63%   78.80%
20     0.00%              21.74%              21.74 / 2 = 10.87%            89.67%
25     0.00%              20.65%              20.65 / 2 = 10.33%            100.00%

Results | Thresholds 30 | Risk Levels | Low metric < 70% | Medium 70% ≤ metric < 80% | High 80% ≤ metric < 90% | Critical metric ≥ 90%

Churn  Cumulative  Risk      Range
8      67.17%      Low       metric < 15
15     78.80%      Medium    15 ≤ metric < 20
20     89.67%      High      20 ≤ metric < 25
25     100.00%     Critical  metric ≥ 25
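The worked churn example above can be sketched end to end. This is a minimal reading of the Alves et al. process, assuming each measurement is a (project, metric value, weight) triple with weight taken as a file's size; function and parameter names are mine, not the study's tooling:

```python
from collections import defaultdict

def alves_thresholds(measurements, quantiles=(0.70, 0.80, 0.90)):
    """Derive metric thresholds in the manner of Alves et al. (ICSM 2010).

    Weights are normalized within each project, averaged across projects
    per metric value, and accumulated over sorted metric values; a
    threshold is the smallest metric value whose cumulative percentage
    reaches a quantile.
    """
    # Total weight per project, so each entry's weight can be normalized
    totals = defaultdict(float)
    for project, _, weight in measurements:
        totals[project] += weight

    # Sum of normalized weight per (metric value, project)
    per_value = defaultdict(lambda: defaultdict(float))
    for project, value, weight in measurements:
        per_value[value][project] += weight / totals[project]

    # Mean normalized weight per metric value across all projects
    n_projects = len(totals)
    mean_weight = {
        value: sum(by_project.values()) / n_projects
        for value, by_project in per_value.items()
    }

    # Accumulate over sorted metric values and read off the quantiles
    thresholds, cumulative = {}, 0.0
    remaining = sorted(quantiles)
    for value in sorted(mean_weight):
        cumulative += mean_weight[value]
        while remaining and cumulative >= remaining[0] - 1e-9:
            thresholds[remaining.pop(0)] = value
    return thresholds
```

Feeding in the six rows of the worked example reproduces the risk boundaries in the table: 15 at the 70% quantile, 20 at 80%, and 25 at 90%.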

Results | Thresholds 31 | Thresholds | 3,403 at 70% | 5,682 at 80% | 12,005 at 90%

Results | Thresholds 32 | Known offender metric from Chrome | % Historically-vulnerable Files Covered

Level      Range               % Vulnerable Files Covered
Medium     3,403 ≤ m < 5,682   7.68%
High       5,682 ≤ m < 12,005  8.39%
Critical   m ≥ 12,005          5.85%
Aggregate                      21.92%

| Odds and Change in Odds (∆)

Level     Odds      ∆ vs Low  ∆ vs Medium  ∆ vs High
Medium    4.57E-02  13.9862
High      7.96E-02  24.3699   1.7424
Critical  1.12E-01  34.3548
Critical                                   1.4097
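The odds columns above follow directly from counts of vulnerable and neutral files per risk level. A minimal sketch under that assumption (the input shape and function names are illustrative, not the study's analysis code):

```python
from collections import Counter

def odds_by_level(files):
    """Vulnerability odds per risk level, and change in odds vs the Low level.

    `files` is a list of (risk_level, is_vulnerable) pairs; the odds in a
    level is (# vulnerable files) / (# neutral files). The slide's delta
    columns are ratios of these odds between two levels.
    """
    vuln, neut = Counter(), Counter()
    for level, is_vulnerable in files:
        (vuln if is_vulnerable else neut)[level] += 1
    odds = {lvl: vuln[lvl] / neut[lvl] for lvl in neut if lvl in vuln}
    base = odds.get("Low")
    delta = {lvl: o / base for lvl, o in odds.items()} if base else {}
    return odds, delta
```

For instance, a level with odds 7.96E-02 against a baseline of 4.57E-02 gives a change in odds of about 1.74, consistent with the ∆Medium entry for the High level.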

Results | Thresholds 33 Thresholds

Are thresholds of vulnerability discovery metrics effective at classifying risk from vulnerabilities?

On average, non-trivial risk levels delineated by thresholds of generalizable vulnerability discovery metrics captured 23.85% of the historically-vulnerable files in Chrome, providing support for the effectiveness of the thresholds in classifying risk from vulnerabilities.

Results | Thresholds 34

Insights

https://chromium-review.googlesource.com/c/chromium/src/+/1552152

We believe all files modified in this change are at a higher risk of having undiscovered vulnerabilities. These suspect files are presented below along with the evidence that was used to support the assessment.

• chrome/browser/about_flags.cc - The file has been changed a lot (churn at 97th percentile) by many developers who also changed many other files (contribution at 100th percentile) with these developers belonging to disparate developer groups (collaboration at 72nd percentile) and is hard to test exhaustively (nesting at 75th percentile).
• chrome/browser/android/chrome_feature_list.cc - The file has been changed by many developers who also changed many other files (contribution at 99th percentile) with these developers belonging to disparate developer groups (collaboration at 93rd percentile).
• chrome/browser/android/chrome_feature_list.h - The file has been changed by many developers who also changed many other files (contribution at 90th percentile).
• chrome/browser/flag_descriptions.cc - The file has been changed a lot (churn at 93rd percentile) by many developers who also changed many other files (contribution at 99th percentile).
• chrome/browser/flag_descriptions.h - The file has been changed a lot (churn at 91st percentile) by many developers who also changed many other files (contribution at 99th percentile).


Insights 37

Insights 38

Summary | Humanize existing vulnerability discovery knowledge

[Figure: Literature → Metrics → Feedback → Conversation → Secure Software]

| Feedback is central to realizing our vision | When to show the feedback? | Threshold-delineated risk to identify suspect files | Ongoing Work | What should the feedback contain? | Where should the feedback be shown?

https://samaritan.github.io/  /samaritan

Summary 40