
Understanding the Reproducibility of Crowd-reported Security Vulnerabilities

†‡Dongliang Mu∗, ‡Alejandro Cuevas, §Limin Yang, §Hang Hu, ‡Xinyu Xing, †Bing Mao, §Gang Wang
†National Key Laboratory for Novel Software Technology, Nanjing University, China
‡College of Information Sciences and Technology, The Pennsylvania State University, USA
§Department of Computer Science, Virginia Tech, USA
[email protected], [email protected], {liminyang, hanghu}@vt.edu, [email protected], [email protected], [email protected]

Abstract

Today's software systems are increasingly relying on the "power of the crowd" to identify new security vulnerabilities. And yet, it is not well understood how reproducible the crowd-reported vulnerabilities are. In this paper, we perform the first empirical analysis on a wide range of real-world security vulnerabilities (368 in total) with the goal of quantifying their reproducibility. Following a carefully controlled workflow, we organize a focused group of security analysts to carry out reproduction experiments. With 3600 man-hours spent, we obtain quantitative evidence on the prevalence of missing information in vulnerability reports and the low reproducibility of the vulnerabilities. We find that relying on a single vulnerability report from a popular security forum is generally difficult to succeed due to the incomplete information. By widely crowdsourcing the information gathering, security analysts could increase the reproduction success rate, but still face key challenges to troubleshoot the non-reproducible cases. To further explore solutions, we surveyed hackers, researchers, and engineers who have extensive domain expertise in software security (N=43). Going beyond Internet-scale crowdsourcing, we find that security professionals heavily rely on manual debugging and speculative guessing to infer the missed information. Our result suggests that there is not only a necessity to overhaul the way a security forum collects vulnerability reports, but also a need for automated mechanisms to collect information commonly missing in a report.

1 Introduction

Security vulnerabilities in software systems are posing a serious threat to users, organizations and even nations. In 2017, unpatched vulnerabilities allowed the WannaCry ransomware cryptoworm to shut down more than 300,000 computers around the globe [24]. Around the same time, another vulnerability in Equifax's Apache servers led to a devastating data breach that exposed half of the American population's Social Security Numbers [48].

Identifying security vulnerabilities has been increasingly challenging. Due to the high complexity of modern software, it is no longer feasible for in-house teams to identify all possible vulnerabilities before a software release. Consequently, an increasing number of software vendors have begun to rely on "the power of the crowd" for vulnerability identification. Today, anyone on the Internet (e.g., white hat hackers, security analysts, and even regular software users) can identify and report a vulnerability. Companies such as Google and Microsoft are spending millions of dollars on their "bug bounty" programs to reward vulnerability reporters [38, 54, 41]. To further raise community awareness, the reporter may obtain a Common Vulnerabilities and Exposures (CVE) ID and archive the entry in various online vulnerability databases. As of December 2017, the CVE website has archived more than 95,000 security vulnerabilities.

Despite the large number of crowd-reported vulnerabilities, there is still a major gap between vulnerability reporting and vulnerability patching. Recent measurements show that it takes a long time, sometimes multiple years, for a vulnerability to be patched after the initial report [43]. In addition to the lack of awareness, anecdotal evidence also asserts the poor quality of crowdsourced reports.

∗Work was done while visiting The Pennsylvania State University.

For example, a Facebook user once identified a vulnerability that allowed attackers to post messages onto anyone's timeline. However, the initial report had been ignored by Facebook engineers due to "lack of enough details to reproduce the vulnerability", until the Facebook CEO's timeline was hacked [18].

As more vulnerabilities are reported by the crowd, the reproducibility of the vulnerability becomes critical for software vendors to quickly locate and patch the problem. Unfortunately, a non-reproducible vulnerability is more likely to be ignored [53], leaving the affected system vulnerable. So far, related research efforts have primarily focused on vulnerability notifications and generating security patches [26, 35, 43, 45]. The vulnerability reproduction, as a critical early step for risk mitigation, has not been well understood.

In this paper, we bridge the gap by conducting the first in-depth empirical analysis on the reproducibility of crowd-reported vulnerabilities. We develop a series of experiments to assess the usability of the information provided by the reporters by actually attempting to reproduce the vulnerabilities. Our analysis seeks to answer three specific questions. First, how reproducible are the reported vulnerabilities using only the provided information? Second, what factors have made certain vulnerabilities difficult to reproduce? Third, what actions could software vendors (and the vulnerability reporters) take to systematically improve the efficiency of reproduction?

Assessing Reproducibility. The biggest challenge is that reproducing a vulnerability requires almost exclusively manual efforts, and requires the "reproducer" to have highly specialized knowledge and skill sets. It is difficult for a study to achieve both depth and scale at the same time. To these ends, we prioritize depth while preserving a reasonable scale for generalizable results. More specifically, we focus on memory error vulnerabilities, which are ranked among the most dangerous software errors [7] and have caused significant real-world impacts (e.g., Heartbleed, WannaCry). We organize a focused group of highly experienced security researchers and conduct a series of controlled experiments to reproduce the vulnerabilities based on the provided information. We carefully design a workflow so that the reproduction results reflect the value of the information in the reports, rather than the analysts' personal hacking skills.

Our experiments demanded 3600 man-hours to finish, covering a dataset of 368 memory error vulnerabilities (291 CVE cases and 77 non-CVE cases) randomly sampled from those reported in the last 17 years. For CVE cases, we crawled all the 4,694 references (e.g., technical reports, blogs) listed on the CVE website as information sources for the reproduction. We consider these references as the crowd-sourced vulnerability reports which contain the detailed information for vulnerability reproduction. We argue that the size of the dataset is reasonably large. For example, prior works have used reported vulnerabilities to benchmark their vulnerability detection and patching tools. Most datasets are limited to less than 10 vulnerabilities [39, 29, 40, 46, 25], or at the scale of tens [55, 56, 27, 42], due to the significant manual efforts needed to build ground truth data.

We have a number of key observations. First, individual vulnerability reports from popular security forums have an extremely low success rate of reproduction (4.5% – 43.8%) caused by missing information. Second, a "crowdsourcing" approach that aggregates information from all possible references helps to recover some but not all of the missed fields. After information aggregation, 95.1% of the 368 vulnerabilities still missed at least one required information field. Third, it is not always the most commonly missed information that foiled the reproduction. Most reports did not include details on software installation options and configurations (87%+), or the affected operating system (OS) (22.8%). While such information is often recoverable using "common sense" knowledge, the real challenges arise when the vulnerability reports missed the Proof-of-Concept (PoC) files (11.7%) or, more often, the methods to trigger the vulnerability (26.4%). Based on the aggregated information and common sense knowledge, only 54.9% of the reported vulnerabilities can be reproduced.

Recovering the missed information is even more challenging given the limited feedback on "why a system did not crash". To recover the missing information, we identified useful heuristics through extensive manual debugging and troubleshooting, which increased the reproduction rate to 95.9%. We find it helpful to prioritize testing the information fields that are likely to require non-standard configurations. We also observe useful correlations between "similar" vulnerability reports, which can provide hints to reproduce the poorly documented ones. Despite these heuristics, we argue that significant manual efforts could have been saved if the reporting system required a few mandated information fields.

Survey. To validate our observations, we surveyed external security professionals from both academia and industry¹. We received 43 valid responses from 10 different institutions, including 2 industry labs, 6 academic groups and 2 Capture The Flag (CTF) teams. The survey results confirmed the prevalence of missing information in vulnerability reports, and provided insights into common ad-hoc techniques used to recover missing information.

¹Our study received the approval from our institutions' IRB (#STUDY00008566).

Dataset. To facilitate future research, we will share our fully tested and annotated dataset of 368 vulnerabilities (291 CVE and 77 non-CVE)². Based on the insights obtained from our measurements and user study, we create a comprehensive report for each case where we filled in the missing information, attached the correct PoC files, and created an appropriate Docker Image/File to facilitate a quick reproduction. This can serve as a much needed large-scale evaluation dataset for researchers.

In summary, our contributions are four-fold:

• First, we perform the first in-depth analysis on the reproducibility of crowd-reported security vulnerabilities. Our analysis covers 368 real-world memory error vulnerabilities, which is the largest benchmark dataset to the best of our knowledge.
• Second, our results provide quantitative evidence on the poor reproducibility, due to the prevalence of missing information, in vulnerability reports. We also identify key factors which contribute to reproduction failures.
• Third, we conduct a user study with real-world security researchers from 10 different institutions to validate our findings, and provide suggestions on how to improve the vulnerability reproduction efficiency.
• Fourth, we share our full benchmark dataset of reproducible vulnerabilities (which took 3000+ man-hours to construct).

²Dataset release: https://github.com/VulnReproduction/LinuxFlaw

2 Background and Motivations

We start by introducing the background of security vulnerability reporting and reproduction. We then proceed to describe our research goals.

Security Vulnerability Reporting. In the past decade, there has been a successful crowdsourcing effort from security professionals and software users to report and share their identified security vulnerabilities. When people identify a vulnerability, they can request a CVE ID from CVE Numbering Authorities (i.e., MITRE Corporation). After the vulnerability can be publicly released, the CVE ID and corresponding vulnerability information will be added to the CVE list [5]. The CVE list is supplied to the National Vulnerability Database (NVD) [14] where analysts can perform further investigations and add additional information to help the distribution and reproduction. The Common Vulnerability Scoring System (CVSS) also assigns "severity scores" to vulnerabilities.

CVE Website and Vulnerability Report. The CVE website [5] maintains a list of known vulnerabilities that have obtained a CVE ID. Each CVE ID has a web page with a short description about the vulnerability and a list of external references. The short description only provides a high-level summary. The actual technical details are contained in the external references. These references could be constituted by technical reports, blog/forum posts, or sometimes a PoC. It is often the case, however, that the PoC is not available and the reporter only describes the vulnerability, leaving the task of crafting PoCs to the community.

There are other websites that often act as "external references" for the CVE pages. Some websites primarily collect and archive the public exploits and PoC files for known vulnerabilities (e.g., ExploitDB [9]). Other websites directly accept vulnerability reports from users, and support user discussions (e.g., Redhat Bugzilla [16], OpenWall [15]). Websites such as SecurityTracker [20] and SecurityFocus [21] aim to provide more complete and structured information for known vulnerabilities.

Memory Error Vulnerability. A memory error vulnerability is a security vulnerability that allows attackers to manipulate in-memory content to crash a program or obtain unauthorized access to a system. Memory error vulnerabilities such as "Stack Overflows", "Heap Overflows", and "Use After Free" have been ranked among the most dangerous software errors [7]. Popular real-world examples include the Heartbleed vulnerability (CVE-2014-0160) that affected millions of servers and devices running HTTPS. A more recent example is the vulnerability exploited by the WannaCry cryptoworm (CVE-2017-0144) which shut down 300,000+ servers (e.g., those in hospitals and schools) around the globe. Our paper primarily focuses on memory error vulnerabilities due to their high severity and real-world impact.

Vulnerability Reproduction. Once a security vulnerability is reported, there is a constant need for people to reproduce the vulnerability, especially highly critical ones. First and foremost, developers and vendors of the vulnerable software will need to reproduce the reported vulnerability to analyze the root causes and generate security patches. Analysts from security firms also need to reproduce and verify the vulnerabilities to assess the corresponding threats to their customers and facilitate threat mitigations. Finally, security researchers often rely on known vulnerabilities to benchmark and evaluate their vulnerability detection and mitigation techniques.

Our Research Questions. While existing works focus on vulnerability identification and patches [53, 26, 35, 43, 45], there is a lack of systematic understanding of the vulnerability reproduction problem. Reproducing a vulnerability is a prerequisite step when diagnosing and eliminating a security threat. Anecdotal evidence suggests that vulnerability reproduction is extremely labor-intensive and time-consuming [18, 53].

Our study seeks to provide a first in-depth understanding of the reproduction difficulties of crowd-reported vulnerabilities while exploring solutions to boost the reproducibility. Using the memory error vulnerability reports as examples, we seek to answer three specific questions. First, how reproducible are existing security vulnerability reports based on the provided information? Second, what are the root causes that contribute to the difficulty of vulnerability reproduction? Third, what are possible ways to systematically improve the efficiency of vulnerability reproduction?

3 Methodology and Dataset

To answer these questions, we describe our high-level approach and collect the dataset for our experiments.

3.1 Methodology Overview

Our goal is to systematically measure the reproducibility of existing security vulnerability reports. There are a number of challenges to perform this measurement.

Challenges. The first challenge is that reproducing a vulnerability based on existing reports requires almost exclusively manual efforts. All the key steps of reproduction (e.g., reading the technical reports, installing the vulnerable software, and triggering and analyzing the crash) are different for each case, and thus cannot be automated. To analyze a large number of vulnerability reports in depth, we are required to recruit a big group of analysts to work full time for months; this is an unrealistic expectation. The second challenge is that successful vulnerability reproduction may also depend on the knowledge and skills of the security analysts. In order to provide a reliable assessment, we need to recruit real domain experts to eliminate the impact of the analysts' own limitations.

Approaches. Given the above challenges, it is difficult for our study to achieve both depth and scale at the same time. We decide to prioritize the depth of the analysis while maintaining a reasonable scale for generalizable results. More specifically, we select one severe type of vulnerability (i.e., memory error vulnerability), which allows us to form a focused group of domain experts to work on the vulnerability reproduction experiments. We design a systematic procedure to assess the reproducibility of the vulnerability based on available information (instead of the hacking skills of the experts). In addition, to complement our empirical measurements, we conduct a user study with external security professionals from both academia and industry. The latter will provide us with their perceptions towards existing vulnerability reports and the reproduction process. Finally, we combine the results of the first two steps to discuss solutions to facilitate efficient vulnerability reproduction and improve the usability of current vulnerability reports.

3.2 Vulnerability Report Dataset

For our study, we gather a large collection of reported vulnerabilities from the past 17 years. In total, we collect two datasets: a primary dataset of vulnerabilities with CVE IDs, and a complementary dataset for vulnerabilities that do not yet have a CVE ID (Table 1). We focus on memory error vulnerabilities due to their high severity and significant real-world impact. In addition to famous examples such as Heartbleed and WannaCry, there are more than 10,000 memory error vulnerabilities listed on the CVE website. We crawled the pages of the current 95K+ entries (2001 – 2017) and analyzed their severity scores (CVSS). Our result shows that the average CVSS score for memory error vulnerabilities is 7.6, which is clearly higher than the overall average (6.2), confirming their severity.

Table 1: Dataset overview.
Dataset  | Vulnerabilities | PoCs | All Refs | Valid Refs
CVE      | 291             | 332  | 6,044    | 4,694
Non-CVE  | 77              | 80   | 0        | 0
Total    | 368             | 412  | 6,044    | 4,694

Defining Key Terms. To avoid confusion, we define a few terms upfront. We refer to the web page of each CVE ID on the CVE website as a CVE entry. In each CVE entry's reference section, the cited websites are referred to as information source websites or simply source websites. The source websites provide detailed technical reports on each vulnerability. We consider these technical reports on the source websites as the crowd-sourced vulnerability reports for our evaluation.

Primary CVE Dataset. We first obtain a random sample of 300 CVE entries [5] on memory error vulnerabilities in Linux software (2001 to 2017). We focus on Linux software for two reasons. First, reproducing a vulnerability typically requires the source code of the vulnerable software (e.g., compilation options may affect whether the binary is vulnerable). The open-sourced Linux software and Linux kernel make such analysis possible. As a research group, we cannot analyze closed-sourced software (e.g., most Windows software), but the methodology is generally applicable (i.e., software vendors have access to their own source code). Second, Linux-based vulnerabilities have a high impact. Most enterprise servers, data center nodes, supercomputers, and even Android devices run Linux [8, 57].

[Figure 1: Vulnerability type. Number of vulnerabilities per category: BSS Overflow, Stack Overflow, Heap Overflow, Format String, Integer Overflow, Null Pointer, Invalid Free, Other.]

[Figure 2: Number of vulnerabilities over time (2001-2003 through 2016-2017).]

[Figure 3: Workflow of reproducing a vulnerability. Report Gathering (collect vulnerability report(s)) -> Environment Setup (retrieve software name, version, and PoC; set up a suitable operating system) -> Software Preparation (install vulnerable software; configure vulnerable software) -> Reproduction (attempt to trigger the vulnerability; verify existence of vulnerability).]

From the 300 CVE entries, we obtain 291 entries where the software has the source code. In the past 17 years, there have been about 10,000 CVE entries on memory error vulnerabilities and about 2,420 are related to Linux software³. Our sampling rate is about 12%. For each CVE entry, we collect the references directly listed in the References section and also iteratively include references contained within the direct references. Out of the total 6,044 external reference links, 4,694 web pages were still available for crawling. In addition, we collect the proof-of-concept (PoC) files for each CVE ID if the PoCs are attached in the vulnerability reports. Certain CVE IDs have multiple PoCs, representing different ways of exploiting the vulnerability.

Complementary Non-CVE Dataset. Since some entities may not request CVE IDs for the vulnerabilities they identified, we also obtain a small sample of vulnerabilities that do not yet have a CVE ID. In this way, we can enrich and diversify our vulnerability reports. Our non-CVE dataset is collected from ExploitDB [9], the largest archive of public exploits. At the time of writing, there are about 1,219 exploits of memory error vulnerabilities in Linux software listed on ExploitDB. Of these, 316 do not have a CVE ID. We obtain a random sample of 80 vulnerabilities; 77 of them have their source code available and are included in our dataset.

Justifications on the Dataset Size. We believe the 368 memory error vulnerabilities (291 on CVE, about 12% of coverage) form a reasonably large dataset. To better contextualize the size of the dataset, we reference recent papers that use vulnerabilities on the CVE list to evaluate their vulnerability detection/patching systems. Most of the datasets are limited to less than 10 vulnerabilities [39, 34, 32, 30, 33, 25, 46, 40, 29], while only a few larger studies achieve a scale of tens [55, 56]. The only studies that can scale well are those which focus on the high-level information in the CVE entries without the need to perform any code analysis or vulnerability verifications [43].

Preliminary Analysis. Our dataset covers a diverse set of memory error vulnerabilities, 8 categories in total as shown in Figure 1. We obtained the vulnerability types from the CVE entry's description or its references. The vulnerability type was further verified during the reproduction. Stack Overflow and Heap Overflow are the most common types. The Invalid Free category includes both "Use After Free" and "Double Free". The Other category covers a range of other memory related vulnerabilities such as "Uninitialized Memory" and "Memory Leakage". Figure 2 shows the number of vulnerabilities in different years. We divide the vulnerabilities into 6 time bins (five 3-year periods and one 2-year period). This over-time trend of our dataset is relatively consistent with that of the entire CVE database [23].

³We performed direct measurements instead of using 3rd-party websites (e.g., cvedetails.com). 3rd-party websites often mix memory error vulnerabilities with other bigger categories (e.g., "overflow").

4 Reproduction Design

Given the vulnerability dataset, we design a systematic procedure to measure its reproducibility. Our experiments seek to identify the key information fields that contribute to the success of reproduction while measuring the information fields that are commonly missing from existing reports. In addition, we examine the most popular information sources cited by the CVE entries and their contributions to the reproduction.

4.1 Reproduction Workflow

To assess the reproducibility of a vulnerability, we design a workflow which delineates vulnerability reproduction as a 4-stage procedure (see Figure 3). At the report gathering stage, a security analyst collects reports and sources tied to a vulnerability. At the environment setup stage, he identifies the target version(s) of the vulnerable software, finds the corresponding source code (or binary), and sets up the operating system for that software. At the software preparation stage, the security analyst compiles and installs the vulnerable software by following the compilation and configuration options given in the report or software specification. Sometimes, he also needs to ensure the libraries needed for the vulnerable software are correctly installed. At the vulnerability reproduction stage, he triggers and verifies the vulnerability by using the PoC provided in the vulnerability report.
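As a concrete illustration of this 4-stage procedure, the sketch below automates the environment setup, software preparation, and reproduction stages with shell commands. It is a minimal sketch under our own assumptions: the VulnReport fields, the Docker-based OS setup, and the crash-detection heuristic are illustrative choices, not tooling from the study.

```python
#!/usr/bin/env python3
"""Minimal sketch of the 4-stage reproduction workflow (Section 4.1).

All field names and helper commands here are assumptions for illustration;
the actual reproduction in the study was performed manually by analysts.
"""
import subprocess
from dataclasses import dataclass, field

@dataclass
class VulnReport:                       # hypothetical, simplified report structure
    software: str                       # e.g., "examplelib"
    version: str                        # e.g., "1.2.3"
    os_image: str                       # e.g., "ubuntu:14.04"
    install_cmds: list = field(default_factory=lambda: ["./configure", "make", "make install"])
    poc_path: str = "poc.bin"           # proof-of-concept file
    trigger_cmd: str = ""               # how the PoC is fed to the target

def run(cmd: str, cwd: str = None) -> int:
    """Run a shell command and return its exit code."""
    return subprocess.run(cmd, shell=True, cwd=cwd).returncode

def reproduce(report: VulnReport, src_dir: str) -> bool:
    # Stage 1 (report gathering) is manual; we assume `report` is already filled in.
    # Stage 2: environment setup, e.g., pulling an OS image matching the report year.
    run(f"docker pull {report.os_image}")
    # Stage 3: software preparation with the reported build/configure options.
    for cmd in report.install_cmds:
        if run(cmd, cwd=src_dir) != 0:
            return False                # build failure: cannot even attempt reproduction
    # Stage 4: reproduction; a memory error counts as reproduced if the program crashes.
    rc = run(report.trigger_cmd or f"./{report.software} {report.poc_path}", cwd=src_dir)
    return rc < 0 or rc in (134, 139)   # SIGABRT/SIGSEGV appear as 134/139 via the shell
```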

In our experiment, we restrict security analysts to follow this procedure, and use only the instructions and references tied to vulnerability reports. In this way, we can objectively assess the quality of the information in existing reports, making the results not (or less) dependent on the personal hacking ability of the analysts.

4.2 The Analyst Team

We have formed a strong team of 5 security analysts to carry out our experiment. Each analyst not only has in-depth knowledge of memory error vulnerabilities, but also has first-hand experience analyzing vulnerabilities, writing exploits, and developing patches. The analysts regularly publish at top security venues, have rich CTF experience, and have discovered and reported over 20 new vulnerabilities, which are listed on the CVE website. In this way, we ensure that the analysts are able to understand the information in the reports and follow the pre-defined workflow to generate reliable results. To provide the "ground-truth reproducibility", the analysts work together to reproduce as many vulnerabilities as possible. If a vulnerability cannot be reproduced by one analyst, other analysts will try again.

4.3 Default Settings

Ideally, a vulnerability report should contain all the necessary information for a successful reproduction. In practice, however, the reporters may assume that the reports will be read by security professionals or software engineers, and thus certain "common sense" information can be omitted. For example, if a vulnerability does not rely on special configuration options, the reporter might believe it is unnecessary to include software installation details in the report. To account for this, we develop a set of default settings to apply when the corresponding details are not available in the original report. We set the default settings as a way of modeling the basic knowledge of software analysis.

• Vulnerable Software Version. This is the "must-have" information in a report. Exhaustively guessing and validating the vulnerable version is extremely time-consuming; this is an unreasonable burden for the analysts. If the version information is missing, we regard the reproduction as a failure.
• Operating System. If not explicitly stated, the default OS will be a Linux system that was released in (or slightly before) the year when the vulnerability was reported. This allows us to build the software with the appropriate dependencies.
• Installation & Configuration. We prioritize compiling using the source code of the vulnerable program. If the compilation and configuration parameters are not provided, we install the package based on the default building system specified in the software package (see Table 3). Note that we do not introduce any extra compilation flags beyond those required for installation.
• Proof-of-Concept (PoC). Without a PoC, the vulnerability reproduction will be regarded as a failed attempt because it is extremely difficult to infer the PoC based on the vulnerability description alone.
• Trigger Method for PoC. If there is a PoC without details on the trigger method, we attempt to infer it based on the type of the PoC. Table 2 shows the default trigger methods tied to different PoC types.
• Vulnerability Verification. A report may not specify the evidence of a program failure pertaining to the vulnerability. Since we deal with memory error vulnerabilities, we deem the reproduction to be successful if we observe the unexpected program termination (or program "crash").

Table 2: Default trigger method for proof-of-concept (PoC) files.
Type of PoC                    | Default Action
Shell commands                 | Run the commands with the default shell
Script program (e.g., python)  | Run the script with the appropriate interpreter
C/C++ code                     | Compile the code with default options and run it
A long string                  | Directly input the string to the vulnerable program
A malformed file (e.g., jpeg)  | Input the file to the vulnerable program

Table 3: Default install commands.
Building System       | Default Commands
automake              | make; make install
autoconf & automake   | ./configure; make; make install
cmake                 | mkdir build; cd build; cmake ../; make; make install
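The sketch below encodes these defaults as a small lookup, roughly following Tables 2 and 3. The PoC-type labels and the dispatch logic are our own simplification for illustration, not part of the study's tooling.

```python
# Default settings from Tables 2 and 3, encoded as a simple lookup/dispatch.
# The poc_type labels below are assumed names, not a standard taxonomy.
DEFAULT_INSTALL = {
    "automake": ["make", "make install"],
    "autoconf & automake": ["./configure", "make", "make install"],
    "cmake": ["mkdir build", "cd build && cmake ../ && make && make install"],
}

def default_trigger(poc_type: str, poc: str, target: str) -> str:
    """Return the default command used to trigger a PoC of a given type (Table 2)."""
    if poc_type == "shell":      # shell commands: run them with the default shell
        return f"sh {poc}"
    if poc_type == "script":     # script program: run with the appropriate interpreter
        return f"python {poc}"
    if poc_type == "c_code":     # C/C++ code: compile with default options and run it
        return f"cc {poc} -o poc_bin && ./poc_bin"
    if poc_type == "string":     # a long string: feed the string to the vulnerable program
        return f'{target} "$(cat {poc})"'
    return f"{target} {poc}"     # a malformed file: pass the file to the vulnerable program

# Example: a malformed JPEG against a hypothetical image parser binary.
print(default_trigger("file", "crash.jpeg", "./imgparse"))   # ./imgparse crash.jpeg
```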
4.4 Controlled Information Sources

For a given CVE entry, the technical details are typically available in the external references. We seek to examine the quality of the information from different sources. More specifically, we select the most cited websites across CVE entries and attempt to reproduce the vulnerability using the information from individual sources alone. This allows us to compare the quality of information from different sources. We then combine all the sources of information to examine the actual reproducibility.

The top 5 referenced websites in our dataset are: SecurityFocus, Redhat Bugzilla, ExploitDB, OpenWall, and SecurityTracker. Table 4 shows the number of CVE IDs each source website covers in our dataset. Collectively, 287 out of 291 CVE entries (98.6%) have cited at least one of the top 5 source websites. To examine the importance of these 5 source websites to the entire CVE database, we analyzed the full set of 95K CVE IDs. We show that these 5 websites are among the top 10 most cited websites, covering 71,358 (75.0%) CVE IDs.

Table 4: Statistics of the reproduction results. The overall rate is calculated using the total number of CVE entries (291) and non-CVE entries (77) as the base, respectively. The right-hand columns count the vulnerability reports with missing information.

CVE Reproduction (N=291):
Exp. Setting     | Covered CVE IDs | Succeed # (%) | Overall Rate (%) | Software Version | Software Install. | Software Config. | OS Info. | PoC File | Trigger Method | Vulnerability Verification
SecurityFocus    | 256 | 32 (12.6%)  | 11.0% | 9  | 255 | 233 | 116 | 131 | 210 | 227
Redhat Bugzilla  | 195 | 19 (9.7%)   | 6.5%  | 48 | 195 | 179 | 0   | 154 | 168 | 147
ExploitDB        | 156 | 46 (29.5%)  | 15.8% | 5  | 155 | 137 | 132 | 20  | 100 | 111
OpenWall         | 153 | 67 (43.8%)  | 23.0% | 28 | 152 | 140 | 153 | 72  | 72  | 71
SecurityTracker  | 89  | 4 (4.5%)    | 1.4%  | 3  | 87  | 71  | 73  | 69  | 62  | 61
Combined-top5    | 287 | 126 (43.9%) | 43.3% | 3  | 284 | 259 | 55  | 70  | 125 | 138
Combined-all     | 291 | 182 (62.5%) | 62.5% | 1  | 280 | 256 | 52  | 17  | 82  | 106

Non-CVE Reproduction (N=77):
Combined-all     | 77  | 20 (25.6%)  | 25.6% | 0  | 70  | 67  | 32  | 26  | 15  | 26

Given a CVE entry, we follow the aforementioned workflow, and conduct 3 experiments using different information sources:

• CVE Single-source. We test the information from each of the top 5 source websites one by one (if the website is cited). To assess the quality of the information only within the report, we do not use any information which is not directly available on the source website (849 experiments). That is, we do not use information contained in external references.
• CVE Combined-top5. We examine the combined information from all the 5 source websites. Similar to the single-source setting, we do not follow their external links (287 experiments).
• CVE Combined-all. Finally, we combine all the information contained in the original CVE entry, in the direct references, and in the references contained within the direct references (291 experiments).

Non-CVE entries typically do not contain references, so we do not perform the controlled analysis. Instead, we directly run "combined-all" experiments (77 experiments). In total, our security analysts run 1,504 experiments to complete the study procedure.

5 Measurement Results

Next, we describe our measurement results with a focus on the time spent on the vulnerability reproduction, the reproduction success rate, and the key contributing factors to the reproduction success.

5.1 Time Spent

The three experiments take 5 security analysts about 1600 man-hours to finish. On average, each vulnerability report for CVE cases takes about 5 hours for all the proposed tests, and each vulnerability report for non-CVE cases takes about 3 hours. Based on our experience, the most time-consuming part is to set up the environment and compile the vulnerable software with the correct options. For vulnerability reports without a usable PoC, it takes even more time to read the code in the PoC files and test different trigger methods. After combining all the available information and applying the default settings, we successfully reproduced 202 out of 368 vulnerabilities (54.9%).

5.2 Reproducibility

Table 4 shows the breakdown of the reproduction results. We also measured the level of missing information in the vulnerability reports and the references. We calculate two key metrics: the true success rate and the overall success rate. The true success rate is the ratio of the number of successfully reproduced vulnerabilities over the number of vulnerabilities that a given information source covers. The overall success rate takes the coverage of the given information source into account: it is the ratio of the successful cases over the total number of vulnerabilities in our dataset.
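As a worked example of the two metrics, the snippet below computes both rates for the OpenWall row of Table 4:

```python
# True vs. overall success rate, computed for the OpenWall row of Table 4.
covered, succeeded, total_cves = 153, 67, 291
true_rate = succeeded / covered        # 67 / 153 = 43.8%: success among the CVE IDs OpenWall covers
overall_rate = succeeded / total_cves  # 67 / 291 = 23.0%: success over the whole CVE dataset
print(f"true: {true_rate:.1%}  overall: {overall_rate:.1%}")
```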

If a vulnerability has multiple PoCs associated with it, as long as one of the PoCs turns out to be successful, we regard this vulnerability as reproducible. Based on Table 4, we have four key observations.

First, the single-source setting returns a low true success rate and an even lower overall success rate. OpenWall has the highest true success rate (43.8%) as we found a number of high-quality references that documented the detailed instructions. However, OpenWall only covers 153 CVE IDs, which lowers its overall success rate to 23.0%. Contrarily, SecurityFocus and Redhat Bugzilla cover more CVE IDs (256 and 195) but have much lower true success rates (12.6% and 9.7%). Particularly, SecurityFocus mainly summarizes the vulnerabilities, but the information does not directly help the reproduction. ExploitDB falls in the middle, with a true success rate of 29.5% on 156 CVE IDs. SecurityTracker has the lowest coverage and true success rate.

Second, combining the information of the top 5 websites has clearly improved the true success rate (43.9%). The overall success rate also improved (43.3%), since the top 5 websites collectively cover more CVE IDs (287 out of 291). The significant increases in both rates suggest that each information source has its own unique contributions. In other words, there is relatively low redundancy between the 5 source websites.

Third, we can further improve the overall success rate to 62.5% by iteratively reading through all the references. To put this effort into context, combined-top5 involves reading 849 referenced articles, and combined-all involves significantly more articles to read (4,694). Most articles are completely unstructured (e.g., technical blogs), and it takes extensive manual efforts to extract the useful information. To the best of our knowledge, it is still an open challenge for NLP algorithms to accurately interpret the complex logic in technical reports [60, 52, 44]. Our case is more challenging due to the prevalence of special technical terms, symbols, and even code snippets mixed in unstructured English text.

Finally, for the 77 vulnerabilities without a CVE ID, the success rate is 25.6%, which is lower compared to that of all the CVE cases (combined-all). Recall that non-CVE cases are contributed by the ExploitDB website. If we only compare them with the CVE cases from ExploitDB, the true success rate is more similar (29.5%). After we aggregate the results for both CVE and non-CVE cases, the overall success rate is only 54.9%. Considering the significant efforts spent on each case, the result indicates poor usability and reproducibility in crowdsourced vulnerability reports.

5.3 Missing Information

We observe that it is extremely common for vulnerability reports to miss key information fields. On the right side of Table 4, we list the number of CVE IDs that missed a given piece of information. We show that individual information sources are more likely to have incomplete information. In addition, combining different information sources helps retrieve missing pieces, particularly PoC files, trigger methods, and OS information.

Table 5: Missing information for the combined-all setting for all vulnerability reports (CVE and non-CVE). All the missing information in the "succeeded" cases was correctly recovered by the default setting.
Missing Information | Succeeded (202) | Failed (166) | All (368)
Software version    | 0 (0.0%)        | 1 (0.6%)     | 1 (0.3%)
PoC file            | 0 (0.0%)        | 43 (25.9%)   | 43 (11.7%)
Trigger method      | 14 (6.9%)       | 83 (50.0%)   | 97 (26.4%)
OS info.            | 35 (17.3%)      | 49 (29.5%)   | 84 (22.8%)
Verif. method       | 45 (22.3%)      | 87 (52.4%)   | 132 (35.8%)
Software config.    | 190 (94.1%)     | 133 (80.1%)  | 323 (87.7%)
Software install.   | 195 (96.5%)     | 155 (93.4%)  | 350 (95.1%)

In Table 5, we combine all the CVE and non-CVE entries and divide them into two groups: succeeded cases (202) and failed cases (166). Then we examine the missing information fields for each group with the combined-all setting. We show that even after combining all the information sources, at least 95.1% of the 368 vulnerabilities still missed one required information field. Most reports did not include details on software installation options and configurations (87%+), or the affected OS (22.8%); this information is often recoverable using "common sense" knowledge. Fewer vulnerabilities missed PoC files (11.7%) or methods to trigger the vulnerability (26.4%).

Missing Information vs. Reproducibility. We observe that successful cases do not necessarily have complete information. More than 94% of succeeded cases missed the software installation and configuration instructions; 22.3% of the succeeded cases missed the information on the verification methods, and 17.3% missed the operating system information. The difference between the successful and the failed cases is that the missing information of the succeeded cases can be resolved by the "common-sense" knowledge (i.e., the default settings). On the other hand, if the vulnerable software version, PoC files or the trigger method are missing, then the reproduction is prone to failure. Note that for failed cases, it is not yet clear which information field(s) are the root causes (detailed diagnosis in the next section).

5.4 Additional Factors

In addition to the completeness of information in the reports, we also explore other factors correlated to the reproduction success. In the following, we break down the results based on the types and severity levels of vulnerabilities, the complexity of the affected software, and the time factor.

Vulnerability Type. In Figure 4, we first break down the reproduction results by vulnerability type. We find that Stack Overflow vulnerabilities are the most difficult to reproduce, with a reproduction rate of 40% or lower. Recall that Stack Overflow is also the most common vulnerability type in our dataset (Figure 1).

[Figure 4: Reproduction success rate vs. the vulnerability type (combined-top5 and combined-all).]

[Figure 5: Reproduction success rate vs. the severity of the vulnerability, measured by CVSS score (bins 2-4, 4-6, 6-8, 8-10).]

[Figure 6: Reproduction success rate vs. the program size, measured by the number of lines of code (0-1K, 1K-10K, 10K-100K, 100K-1M, 1M+).]

[Figure 7: Reproduction success rate over time (2001-2003 through 2016-2017).]

Vulnerabilities such as Format String are easier to reproduce, with a reproduction rate above 70%.

Vulnerability Severity. Figure 5 shows how the severity of the reported vulnerabilities correlates with the reproducibility. The results show that highly severe vulnerabilities (CVSS score >8) are more difficult to reproduce. The results may have some correlation with the vulnerability types, since severe vulnerabilities are often related to Stack Overflow or Invalid Free. Based on our experience, such vulnerabilities often require specific triggering conditions that are different from the default settings.

Project Size. Counter-intuitively, vulnerabilities of simpler software (or smaller projects) are not necessarily easier to reproduce. As shown in Figure 6, software with less than 1,000 lines of code has a very low reproduction rate, primarily due to a lack of comprehensive reports and poor software documentation. On the other hand, well-established projects (e.g., GNU Binutils, PHP, Python) typically fall into the middle categories with 1,000–1,000,000 lines of code. These projects have a reasonably high reproduction rate (0.6–0.7) because their vulnerability reports are usually comprehensive. Furthermore, their respective communities have established good bug reporting guidelines for these projects [10, 13, 22]. Finally, large projects (with more than 1,000,000 lines of code) face difficulties in reproducing the reported vulnerabilities. We speculate that frequent memory de/allocation could introduce more severe bugs and reduce the reproducibility.

Time Factor. The time factor may also play a role in the quality of vulnerability reports. Throughout the years, new tools have been introduced to help with information collection for vulnerability reporting [19, 17, 4]. As shown in Figure 7, the reproduction success rate shows a general upward trend (except for 2013–2015), which confirms our intuition. The extreme case is 2001–2003, where none of the vulnerabilities were reproduced successfully. During 2013–2015, we have a dip in the reproduction rate due to a number of stack overflow vulnerabilities that are hard to reproduce.

In fact, the success rate is also correlated with the average number of references per CVE ID (i.e., the number of vulnerability reports from different sources). The corresponding numbers for the different time periods are 14.5, 17.4, 29.4, 20.1, 28.1, and 8.7. Intuitively, with more external references, it is easier to reproduce a vulnerability. The exception is the period of 2016–2017, which has the highest success rate but the lowest number of references per CVE ID (only 8.7). Based on our analysis, the vulnerabilities reported in the recent two years have not yet accumulated enough information online. However, there are some high-quality reports that helped to boost the success rate of reproduction.

6 Bridging the Gap

So far, our results suggest that it is extremely common for vulnerability reports to miss vital information for the reproduction. By applying our intuitive default settings (i.e., common sense knowledge), we were able to reproduce 54.9% of the vulnerabilities. However, there are still a staggering 45.1% of failed cases where the missing information cannot be resolved by common sense knowledge. In this section, we revisit the failed cases and attempt to reproduce them through extensive manual troubleshooting. We use specific examples to discuss useful techniques for recovering missing information.

6.1 Method and Result Overview

For a given "failed" case, our goal is to understand the exact underlying causes for the reproduction failure. We employ a variety of ad-hoc techniques as demanded by each case, including debugging the software and PoC files, inspecting and modifying the source code, testing the cases in multiple operating systems and versions, and searching related hints on the web. The failed cases take substantially longer to troubleshoot. Through intensive manual efforts (i.e., another 2,000 man-hours), we successfully reproduced another 94 CVE vulnerabilities and 57 non-CVE vulnerabilities, increasing the overall success rate from 54.9% to 95.9%. Combined with the previous experiments, the total time spent is 3,600 man-hours for the 5 analysts (more than 3 months). Many of the reported vulnerabilities are inherently fragile. Their successful reproduction relies on the correct deduction of non-trivial pieces of missing information. Unfortunately, there are still 15 vulnerabilities which remain unsuccessful after being attempted by all 5 analysts.

6.2 Case Studies

In the following, we present detailed case studies to illustrate techniques that are shown to be useful to recover different types of missing information.

A: Missing Software Version. As shown in Table 4, the software version information is missing in many reports, especially on individual source websites. For most of the cases (e.g., CVE-2015-7547 and CVE-2012-4412), the missed version information can be recovered by reading other external references. There is only 1 case (CVE-2017-12858) for which we cannot find the software version information in any of the cited references. Eventually, we recovered the version information from an independent tech-blog after extensive searching through search engines and forum posts.

B: Missing OS & Environment Information. If the reproduction failure is caused by the choice of OS, it is very time-consuming to troubleshoot. For instance, for the coreutils CVE-2013-0221/0222/0223, we found that the vulnerabilities only existed in a specific patch by SUSE: coreutils-i18n.patch. If the patch was not applied to the OS distribution (e.g., Ubuntu), then the vulnerability would not be triggered, despite the report claiming coreutils 8.6 is vulnerable. Another example is CVE-2011-1938, where the choice of OS has an influence on PHP's dependencies. The operating systems we chose shipped an updated libxml which did not permit the vulnerable software to be installed. This is because the updated APIs caused PHP to fail during installation. Without relevant information, an analyst needs to test a number of OS and/or library versions.

C: Missing Installation/Configuration Information. While default settings have helped recover information for many reports, they cannot handle special cases. We identified cases where the success of the reproduction directly depends on how the software was compiled. For example, the vulnerability CVE-2013-7226 is related to the use of the gd.so external library. The vulnerability would not be triggered if PHP is not compiled with the "--with-gd" option before compilation. Instead, we would get an error from a function call without definition. Similarly, CVE-2007-1001 and CVE-2006-6563 are vulnerabilities that can only be triggered if ProFTPD is configured with "--enable-ctrls" before compilation. Without this information, the security analysts (reproducers) may be misled to spend a long time debugging the PoC files and trigger methods before trying the special software configuration options.

D: Missing or Erroneous Proof-of-Concept. The PoC is arguably one of the most important pieces of information in a report. While many source websites did not directly include a PoC, we can often find the PoC files through other references. If the PoC is still missing, an analyst would have no other choice but to attempt to re-create the PoC, which requires time and in-depth knowledge of the vulnerable software.

In addition, we observe that many PoC files are erroneous. In total, we identified and fixed the errors in 33 PoC files. These errors can be something small such as a syntax error (e.g., CVE-2004-2167) or a character encoding problem that affects the integrity of the PoC (e.g., CVE-2004-1293). For cases such as CVE-2004-0597 and CVE-2014-1912, the provided PoCs are incomplete, missing certain files that are necessary for the reproduction. We had to find them on other un-referenced websites or re-create the missing pieces from scratch, which took days and even weeks to succeed.

E: Missing Trigger Method. Deducing the trigger method, similar to the PoC, requires domain knowledge. For instance, for the GAS CVE-2005-4807, simply running the given PoC will not trigger any vulnerability. Instead, by knowing how GAS works, we infer that after generating the C file, it needs to be compiled with the "-S" option to generate the malicious assembly file. This assembly file should then be passed to the GAS binary. In the same way, we observe from CVE-2006-5295, CVE-2006-4182, CVE-2010-4259, and several others, that the PoC is used to generate a payload. The payload should be fed into a correct binary to trigger the expected crash. Inferring the trigger method may be complemented with hints found in other "similar" vulnerability reports.

6.3 Observations and Lessons

Reproducing a vulnerability based on the reported information is analogous to doing a puzzle: the more pieces are missing, the more challenging the puzzle is. The reproducer's experience plays an important role in making the first educated guess (e.g., our default settings). However, common sense knowledge often fails on the "fragile" cases that require very specific conditions to be triggered successfully. When the key information is omitted, it forces the analyst to spend time doing in-depth troubleshooting. Even then, the troubleshooting techniques are limited if there are no ground-truth reference points or the software doesn't provide enough error information. In a few cases, the error logs hint at problems in a given library or a function. More often, there is no good way of knowing whether there are errors in the choice of the operating system, the trigger method, or even the PoC files. The analyst will need to exhaustively test possible combinations manually in a huge search space. This level of uncertainty significantly increases the time needed to reproduce a vulnerability. As we progressed through different cases, we identified a number of useful heuristics to increase the efficiency.

Priority of Information. Given a failed case, the key question is which piece of information is problematic. Instead of picking a random information category for in-depth troubleshooting, it is helpful to prioritize certain information categories. Based on our analysis, we recommend the following order: trigger method, software installation options, PoC, software configuration, and the operating system. In this list, we prioritize the information fields for which the default setting is more likely to fail. More specifically, now that we have successfully reproduced 95.9% of the vulnerabilities (ground truth), we can retrospectively examine what information field is still missing/wrong after the default setting is applied. As shown in Table 6, there are 74 cases where the default trigger method does not work, and 43 cases where the default software installation options were wrong. These information fields should have been resolved first before troubleshooting other fields.

Table 6: The number of successfully reproduced vulnerabilities where the default setting does not work.
Trigger Method | Software Install. | PoC File | Software Config. | OS Info. | Software Version | Verify. Method
74             | 43                | 38       | 6                | 4        | 1                | 0

Location of Vulnerability. While the reporters may not always know (and include) the information about the vulnerable modules, files, or functions, we find such information to be extremely helpful in the reproduction process. If such information were included, we would be able to directly avoid troubleshooting the compilation options and the environment setting. In addition, if the vulnerability has been patched, we find it helpful to inspect the commits for the affected files and compare the code change before and after the patch. This helps to verify the integrity of the PoC and the correctness of the trigger method.

Correlation of Different Vulnerabilities. It is surprisingly helpful to recover missing information by reading reports of other similar vulnerabilities. These include both reports of different vulnerabilities on the same software and reports of similar vulnerability types on different software. It is particularly helpful to deduce the trigger method and spot errors in PoC files. More specifically, out of the 74 cases that failed on the trigger method (Table 6), we recovered 68 cases by reading other similar vulnerability reports (16 for the same software, 52 for similar vulnerability types). In addition, out of the 38 cases that failed on the PoC files, we recovered/fixed the PoCs for 31 cases by reading the example code from other vulnerability reports. This method is less successful on other information fields such as "software installation options" and "OS environment", which are primarily recovered through manual debugging.

7 User Survey

To validate our measurement results, we conduct a survey to examine people's perceptions towards the vulnerability reports and their usability. Our survey covers a broad range of security professionals from both industry and academia, which helps calibrate the potential biases from our own analyst team.

7.1 Setups

USENIX Association 27th USENIX Security Symposium 929 18 40 7.2 Analysis and Key Findings

12

20 As shown in Figure 8, about half of our respondents have 6 been working in the field for more than 5 years. The # of Respondents # of Respondents 0 0 participants include 11 research scientists, 6 professors, DoS 5 white-hat hackers, and 1 software engineer. In addi- 0-1 years 2-5 years 6-10 years 11-20 years Gain Priv.Exe. Code Mem. Error XSS/CSRF Dir. Traversal tion, there are 17 graduate students and 3 undergradu- ate students from two university CTF teams. Figure 9 Figure 8: Years of experi- Figure 9: Familiar types of shows that most respondents (39 out of 43) are famil- ence in software security. vulnerabilities. iar with memory error vulnerabilities. In this multiple- choice question, many respondents also stated that they were familiar with other types of vulnerabilities (e.g., port, what information do you think should be included Denial of service, SQL injection). Their answers can be in the report? Q2, based on your own experience, what interpreted as a general reflection on the usability prob- information is often missing in existing vulnerability re- lem of vulnerability reports. ports? Q3 what techniques do you usually use to repro- duce the vulnerability if certain information is missing? Vulnerability Reproduction. Table 7 shows the Q1 and Q2 are open questions; their purpose is to under- results from Q1 and Q2 (open questions). We manually stand whether missing information is a common issue. extract the key points from the respondents’ answers, and Q3 has a different purpose, which is to examine the va- classify them based on the information categories. If the lidity of the “techniques” we used to recover the miss- respondent’s comments do not fit in any existing cate- ing information (Section 6) and collect additional sug- gories, we list the comment at the bottom of the table. gestions. To this end, Q3 first provides a randomized list The respondents stated that the PoC files and the Trig- of “techniques” that we have used and an open text box ger Method are the most necessary information, and yet for the participants to add other techniques. those are also more likely to be missing in the original report. In addition, the vulnerable software version is We ask another 4 questions to assess the participants’ considered necessary for reproduction, which is not often background and qualification. The questions cover (Q4) missing. Other information categories such as software the profession of the participant, (Q5) years of experi- configuration and installation are considered less impor- ence in software security, (Q6) first-hand experience us- tant. The survey results are relatively consistent with our ing vulnerability reports to reproduce a vulnerability, and empirical measurement result. (Q7) their familiarity with different types of vulnerabili- ties. The respondents also mentioned other information cat- egories. For example, 18 respondents believed that infor- Recruiting. We recruit participants that are experi- mation about “the exact location of the vulnerable code” enced in software security. This narrows down the pool was necessary for a report to be complete. Indeed, know- of potential candidates to a very small population of se- ing the exact location of the vulnerable code is help- curity professionals, which makes it challenging to do ful, especially for developing a quick patch. However, a large-scale survey. 
Recovering the Missing Information. Table 8 shows the results for Q3, where respondents check (multiple) methods they use to recover the missing information. Most respondents (35 out of 43) stated that they would manually read the PoC files and modify the PoC if necessary. In addition, respondents opt to search the CVE ID online to obtain additional information beyond the indexed references. Interestingly, respondents were less likely to ask questions online (e.g., Twitter or online forums) or to ask colleagues.

One possible explanation (based on our own experience) is that questions related to vulnerability reproduction rarely get useful answers when posted online.

Respondents also left comments in Q3's text box. These comments, however, are already covered by the listed options. For example, one respondent suggested "Bin diff", which is similar to the listed option "Read code change before and after the vulnerability patch". Another respondent suggested "Run the PoC and debug it in QEMU", which belongs to the category "Read, test, and modify the PoC file". We also compared the answers from more experienced respondents (working experience > 5 years) with those from less experienced respondents. We did not find major differences (the ranking orders are the same) and thus omit the results for brevity. Overall, the survey results provide external validation of our empirical measurement results and confirm the validity of our information recovery methods.

Information                                    Necessary   Missing
PoC files                                          17         15
Trigger method                                     17         13
Vulnerable software version                        17          1
OS information                                     13          6
Source code of vulnerable software                  4          2
Software configuration                              2          3
Vulnerability verification                          1          2
Software installation                               1          1
The exact location of the vulnerable code          18          9
Stack crash dump                                    1          0
Table 7: User responses on what information is necessary for reproduction, and what information is often missing in existing reports.

Method                                                            #
Read, test, and modify the PoC file                              35
Searching the CVE ID via search engines                          32
Read code change before and after the vulnerability patch        31
Guessing the information based on experience                     30
Search on popular forums discussing bugs (e.g., Bugzilla)        30
Searching in other similar vuln. reports (e.g., same software)   21
Asking friends and/or colleagues                                 18
Asking questions online (e.g., online forums, Twitter)           11
Bin diff, wait for PoC/exploit                                    1
Run the PoC and debug it in QEMU                                  1
Table 8: User responses on the possible methods to recover the missing information in vulnerability reports. The first 8 methods are the listed options in Q3; the last two were added by the respondents.

8 Discussion

Through both quantitative and qualitative analyses, we have demonstrated the poor reproducibility of crowd-reported vulnerabilities. In the following, we first summarize the key insights from our results and offer suggestions on improving the reproducibility of crowdsourced reports. We then use this opportunity to discuss implications for other types of vulnerabilities and future research directions. Finally, we would like to share the full "reproducible" vulnerability dataset with the community to facilitate future research.

8.1 Our Suggestions

To improve the reproducibility of reported vulnerabilities, a joint effort is likely needed from different players in the ecosystem. Here, we discuss possible approaches from the perspectives of vulnerability-reporting websites, vulnerability reporters, and reproducers.

Standardizing Vulnerability Reports. Vulnerability-reporting websites can enforce a stricter submission policy by asking reporters to include a minimal set of required information fields. For example, if the reporter has crafted the PoC, the website may require the reporter to fill in the trigger method and the compilation options in the report. At the same time, websites could also provide incentives for high-quality submissions. Currently, program managers in bug bounty programs can enforce more rigorous submission policies through cash incentives. For public disclosure websites, other incentives might be more feasible, such as community recognition [58, 53]. For example, a leaderboard (e.g., HackerOne) or an achievement system (e.g., StackExchange) can help promote high-quality reports.
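As an illustration of what such a submission policy could look like in practice, below is a minimal sketch of a server-side check that bounces reports missing required fields back to the reporter. The field names are hypothetical and would need to match a site's actual report schema.

    # Illustrative required-field check for a reporting website (field names
    # are assumptions, not an existing standard).
    REQUIRED_FIELDS = [
        "software_name",
        "software_version",
        "vulnerability_type",
        "poc",               # attachment or link to the proof-of-concept input
        "trigger_method",    # exact command line used to trigger the crash
        "os_environment",    # distribution and version used by the reporter
    ]

    def validate_submission(report):
        """Return the list of required fields that are missing or empty."""
        return [f for f in REQUIRED_FIELDS if not report.get(f)]

    # Example: a report with a PoC but no trigger method is rejected with a
    # list of the missing fields.
    missing = validate_submission({"software_name": "gzip", "poc": "crash.gz"})
    if missing:
        print("Please complete the following fields:", ", ".join(missing))

A check of this kind is deliberately lightweight: it enforces completeness, not correctness, which is where the incentive mechanisms discussed above come in.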
Automated Tools to Assist Vulnerability Reporters. From the reporter's perspective, manually collecting all the information can be tedious and challenging. The high overhead could easily discourage crowdsourced reporting efforts, particularly if the reporting website has stricter submission guidelines. Instead of relying on purely manual effort, a more promising approach is to develop automated tools that help collect the information and generate standardized reports. Currently, there are tools available for specific systems that can aid in this task. For example, reportbug in Debian can automatically retrieve information about the vulnerable software and system. However, more research is needed to develop generally applicable tools to assist vulnerability reporters.
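In the same spirit as reportbug, a reporter-side helper could gather the environment facts that our analysis found most often missing. The sketch below is only illustrative: it assumes a Debian-like system where dpkg-query is available, and the output format is our own invention rather than any existing standard.

    import json
    import platform
    import subprocess

    def package_version(package):
        """Best-effort lookup of an installed package version via dpkg-query."""
        try:
            out = subprocess.run(
                ["dpkg-query", "-W", "-f=${Version}", package],
                capture_output=True, text=True, check=True)
            return out.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            return "unknown"

    def collect_environment(package, poc_path, trigger_cmd):
        """Assemble the environment fields of a report as a JSON string."""
        return json.dumps({
            "os_environment": platform.platform(),   # e.g., Linux-4.15.0-...
            "architecture": platform.machine(),      # e.g., x86_64
            "software_name": package,
            "software_version": package_version(package),
            "poc": poc_path,
            "trigger_method": trigger_cmd,
        }, indent=2)

    print(collect_environment("binutils", "poc.elf", "objdump -d poc.elf"))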

Vulnerability Reproduction Automation. Given the heterogeneous nature of vulnerabilities, the reproduction process is unlikely to be fully automated. Based on Figure 3, we discuss the parts that can potentially be automated to improve the efficiency of reproducers.

First, for the report-gathering step, we can potentially build automated tools to search, collect, and fuse all the available information online to generate a "reproducible" report. Our results have confirmed the benefits of merging all available information to reproduce a given vulnerability. There are many open challenges to achieving this goal, such as verifying the validity of the information and reconciling conflicting information. Second, the environment setup step is difficult to automate due to the high level of variability across reports. A potential way to improve efficiency is to let the reproducer prepare a configuration file to specify the environment requirements; automated tools can then generate a Dockerfile and a container in which the reproducer can directly verify the vulnerability. Third, the software preparation step can also be automated if the software name and vulnerable versions are well-defined; the exceptions are those cases that rely on special configuration or installation flags. Finally, the reproduction step itself would involve primarily manual operations. However, if the PoC, trigger method, and verification method are all well-defined, it is possible for the reproducer to automate the verification process.
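For the environment setup step, deriving a container from a small configuration file might look like the following sketch. The configuration keys, base image, and URL are illustrative assumptions; the paper does not prescribe a particular format.

    # Illustrative only: turn a small environment specification into a
    # Dockerfile. The keys and values below are assumptions, not a format
    # used by this study or its dataset.
    ENV_SPEC = {
        "base_image": "ubuntu:16.04",                 # OS named in the report
        "build_deps": ["build-essential", "wget"],
        "source_url": "https://example.org/software-1.2.3.tar.gz",  # placeholder
        "configure_flags": "--enable-debug",
        "poc": "poc_input",                           # PoC file shipped alongside
    }

    def to_dockerfile(spec):
        """Emit Dockerfile text that rebuilds the reported environment."""
        return "\n".join([
            "FROM " + spec["base_image"],
            "RUN apt-get update && apt-get install -y " + " ".join(spec["build_deps"]),
            "WORKDIR /vuln",
            "RUN wget " + spec["source_url"] + " -O src.tar.gz && tar xf src.tar.gz",
            # assumes the tarball unpacks into a single top-level directory
            "RUN cd */ && ./configure " + spec["configure_flags"] + " && make",
            "COPY " + spec["poc"] + " /vuln/",
        ]) + "\n"

    with open("Dockerfile", "w") as fp:
        fp.write(to_dockerfile(ENV_SPEC))

The reproducer can then use the standard docker build and docker run commands to obtain a disposable environment in which to test the PoC.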
8.2 Limitations

Other Vulnerability Types. While this study primarily focuses on memory error vulnerabilities, anecdotal evidence shows that the reproducibility problem applies to other vulnerability types. For example, a number of online forums are specially formed for software developers and users to report and discuss various types of bugs and vulnerabilities [3, 11, 12]. It is not uncommon for a discussion thread to last for weeks or even years before a reported vulnerability is eventually reproduced. For example, an Apache design error required back-and-forth discussion over 9 days to reproduce the bug [1]. In another example, a compilation error in GNU Binutils led several developers to complain about their failed attempts to reproduce the issue; the problem was left unresolved for nearly a year [2]. Nonetheless, further research is still needed to examine how our statistical results generalize to other vulnerability types.

Public vs. Private Vulnerability Reports. This paper focuses on open-source software and public vulnerability reports. Most of the software we studied employs public discussion forums and mailing lists (e.g., Bugzilla) where there is back-and-forth communication between the reporters and software developers throughout the vulnerability reproduction and patching process. The communications are public and thus can help the vulnerability reproduction of other parties (e.g., independent research teams). Although our results may not directly reflect vulnerability reproduction in private bug bounty programs, there are some connections. For example, many vulnerabilities reported to private programs go public after a certain period of time (e.g., after the vulnerabilities are fixed). To publish the CVE entry, the original vulnerability reports must be disclosed in the references [6]. A recent paper shows that vulnerability reports in private bug bounty programs also face key challenges in reproduction [53], which is complementary to our results.

8.3 Future Work

Our future work primarily focuses on automating parts of the vulnerability reproduction process. For example, our findings suggest that aggregating information across different source websites is extremely helpful when recovering missing information in individual reports. The CVE IDs can help link reports that are scattered across websites (a minimal sketch of this kind of CVE-based grouping appears at the end of this section). However, the open question is how to automatically and accurately extract and fuse the unstructured information into a single report. This is a future direction for our work. In addition, during our experiments, we noticed that certain reports made vague and seemingly unverified claims, some of which were even misleading and caused significant delays to the reproduction progress. In this analysis, we did not specifically assess the impact of erroneous information, which will require some form of automated validation technique.

8.4 Dataset Sharing

To facilitate future research, we will share our full dataset with the research community. Reproducible vulnerability reports can benefit the community in various ways. In addition to helping software developers and vendors patch the vulnerabilities, the reports can also help researchers develop and evaluate new techniques for vulnerability detection and patching. In addition, the reproducible vulnerability reports can serve as educational and training materials for students and junior analysts [53].

We have published the full dataset of 291 vulnerabilities with CVE-IDs and 77 vulnerabilities without CVE-IDs. The dataset is available at https://github.com/VulnReproduction/LinuxFlaw. For each vulnerability, we have filled in the missing pieces of information, annotated the issues we encountered during the reproduction, and created the appropriate Dockerfiles for each case. Each vulnerability report contains structured information fields (in HTML and JSON), detailed instructions on how to reproduce the vulnerability, and fully-tested PoC exploits. In the repository, we have also included pre-configured virtual machines with the appropriate environments. To the best of our knowledge, this is the largest public ground-truth dataset of real-world vulnerabilities that were manually reproduced and verified.
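As a concrete illustration of the CVE-based grouping mentioned in Section 8.3, the toy sketch below collects raw report texts that mention the same CVE ID. The helper name and the example data are our own invention; it assumes the reports have already been crawled, and extracting and reconciling the individual information fields remains the open problem discussed above.

    import re
    from collections import defaultdict

    CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")

    def group_reports_by_cve(reports):
        """Map each CVE ID to the list of (source, text) reports mentioning it.

        `reports` is an iterable of (source_name, raw_text) pairs, e.g., pages
        crawled from exploit databases, security trackers, or a Bugzilla.
        """
        grouped = defaultdict(list)
        for source, text in reports:
            for cve_id in set(CVE_PATTERN.findall(text)):
                grouped[cve_id].append((source, text))
        return grouped

    # Dummy data for illustration; reports sharing a CVE ID become candidates
    # for fusion into a single "reproducible" report.
    example = [
        ("exploit-db", "PoC for CVE-2016-0001 on Ubuntu 14.04 ..."),
        ("bugzilla",   "Crash reported, assigned CVE-2016-0001, built with gcc -O2 ..."),
    ]
    for cve, items in group_reports_by_cve(example).items():
        print(cve, "->", [src for src, _ in items])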

9 Related Work

There is a body of work investigating vulnerabilities and bug reports in both the security and software engineering communities. In the following, we summarize the key existing works and highlight the uniqueness of our work.

In the field of software engineering, past research has explored bug fixes in general (beyond just security-related bugs). Bettenburg et al. revealed some critical information needed for software bug fixes [28]. They found that reporters typically do not include this information in bug reports simply due to the lack of automated tools. Aranda et al. investigated coordination activities in bug fixing [26], demonstrating that bug elimination is strongly dependent on social, organizational, and technical knowledge that cannot be solely extracted through automated analysis of electronic repositories. Ma et al. studied bug-fixing practices in a context where software bugs are causally related across projects [45]. They found that downstream developers usually apply temporary patches while waiting for an upstream bug fix.

Similar to [26], Guo et al. also investigated how software developers communicate and coordinate in the process of bug fixing [36, 37]. They observed that bugs handled by people on the same team or working in geographical proximity were more likely to get fixed. Zhong and Su framed their investigation around automated bug fixes and found that the majority of bugs are too complicated to be automatically repaired [59]. Park et al. conducted an analysis of the additional efforts needed after initial bug fixes, finding that over a quarter of remedies are problematic and require additional repair [50]. Soto et al. conducted a large-scale study of bug-fixing commits in Java projects, observing that less than 15% of common bug-fix patterns can be matched [51]. Similar to our research, Chaparro et al. explored missing information in bug reports, but focused on automatically detecting its absence/presence [31]. Instead, our work focuses on understanding the impact of this missing information on reproducibility.

In the security field, research on vulnerability reports mainly focuses on studying and understanding the vulnerability life cycle. In a recent work, Li and Paxson conducted a large-scale empirical study of security patches, finding that security patches have a lower footprint in code bases than non-security bug fixes [43]. Frei et al. compared the patching life cycles of newly disclosed vulnerabilities, quantifying the gap between the release of an exploit and the availability of a patch [35]. Similarly, Nappa et al. analyzed the patch deployment process of more than one thousand vulnerabilities, finding that only a small fraction of vulnerable hosts apply security patches right after an exploit release [47]. Ozment and Schechter measured the rate at which vulnerabilities have been reported, finding foundational vulnerabilities to have a median lifetime of at least 2.6 years [49]. In addition to the study of vulnerability life cycles, a recent work [53] reveals differing results between hackers and testers when identifying new vulnerabilities, highlighting the importance of experience and security knowledge. In this work, we focus on understanding vulnerability reproduction, which is subsequent to software vulnerability identification.

Unlike previous works that mainly focus on security patches or bug fixes, our work seeks to tease apart vulnerability reports from the perspective of vulnerability reproduction. To the best of our knowledge, this is the first study to provide an in-depth analysis of the practical issues in vulnerability reproduction. Additionally, this is the first work to study a large number of real-world vulnerabilities through extensive manual effort.

10 Conclusion

In this paper, we conduct an in-depth empirical analysis of real-world security vulnerabilities, with the goal of quantifying their reproducibility. We show that it is generally difficult for a security analyst to reproduce a failure pertaining to a vulnerability with just a single report obtained from a popular security forum. By leveraging a crowdsourcing approach, reproducibility can be increased, but troubleshooting the failed vulnerabilities still remains challenging. We find that, apart from Internet-scale crowdsourcing and some interesting heuristics, manual effort (e.g., debugging) based on experience is the sole way to retrieve missing information from reports. Our findings align with the responses given by the hackers, researchers, and engineers we surveyed. With these observations, we believe there is a need to introduce more effective and automated ways to collect commonly missing information from reports, and to overhaul current vulnerability reporting systems by enforcing and incentivizing higher-quality reports.

Acknowledgments

We would like to thank our shepherd Justin Cappos and the anonymous reviewers for their helpful feedback. We also want to thank Jun Xu for his help reproducing vulnerabilities. This project was supported in part by NSF grants CNS-1718459, CNS-1750101, CNS-1717028, and by the Chinese National Natural Science Foundation (61272078). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any funding agencies.

References

[1] apache.org. https://bz.apache.org/bugzilla/show_bug.cgi?id=41867.
[2] hackerone.com. https://hackerone.com/reports/273946.
[3] Apache Bugzilla. https://bz.apache.org/bugzilla/.
[4] Automatic Bug Reporting Tool (ABRT) - Red Hat. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/migration_planning_guide/sect-kernel-abrt.
[5] Common Vulnerabilities and Exposures (CVE). https://cve.mitre.org/.
[6] CVE IDs and How to Obtain Them. https://vuls.cert.org/confluence/display/Wiki/CVE+IDs+and+How+to+Obtain+Them.
[7] CWE/SANS TOP 25 Most Dangerous Software Errors. http://www.sans.org/top25-software-errors/.
[8] Data Statistic for Operating System for Top 500 Supercomputers. https://www.top500.org/statistics/details/osfam/1.
[9] ExploitDB. https://www.exploit-db.com/.
[10] GCC bugs. https://gcc.gnu.org/bugs/#need.
[11] Gentoo Bugzilla. https://bugs.gentoo.org/.
[12] HackerOne. https://hackerone.com.
[13] How to report a bug - PHP. https://bugs.php.net/how-to-report.php.
[14] National Vulnerability Database (NVD). https://nvd.nist.gov/.
[15] OpenWall. http://www.openwall.com/.
[16] Red Hat Bugzilla. https://bugzilla.redhat.com/.
[17] reportbug - Debian. https://wiki.debian.org/reportbug.
[18] Researcher posts Facebook bug report to Mark Zuckerberg's wall. https://www.cnet.com/news/researcher-posts-facebook-bug-report-to-mark-zuckerbergs-wall/.
[19] Sanitizers - GitHub. https://github.com/google/sanitizers.
[20] Security Tracker. https://securitytracker.com/.
[21] SecurityFocus. https://www.securityfocus.com/.
[22] Simon Tatham - How to Report Bugs Effectively. https://www.chiark.greenend.org.uk/~sgtatham/bugs.html.
[23] Vulnerabilities By Date. https://www.cvedetails.com/browse-by-date.php.
[24] WannaCry Ransomware Attack. https://en.wikipedia.org/wiki/WannaCry_ransomware_attack.
[25] Akritidis, P., Cadar, C., Raiciu, C., Costa, M., and Castro, M. Preventing memory error exploits with WIT. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (2008), SP'08.
[26] Aranda, J., and Venolia, G. The secret life of bugs: Going past the errors and omissions in software repositories. In Proceedings of the 31st International Conference on Software Engineering (2009), ICSE'09.
[27] Avgerinos, T., Cha, S. K., Hao, B. L. T., and Brumley, D. AEG: Automatic exploit generation. In Proceedings of the Network and Distributed System Security Symposium (2011), NDSS'11.
[28] Bettenburg, N., Just, S., Schröter, A., Weiss, C., Premraj, R., and Zimmermann, T. What makes a good bug report? In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (2008), SIGSOFT'08/FSE-16.
[29] Carlini, N., Barresi, A., Payer, M., Wagner, D., and Gross, T. R. Control-flow bending: On the effectiveness of control-flow integrity. In Proceedings of the 24th USENIX Conference on Security Symposium (2015), USENIX Security'15.
[30] Castro, M., Costa, M., and Harris, T. Securing software by enforcing data-flow integrity. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (2006), OSDI'06.
[31] Chaparro, O., Lu, J., Zampetti, F., Moreno, L., Di Penta, M., Marcus, A., Bavota, G., and Ng, V. Detecting missing information in bug descriptions. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (2017), ESEC/FSE'2017.
[32] Chen, S., Xu, J., Sezer, E. C., Gauriar, P., and Iyer, R. K. Non-control-data attacks are realistic threats. In Proceedings of the 14th Conference on USENIX Security Symposium (2005), USENIX Security'05.
[33] Cowan, C., Beattie, S., Johansen, J., and Wagle, P. PointGuard: Protecting pointers from buffer overflow vulnerabilities. In Proceedings of the 12th Conference on USENIX Security Symposium (2003), USENIX Security'03.
[34] Cowan, C., Pu, C., Maier, D., Hintony, H., Walpole, J., Bakke, P., Beattie, S., Grier, A., Wagle, P., and Zhang, Q. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proceedings of the 7th Conference on USENIX Security Symposium (1998), USENIX Security'98.
[35] Frei, S., May, M., Fiedler, U., and Plattner, B. Large-scale vulnerability analysis. In Proceedings of the 2006 SIGCOMM Workshop on Large-scale Attack Defense (2006), LSAD'06.
[36] Guo, P. J., Zimmermann, T., Nagappan, N., and Murphy, B. Characterizing and predicting which bugs get fixed: An empirical study of Microsoft Windows. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (2010), ICSE'10.
[37] Guo, P. J., Zimmermann, T., Nagappan, N., and Murphy, B. "Not my bug!" and other reasons for software bug report reassignments. In Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work (2011), CSCW'11.
[38] Hatmaker, T. Google's bug bounty program pays out $3 million, mostly for Android and Chrome exploits. TechCrunch, 2017. https://techcrunch.com/2017/01/31/googles-bug-bounty-2016/.
[39] Hu, H., Shinde, S., Adrian, S., Chua, Z. L., Saxena, P., and Liang, Z. Data-oriented programming: On the expressiveness of non-control data attacks. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (2016), SP'16.
[40] Jia, X., Zhang, C., Su, P., Yang, Y., Huang, H., and Feng, D. Towards efficient heap overflow discovery. In Proceedings of the 26th USENIX Conference on Security Symposium (2017), USENIX Security'17.
[41] Khandelwal, S. Samsung launches bug bounty program — offering up to $200,000 in rewards. TheHackerNews, 2017. https://thehackernews.com/2017/09/samsung-bug-bounty-program.html.
[42] Kwon, Y., Saltaformaggio, B., Kim, I. L., Lee, K. H., Zhang, X., and Xu, D. A2C: Self destructing exploit executions via input perturbation. In Proceedings of the Network and Distributed System Security Symposium (2017), NDSS'17.
[43] Li, F., and Paxson, V. A large-scale empirical study of security patches. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), CCS'17.
[44] Liao, X., Yuan, K., Wang, X., Li, Z., Xing, L., and Beyah, R. Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), CCS'16.
[45] Ma, W., Chen, L., Zhang, X., Zhou, Y., and Xu, B. How do developers fix cross-project correlated bugs? A case study on the GitHub scientific Python ecosystem. In Proceedings of the 39th International Conference on Software Engineering (2017), ICSE'17.
[46] Nagarakatte, S., Zhao, J., Martin, M. M., and Zdancewic, S. SoftBound: Highly compatible and complete spatial memory safety for C. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (2009), PLDI'09.
[47] Nappa, A., Johnson, R., Bilge, L., Caballero, J., and Dumitras, T. The attack of the clones: A study of the impact of shared code on vulnerability patching. In Proceedings of the 2015 IEEE Symposium on Security and Privacy (2015), SP'15.
[48] Newman, L. H. Equifax officially has no excuse. Wired, 2017. https://www.wired.com/story/equifax-breach-no-excuse/.
[49] Ozment, A., and Schechter, S. E. Milk or wine: Does software security improve with age? In Proceedings of the 15th Conference on USENIX Security Symposium (2006), USENIX Security'06.
[50] Park, J., Kim, M., Ray, B., and Bae, D.-H. An empirical study of supplementary bug fixes. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories (2012), MSR'12.
[51] Soto, M., Thung, F., Wong, C.-P., Le Goues, C., and Lo, D. A deeper look into bug fixes: Patterns, replacements, deletions, and additions. In Proceedings of the 13th International Conference on Mining Software Repositories (2016), MSR'16.
[52] Tan, L., Zhou, Y., and Padioleau, Y. aComment: Mining annotations from comments and code to detect interrupt related concurrency bugs. In Proceedings of the 33rd International Conference on Software Engineering (2011), ICSE'11.
[53] Votipka, D., Stevens, R., Redmiles, E., Hu, J., and Mazurek, M. Hackers vs. testers: A comparison of software vulnerability discovery processes. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (2018), SP'18.
[54] Warren, T. Microsoft will now pay up to $250,000 for Windows 10 security bugs. The Verge, 2017. https://www.theverge.com/2017/7/26/16044842/microsoft-windows-bug-bounty-security-flaws-bugs-250k.
[55] Xu, J., Mu, D., Chen, P., Xing, X., Wang, P., and Liu, P. CREDAL: Towards locating a memory corruption vulnerability with your core dump. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), CCS'16.
[56] Xu, J., Mu, D., Xing, X., Liu, P., Chen, P., and Mao, B. Postmortem program analysis with hardware-enhanced post-crash artifacts. In Proceedings of the 26th USENIX Conference on Security Symposium (2017), USENIX Security'17.
[57] Xu, W., and Fu, Y. Own your Android! Yet another universal root. In Proceedings of the 9th USENIX Conference on Offensive Technologies (2015), WOOT'15.
[58] Zhao, M., Grossklags, J., and Liu, P. An empirical study of web vulnerability discovery ecosystems. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (2015), CCS'15.
[59] Zhong, H., and Su, Z. An empirical study on real bug fixes. In Proceedings of the 37th International Conference on Software Engineering (2015), ICSE'15.
[60] Zhu, Z., and Dumitras, T. FeatureSmith: Automatically engineering features for malware detection by mining the security literature. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), CCS'16.
