Build It Break It: Measuring and Comparing Development Security
Andrew Ruef, Michael Hicks, James Parker, Dave Levin, Atif Memon, Jandelyn Plane, and Piotr Mardziel
University of Maryland, College Park

Abstract

There is currently little evidence about what tools, methods, processes, and languages lead to secure software. We present the experimental design of the Build it Break it secure programming contest, which aims to provide such evidence. The contest also provides educational value to participants, who gain experience developing programs in an adversarial setting. We show preliminary results from previous runs of the contest that demonstrate the contest works as designed and provides the data desired. We are in the process of scaling the contest to collect larger data sets, with the goal of identifying statistically significant correlations between various factors of development and software security.

1 Introduction

Experts have long advocated that achieving security in a computer system requires careful design and implementation from the ground up [19]. Over the last decade, Microsoft [10, 11] and others [23, 15, 13] have described processes and methods for building secure software, and researchers and companies have, among other activities, developed software analysis and testing tools [8, 6, 14] to help developers find vulnerabilities. Despite these efforts, the situation seems to have worsened, not improved, with vulnerable software proliferating despite billions spent annually on security [3].

This disconnect between secure development innovation and marketplace adoption may be due to a lack of evidence. In particular, no one would disagree that certain technologies, such as type-safe programming languages and static analysis tools, and activities, such as code reviews and penetration testing, can improve software's security. The question is how effective these (and other) measures are, in terms of their costs vs. benefits. Companies would like to know: Which technologies or processes shake out the most (critical) bugs? Which are the easiest to use? Which have the least impact on other metrics of software quality, such as performance and maintainability? It is currently difficult to answer these questions because we lack the empirical evidence needed to do so.

In this paper we present Build-it, Break-it, Fix-it (BiBiFi), a new software security contest that is a fun experience for contestants as well as an experimental testbed by which researchers can gather empirical evidence about secure development. A BiBiFi contest has three phases. The first phase, Build-it, asks small development teams to build software according to a provided specification that includes security goals. The software is scored for being correct, fast, and feature-ful. The second phase, Break-it, asks teams to find security violations and other problems in other teams' build-it submissions. Identified problems, proved via test cases, benefit a Break-it team's score while penalizing the Build-it team's score. (A team's break-it and build-it scores are independent, with prizes for top scorers in each category.) The final phase, Fix-it, gives builders a chance to fix bugs and thereby possibly get some points back if the process discovers that distinct break-it test cases identify the same underlying software problem.
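To make the interplay between the phases concrete, here is a minimal sketch, assuming hypothetical point values and data structures (the paper does not give its scoring formula), of how accepted breaks might debit a builder and credit a breaker, and how Fix-it consolidation of duplicate breaks returns points to the builder.

```python
from collections import defaultdict

# Hypothetical point values; the real contest's scoring weights are not
# specified in this excerpt.
BREAK_REWARD = 50    # points a breaker gains per accepted break
BREAK_PENALTY = 50   # points a builder loses per accepted break

def score_breaks(accepted_breaks, fix_groups):
    """Illustrative scoring pass.

    accepted_breaks: list of (breaker_team, builder_team, break_id) tuples.
    fix_groups: dict mapping builder_team -> list of sets of break_ids,
        where each set holds breaks that a single fix showed to be the
        same underlying bug (the Fix-it consolidation step).
    """
    breaker_score = defaultdict(int)
    builder_penalty = defaultdict(int)

    # Break-it: every accepted break initially counts in full.
    for breaker, builder, _ in accepted_breaks:
        breaker_score[breaker] += BREAK_REWARD
        builder_penalty[builder] += BREAK_PENALTY

    # Fix-it: breaks shown to share one root cause collapse into a single
    # penalty, so the builder recovers the difference.
    for builder, groups in fix_groups.items():
        for group in groups:
            builder_penalty[builder] -= (len(group) - 1) * BREAK_PENALTY

    return breaker_score, builder_penalty

# Example: two breaks against "team_blue" that one fix showed were the same bug.
breaks = [("team_red", "team_blue", "t1"), ("team_red", "team_blue", "t2")]
rewards, penalties = score_breaks(breaks, {"team_blue": [{"t1", "t2"}]})
# rewards == {"team_red": 100}; penalties == {"team_blue": 50}
```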
We designed BiBiFi to give researchers in situ data on how teams develop code—for better or for worse—when security is a first-order design goal. As such, BiBiFi contestants are afforded a significant amount of freedom. Build-it teams can choose to build their software using whatever programming language(s), libraries, tools, or processes they want, and likewise Break-it teams can hunt for vulnerabilities as they like, using manual code reviews, analysis or testing tools, reverse engineering, and so on. During the contest, we collect data on each team's activities. For example, we keep track of commit logs, testing outcomes, and interactions with the contest site. Prior to and after the contest, we survey the participants to record their background, mindset, choices, and activities.

By correlating how teams worked with how well they performed in the competition, researchers can learn about what practices do and do not work. Treating the contest as a scientific experiment, the contest problem is the control, the final score is the dependent variable, and the various choices made by each team, which we observe, are the independent variables. To permit repeatable, statistically significant results, we have designed BiBiFi and implemented its infrastructure to scale to multiple competitions with thousands of competitors.

Our hope is to use BiBiFi—and for other researchers to apply it as well—to empirically learn what factors most strongly influence security outcomes. Does better security correlate with the choice of programming language? The use of development or analysis tools? The use of code reviews? Participants' background (e.g., certification) and education (e.g., particular majors or courses)? The choice of a particular build process, or perhaps merely the size of the team itself? As educators, we are particularly interested in applying answers to questions like these to better prepare students to write responsible, secure code.
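As a hedged illustration of the kind of analysis this framing enables (the authors do not report such an analysis, and every column name below is a hypothetical stand-in for the surveyed and logged variables), one could tabulate each build-it team's final score against the observed factors and fit a simple regression, with the score as the dependent variable and the team's choices as independent variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-team records; in practice these would come from the
# contest database (surveys, commit logs, submission metadata).
teams = pd.DataFrame([
    {"score": 700, "language": "C",      "team_size": 3, "code_review": 1, "static_analysis": 0},
    {"score": 560, "language": "C",      "team_size": 4, "code_review": 0, "static_analysis": 0},
    {"score": 640, "language": "C",      "team_size": 2, "code_review": 0, "static_analysis": 1},
    {"score": 910, "language": "Python", "team_size": 3, "code_review": 1, "static_analysis": 1},
    {"score": 760, "language": "Python", "team_size": 1, "code_review": 0, "static_analysis": 0},
    {"score": 845, "language": "Python", "team_size": 4, "code_review": 1, "static_analysis": 0},
    {"score": 830, "language": "Java",   "team_size": 5, "code_review": 1, "static_analysis": 1},
    {"score": 705, "language": "Java",   "team_size": 2, "code_review": 0, "static_analysis": 0},
])

# Final score as the dependent variable; observed team choices as the
# independent variables (the contest problem itself is held fixed).
model = smf.ols(
    "score ~ C(language) + team_size + code_review + static_analysis",
    data=teams,
).fit()
print(model.summary())
```

With more teams and more runs of the contest, the same tabulation would support the repeatable, statistically significant comparisons the design aims for.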
This paper makes two main contributions. First, it describes the design of BiBiFi (Section 2) and the contest infrastructure (Section 3), focusing on how both aim to ensure that high scores in the contest correlate with meaningful, real-world outcomes. Second, it describes our first run of the contest, held during the Fall of 2014 (Section 4). We describe the contest problem, the rationale for its design, and details of how the contest proceeded and concluded. While participation in this contest was modest, and thus we cannot view the data we gathered as significant, the outcomes were interesting and taught us important lessons that can be applied to future runs.

To the best of our knowledge, BiBiFi is the first public, large-scale contest whose focus is on building secure systems. Our contest takes inspiration from popular capture-the-flag (CTF) cybersecurity contests [5, 7, 16] and from programming contests [21, 2, 12], but is unique in that it combines ideas from both in a novel way. Although no contest can truly capture the many difficulties of professional development, we hope it will provide some general insight nevertheless, and thus provide some sorely needed evidence about what works for building secure software.

2 Build-it, Break-it, Fix-it: Design

This section summarizes the BiBiFi contest goals and how the contest's design satisfies those goals.

2.1 Goals, incentives, and disincentives

Our most basic goals are that the winners of the Build-it phase truly produced the highest quality software, and that the winners of the Break-it phase performed the most thorough, creative analysis of others' code. We break down these high-level goals into a set of 12 concrete requirements that guide our design.

The winning Build-it team would ideally develop software that is (1) correct, (2) secure, (3) efficient, and (4) maintainable. While the quality of primary interest to our contest is security, the other aspects of quality cannot be ignored. It is easy to produce secure software that is incorrect (have it do nothing!). Ignoring performance is also unreasonable, as performance concerns are often cited as reasons not to deploy defense mechanisms [1, 22]. Maintainability is also important—spaghetti code might be harder for attackers to reverse engineer, but it would also be harder for engineers to maintain. So we would like to encourage build-it teams to focus on all four aspects of quality.

The winning Break-it team would ideally (5) find as many defects as possible in the submitted software, thus giving greater confidence to our assessment that a particular contestant's software is more secure than another's. To maximize coverage, each break-it team should have incentive to (6) look carefully at multiple submissions, rather than target a few submissions. Moreover, teams should be given incentive to, in some cases, (7) explore code in depth, to find tricky defects and not just obvious ones. We would like break-it teams to (8) produce useful bug reports that can be understood and acted on to produce a fix. Finally, we would like to (9) discourage superficially different bug reports that use different inputs to identify the same underlying flaw.

We also have several general goals. First, we want to (10) scale up the contest as much as possible, allowing for up to several hundred participating teams. The more teams we have, the greater our impact from both a teaching and research point of view: on the teaching side, more students experience the desired learning outcomes, and on the research side, we have a greater corpus of data to draw on. To support scalability, the contest design should (10a) encourage greater participation,

We randomize each break-it team's view of the build-it teams' submissions, but organize them by metadata like programming language. When they think they have found a defect, they submit a test case. BiBiFi's infrastructure accepts this as a defect if it passes our reference implementation but fails on the buggy implementation. Any failing test is potentially worth points; we take the view that a correctness bug is potentially a security bug waiting to be exploited. That said, additional points are granted for an actual exploit or for demonstrated violations of security goals; for the latter we require special test case formats (see Section 4). To encourage coverage, a break-it team may submit up to ten passing test cases per build-it submission.
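The acceptance rule in the last paragraph is essentially differential testing against the reference implementation. Below is a minimal sketch of that decision, assuming a hypothetical command-line harness, an exit-code pass criterion, and a simple cap-enforcement helper; none of this is the contest's actual infrastructure code.

```python
import subprocess

# The text caps break-it teams at ten test cases per build-it submission;
# exact enforcement is not described, so this constant is a simplification.
MAX_TESTS_PER_TARGET = 10

def run_test(implementation_cmd, test_input):
    """Run one test case against an implementation (reference or target).

    The command-line interface and the exit-code pass criterion are
    assumptions; the paper does not specify the test harness.
    """
    try:
        result = subprocess.run(
            implementation_cmd,
            input=test_input,
            capture_output=True,
            text=True,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hung implementation as failing the test
    return result.returncode == 0

def accept_as_defect(reference_cmd, target_cmd, test_input):
    """Accept a submitted test case as a defect only if the reference
    implementation passes it while the targeted submission fails it."""
    return run_test(reference_cmd, test_input) and not run_test(target_cmd, test_input)

def judge_breaks(reference_cmd, target_cmd, submitted_tests):
    """Judge one break-it team's test cases against one build-it submission,
    stopping once the per-submission cap is reached."""
    accepted = []
    for test_input in submitted_tests:
        if len(accepted) >= MAX_TESTS_PER_TARGET:
            break
        if accept_as_defect(reference_cmd, target_cmd, test_input):
            accepted.append(test_input)
    return accepted
```

Accepted test cases would then feed the scoring described earlier, with additional credit awarded when a test case demonstrates an actual exploit or a violation of the stated security goals.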