Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI
Keheliya Gallaba (McGill University, Montréal, Canada), Christian Macho (University of Klagenfurt, Klagenfurt, Austria), Martin Pinzger (University of Klagenfurt, Klagenfurt, Austria), and Shane McIntosh (McGill University, Montréal, Canada)

ABSTRACT
Automated builds, which may pass or fail, provide feedback to a development team about changes to the codebase. A passing build indicates that the change compiles cleanly and tests (continue to) pass. A failing (a.k.a., broken) build indicates that there are issues that require attention. Without a closer analysis of the nature of build outcome data, practitioners and researchers are likely to make two critical assumptions: (1) build results are not noisy; however, passing builds may contain failing or skipped jobs that are actively or passively ignored; and (2) builds are equal; however, builds vary in terms of the number of jobs and configurations. To investigate the degree to which these assumptions about build breakage hold, we perform an empirical study of 3.7 million build jobs spanning 1,276 open source projects. We find that: (1) 12% of passing builds have an actively ignored failure; (2) 9% of builds have a misleading or incorrect outcome on average; and (3) at least 44% of the broken builds contain passing jobs, i.e., the breakage is local to a subset of build variants. Like other software archives, build data is noisy and complex. Analysis of build data requires nuance.

CCS CONCEPTS
• Software and its engineering → Software verification and validation; Software post-development issues.

KEYWORDS
Automated Builds, Build Breakage, Continuous Integration

ACM Reference Format:
Keheliya Gallaba, Christian Macho, Martin Pinzger, and Shane McIntosh. 2018. Noise and Heterogeneity in Historical Build Data: An Empirical Study of Travis CI. In Proceedings of the 2018 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE '18), September 3–7, 2018, Montpellier, France. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3238147.3238171

1 INTRODUCTION
After making source code changes, developers execute automated builds to check the impact on the software product. These builds are triggered while features are being developed, when changes have been submitted for peer review, and/or prior to integration into the software project's version control system.

Tools such as Travis CI facilitate the practice of Continuous Integration (CI), where code changes are downloaded regularly onto dedicated servers to be compiled and tested [1]. The popularity of development platforms such as GitHub and CI services such as Travis CI has made data about automated builds from a plethora of open source projects readily available for analysis.

Characterizing build outcome data will help software practitioners and researchers when building tools and proposing techniques to solve software engineering problems. For example, Rausch et al. [18] identified the most common breakage types in 14 Java applications, and Vassallo et al. [26] compared breakages from 349 open source Java projects to those of a financial organization. While these studies make important observations, understanding the nuances and complexities of build outcome data has not received sufficient attention from software engineering researchers. Early work by Zolfagharinia et al. [29] shows that build failures in the Perl project tend to be time- and platform-sensitive, suggesting that interpretation of build outcome data is not straightforward.

To support interpretation of build outcome data, in this paper, we set out to characterize build outcome data according to two harmful assumptions that one may make. To do so, we conduct an empirical study of 3,702,071 build results spanning 1,276 open source projects that use the Travis CI service.

Noise. First, one may assume that build outcomes are free of noise. However, we find that in practice, some builds that are marked as successful contain breakages that need attention yet are ignored. For example, developers may label platforms in their Travis CI configurations as allow_failure to enable experimentation with support for a new platform. The expectation is that once platform support has stabilized, developers will remove allow_failure; however, this is not always the case. For example, the zdavatz/spreadsheet project¹ has had the allow_failure feature enabled for the entire lifetime of the project (five years). Examples like this suggest that noise is likely present in build outcome data.

¹ https://github.com/zdavatz/spreadsheet
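To make the allow_failure mechanism concrete, the sketch below shows one way a miner might flag Travis CI specifications in which some jobs are allowed to fail, so that a build marked "passed" can still hide a failing job. This is a minimal sketch, assuming PyYAML is available; the example .travis.yml content and the helper name are illustrative only and are not drawn from the study's dataset or tooling.

    # Sketch: detect Travis CI configurations where some jobs are allowed
    # to fail, i.e., where a "passed" build may conceal a failing job.
    # The example .travis.yml content below is hypothetical.
    import yaml

    EXAMPLE_TRAVIS_YML = """
    language: ruby
    rvm:
      - "2.3"
      - "2.4"
    matrix:
      allow_failures:
        - rvm: "2.4"   # experimental platform; its failures do not break the build
    """

    def allowed_failure_entries(travis_yml_text):
        # Return the job selectors whose failures Travis CI will ignore.
        spec = yaml.safe_load(travis_yml_text) or {}
        matrix = spec.get("matrix") or spec.get("jobs") or {}
        if not isinstance(matrix, dict):
            return []
        return matrix.get("allow_failures") or []

    entries = allowed_failure_entries(EXAMPLE_TRAVIS_YML)
    print(len(entries), "job selector(s) may fail without breaking the build:", entries)

Applied across a project's history, a check like this makes it possible to separate genuinely clean passing builds from passing builds that contain actively ignored failures.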
There are also builds that are marked as broken that do not receive the immediate attention of the development team. It is unlikely that such broken builds are as distracting for development teams as one may assume. For example, we find that, on average, two in every three breakages are stale, i.e., they occur multiple times in a project's build history. To quantify the amount of noise in build outcome data, we propose an adapted signal-to-noise ratio.

Heterogeneity. Second, one may assume that builds are homogeneous. However, builds vary in terms of the number of executed jobs and the number of supported build-time configurations. For example, if the Travis CI configuration includes four Ruby versions and three Java versions to be tested, twelve jobs will be created per build because 4 × 3 combinations are possible. Zolfagharinia et al. [29] observed that automated builds for Perl package releases take place on a median of 22 environments and seven operating systems. Builds also vary in terms of the type of contributor. Indeed, build outcome and team response may differ depending on the role of the contributor (core or peripheral).
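As a rough illustration of how such a configuration matrix expands into jobs, the sketch below enumerates the Cartesian product of two hypothetical axes (Ruby and JDK versions). The version lists are made up for illustration; real Travis CI matrix expansion can also involve operating systems, environment variables, and explicit include/exclude rules.

    # Sketch: how a Travis CI-style build matrix expands into jobs.
    # Version lists are hypothetical; 4 Ruby versions x 3 JDKs -> 12 jobs.
    from itertools import product

    ruby_versions = ["2.2", "2.3", "2.4", "2.5"]            # 4 entries
    jdk_versions = ["openjdk7", "openjdk8", "oraclejdk8"]    # 3 entries

    jobs = [{"rvm": rvm, "jdk": jdk}
            for rvm, jdk in product(ruby_versions, jdk_versions)]
    print(len(jobs))  # 12 jobs per build, one per configuration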
In this paper, we study build heterogeneity according to matrix breakage purity, breakage reasons, and contributor type. We find that (1) environment-specific breakages are as common as environment-agnostic breakages; (2) the reasons for breakage vary and can be classified into five categories and 24 subcategories; and (3) broken builds that are caused by core contributors tend to be fixed sooner than those of peripheral contributors.

Take-away messages. Build outcome data is noisy and heterogeneous in practice. If build outcomes are treated as the ground truth, this noise will likely impact subsequent analyses. Therefore, researchers should filter out noise in build outcome data before conducting further analyses. Moreover, tool developers and researchers who develop and propose solutions based on build outcome data need to take the heterogeneity of builds into account.

In summary, this paper makes the following contributions:
• An empirical study of noise and heterogeneity of build breakage in a large sample of Travis CI builds.
• A replication package containing Travis CI specification files, metadata, build logs at the job level, and our data extraction.

2.1 Corpus of Candidate Systems
We conduct this study using openly available project metadata and build results of GitHub projects that use the Travis CI service to automate their builds. GitHub is the world's largest hosting service for open source software, with around 20 million users and 57 million repositories as of 2017. A recent analysis shows that Travis CI is the most popular CI service among projects on GitHub.

2.2 Retrieve Raw Data
We begin by retrieving the TravisTorrent dataset [3], which contains build outcome data from GitHub projects that use the Travis CI service. As of our retrieval, the TravisTorrent dataset contains data about 3,702,595 build jobs that belong to 680,209 builds spanning 1,283 GitHub projects. Those builds include one to 252 build jobs (median of three). In addition to build-related data, the TravisTorrent dataset contains details about the GitHub activity that triggered each build. For example, every build includes a commit hash (a reference to the build-triggering activity in its Git repository), the amount of churn in the revision, the number of modified files, and the programming language of the project. TravisTorrent also includes the number of executed and passed tests.

TravisTorrent alone does not satisfy all of the requirements of our analysis. Since TravisTorrent infers the build job outcome by parsing the raw log, it is unable to detect the outcome of 794,334 jobs (21.45%). Furthermore, TravisTorrent provides a single broken category, whereas Travis CI records build breakage in three different categories (see Subsection 2.4).

To satisfy our additional data requirements, we complement the TravisTorrent dataset by extracting additional data from the REST API that is provided by Travis CI. From the API, we collect the CI specification (i.e., the .travis.yml file) that Travis CI uses to create each build job, as well as the outcome of each build job. To enable further analysis of build breakages, we also download the plain-text logs of each build job in the TravisTorrent dataset.

2.3 Clean and Process Raw Data
Since we focus on build breakages, we filter away projects that do not have any broken builds.
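A minimal sketch of this filtering step is shown below, assuming the merged build data is available as a pandas data frame with one row per build. The column names, the toy data, and the set of breakage statuses are assumptions for illustration, not the exact TravisTorrent schema.

    # Sketch: keep only projects that have at least one broken build.
    # Column names and status labels are assumed for illustration.
    import pandas as pd

    builds = pd.DataFrame({
        "project": ["a/a", "a/a", "b/b", "b/b"],
        "status":  ["passed", "errored", "passed", "passed"],
    })

    BROKEN = {"errored", "failed", "canceled"}  # assumed breakage labels

    # True for projects with at least one broken build.
    has_breakage = builds.groupby("project")["status"].apply(
        lambda statuses: statuses.isin(BROKEN).any()
    )
    kept = builds[builds["project"].map(has_breakage)]
    print(kept["project"].unique())  # only projects with >= 1 broken build remain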